Cache based tokenization for the server input prompts #12067

Open

wants to merge 2 commits into master

Conversation

vnicolici

This modifies handle_completions_impl in examples/server/server.cpp to improve performance when the vocabulary-based tokenization of the prompt differs slightly from the generated tokens stored in the slot caches. This happens frequently during chat sessions, when generated tokens are converted to text and that text is converted back to tokens, causing mismatches and unnecessary reprocessing of parts of the prompt.

The issue where this is discussed: #11970

While I regard this mostly as a proof of concept and I'm not 100% confident it won't cause other issues, I did test it during actual chat sessions for a few days and it worked as expected. I also ran the full CI and it passed.

@vnicolici vnicolici requested a review from ngxson as a code owner February 25, 2025 13:08
@ngxson ngxson removed their request for review February 25, 2025 13:29
@ExtReMLapin
Contributor

ExtReMLapin commented Feb 26, 2025

Confused a bit by the title and description, is it just about caching tokenization or bug fixing?

Does tokenization REALLY need caching?

@ngxson ngxson added the demo Demonstrate some concept or idea, not intended to be merged label Feb 26, 2025
@vnicolici
Author

@ExtReMLapin

This code makes the prompt tokenization match the existing token sequences from the cache as much as possible during chat sessions. Without it, under certain conditions (mismatches between the tokenization during generation and the retokenization at the next prompt processing step in the chat), the parts of the KV cache following the mismatch are discarded.

With this change, the parts of the KV cache that would otherwise be discarded due to those mismatches can still be used, because my code eliminates the mismatches: the prompt tokenization now matches the cache exactly, as long as the text of the prompt matches the text corresponding to the tokens in the cache. As a result, in such situations my code reduces the time to the first generated token during chats.

Whether this performance issue is a bug or not is still being debated under #11970. Initially it looked more like a bug, but after further discussion it looks more like an optimization issue.

As to the other question, whether this patch is REALLY needed or not, the short answer is "I'm not sure." It helps me under some circumstances, but I'm not sure how useful it will be in general. The long answer follows.

Unfortunately, all this is not easy to explain clearly in just a few words, at least not for me, as it's a very complex issue, so I'll have to give some examples to make it clearer.

Let's say you start a chat session: a system prompt and a user prompt are sent to the server, then the server sends an assistant reply to the client.

So, the initial prompt sent to the server is something like:

<system_prompt> <user_prompt_A>

The server tokenizes this prompt using the vocabulary, and it becomes something like this:

[SP1,SP2,...,SPN,  UPA1,UPA2,...,UPAN]

The server processes this sequence of tokens and then generates the tokens for the assistant reply. The cached sequence of tokens then becomes:

[SP1,SP2,...,SPN,  UPA1,UPA2,...,UPAN,  ARA1,ARA2,...,ARAN]

During chats, however, the client doesn't have access to these token IDs. The client receives just the text representation of the assistant answer, without the list of token IDs.

So, based on the previous prompt, the received assistant reply and the next user input in the chat, the client constructs a new prompt that looks like this:

<system_prompt> <user_prompt_A> <assistant_reply_A> <user_prompt_B>

Now, once the server receives this, it tokenizes it again, and produces something like:

[SP1,SP2,...,SPN,  UPA1,UPA2,...,UPAN,  ARA1,ARA2,...,ARAN,  UPB1,UPB2,...,UPBN]

Normally, at this point, the first 3 subsequences, corresponding to the previous interaction in the chat, match the cache exactly even with the existing unpatched code, so only the uncached part of the token sequence (UPB1,UPB2,...,UPBN), representing the second user prompt, needs to be processed before the server starts generating the second assistant reply.

But unfortunately that's not always the case, because sometimes, when the previous assistant answer - <assistant_reply_A> - is retokenized, the resulting sequence of tokens differs from the sequence of tokens the assistant originally generated (even though the text corresponding to both sequences is the same).

For example, instead of the sequence ARA1,ARA2,...,ARAN that sits at the end of the cache in this example, the retokenization could produce a single longer token in place of ARA1,ARA2; let's call that token ARAL. Both [ARA1, ARA2] and [ARAL] produce the same text.
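
To make the round trip concrete, here is a minimal sketch (not code from this PR) of checking whether a generated token sequence survives being converted to text and back; `tokenize` and `detokenize` are placeholders for the server's vocabulary helpers, not actual llama.cpp API names:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using token_id = int32_t;
using tokens   = std::vector<token_id>;

// Returns true if converting the generated tokens to text and tokenizing that
// text again yields the exact same token IDs; false means the KV cache will
// mismatch on the next prompt (e.g. [ARA1, ARA2] coming back as [ARAL]).
bool retokenizes_identically(
        const tokens & generated,
        const std::function<std::string(const tokens &)> & detokenize,
        const std::function<tokens(const std::string &)>  & tokenize) {
    const std::string text  = detokenize(generated);
    const tokens round_trip = tokenize(text);
    return round_trip == generated;
}
```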

So, when this happens, the cache looks like this:

[SP1,SP2,...,SPN,  UPA1,UPA2,...,UPAN,  ARA1,ARA2,...,ARAN]

And the tokenization of the prompt sent by the client looks like this:

[SP1,SP2,...,SPN,  UPA1,UPA2,...,UPAN,  ARAL,...,ARAN,  UPB1,UPB2,...,UPBN]

When deciding what to discard from the cache and what to reprocess as a prompt, the current code compares these two sequences of tokens and finds the first mismatch, ARA1 != ARAL. As a result, the cache is truncated at that position, so it becomes:

[SP1,SP2,...,SPN,  UPA1,UPA2,...,UPAN]

This means that, in this situation, instead of having to process just the new user input (UPB1,UPB2,...,UPBN) as an additional prompt, the system also has to fully reprocess the previous assistant-generated text as a prompt (the new retokenized ARAL,...,ARAN sequence).
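
For illustration, this is roughly the comparison the unpatched server performs (a simplified sketch of the longest-common-prefix idea, not the actual slot-reuse code in server.cpp):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using token_id = int32_t;

// Length of the shared token-ID prefix between the cached sequence and the
// newly tokenized prompt; everything past this point is discarded from the
// cache and re-processed, so a single ARA1 != ARAL mismatch loses the rest.
size_t common_token_prefix(const std::vector<token_id> & cached,
                           const std::vector<token_id> & prompt) {
    const size_t limit = std::min(cached.size(), prompt.size());
    size_t n = 0;
    while (n < limit && cached[n] == prompt[n]) {
        ++n;
    }
    return n;
}
```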

What my code does is change the way the prompt tokenization works: instead of tokenizing the previous assistant output as ARAL,...,ARAN using the vocabulary, it tokenizes it like the corresponding sequence currently in the cache (ARA1,ARA2,...,ARAN). As a result, there are no mismatches between the tokenized prompt and the cache, and the part of the cache corresponding to ARA1,ARA2,...,ARAN is no longer discarded.
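
As a rough sketch of the idea (not the actual handle_completions_impl change), the alignment can be thought of like this; `detokenize_one` and `tokenize` are hypothetical helpers standing in for the server's tokenizer calls:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using token_id = int32_t;
using tokens   = std::vector<token_id>;

// Reuse cached token IDs for as long as the text they decode to matches the
// front of the prompt, then tokenize only the remaining (new) text with the
// vocabulary. This keeps ARA1,ARA2,...,ARAN intact; only UPB1..UPBN is new.
tokens cache_aligned_tokenize(
        const std::string & prompt_text,
        const tokens      & cached,
        const std::function<std::string(token_id)>        & detokenize_one,
        const std::function<tokens(const std::string &)>  & tokenize) {
    tokens out;
    size_t pos = 0;  // how much of the prompt text is already covered by cached tokens
    for (token_id t : cached) {
        const std::string piece = detokenize_one(t);
        if (prompt_text.compare(pos, piece.size(), piece) != 0) {
            break;   // text diverges here, stop reusing the cache
        }
        out.push_back(t);
        pos += piece.size();
    }
    const tokens tail = tokenize(prompt_text.substr(pos));  // only the genuinely new text
    out.insert(out.end(), tail.begin(), tail.end());
    return out;
}
```

This glosses over details the real code has to handle (special tokens, partial UTF-8 sequences, multiple slots), but it captures why the prompt tokenization ends up matching the cache exactly whenever the texts match.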

As to the actual performance impact of not doing this, it depends on a lot of factors, so it's hard to estimate the benefit of this patch. It can vary between making chats up to 50% faster and having no influence at all.

Let's say the assistant generated its previous response in 10 minutes, and that, due to a tokenization mismatch in the middle of that response, half of the cache corresponding to it has to be discarded when the next user chat prompt is received. This means you lose the cache corresponding to 5 minutes of generating the assistant reply. But since prompt processing is usually about 10 times faster than inferring new tokens, the actual time lost is only 30 seconds.

This means that, in this situation, the user has to wait an additional 30 seconds before new tokens start to be generated.

For simplicity, let's assume that in this example all 4 sequences (the system prompt, the first user prompt, the first assistant reply and the second user prompt) have the same length in tokens. This means the system prompt and the first user prompt took 60s+60s to process on the server. Then the server generated the first assistant reply in 10 minutes, so we are now at 12 minutes. The user enters the new chat prompt, which takes another 60 seconds to process. We are now at 13 minutes, before the assistant starts to generate the second reply to the second user prompt.

Without my patch, there are those additional 30 seconds mentioned earlier, so it takes 13.5 minutes before the second assistant reply starts to be generated. Not that bad after all, just about 4% slower overall. Still, from the user's point of view, having to wait 30 seconds before the system starts generating a new assistant reply is not ideal.
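
The same arithmetic, restated in code form (all numbers are the illustrative assumptions from the example above, not measurements):

```cpp
// Prompt processing assumed ~10x faster than generation, all segments equal length.
constexpr double t_system_prompt  = 60.0;   // s
constexpr double t_user_prompt_a  = 60.0;   // s
constexpr double t_generate_a     = 600.0;  // s (10 minutes)
constexpr double t_user_prompt_b  = 60.0;   // s
constexpr double t_lost_reprocess = 30.0;   // s, half the assistant reply reprocessed as prompt

constexpr double t_with_patch    = t_system_prompt + t_user_prompt_a + t_generate_a + t_user_prompt_b; // 780 s = 13 min
constexpr double t_without_patch = t_with_patch + t_lost_reprocess;                                    // 810 s = 13.5 min
constexpr double slowdown        = t_lost_reprocess / t_with_patch;                                    // ~0.038, i.e. ~4%
```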

But it can be worse. If you don't have enough memory and some disk paging/swapping takes place, inference is usually not negatively affected, but prompt processing can become 10 times slower than usual, making it just as slow as inference when the cache is not used. So, instead of waiting just 30 seconds for the lost assistant tokens to be reprocessed and re-cached, you might wait 5 minutes: in this example, instead of the second assistant reply starting after 13 minutes (with my patch), it only starts generating after 18 minutes. Basically, the user stares at the chat window for 5 minutes before the server starts generating the new reply to their input. This is not theoretical, I really did experience this. That's when I started digging into this, trying to get rid of these long pauses during chats.

This becomes even more complicated when the DeepSeek model is involved. That's because, by default (as recommended by DeepSeek), clients do not send the thinking back to the server as part of the prompt. And since the thinking sits between the previous user input and the assistant reply, this means that, by default, the entire assistant reply (except the thinking) has to be reprocessed anyway as a prompt, since there will be a mismatch between the cache, which contains the thinking, and the prompt, which doesn't.

So in this case, with this thinking-removal default, my patch doesn't help at all, as the entire assistant reply to the previous user input is discarded anyway upon the next user input, regardless of how that reply is tokenized. However, since I personally choose not to use this recommended DeepSeek default and don't discard the thinking from the prompts, in my particular case the patch still helps.

Now, it gets even more complicated, as some models are more likely than others to generate tokens that do not retokenize the same way when converted to text and back to tokens. The DeepSeek models are the worst offenders here, generating during inference sequences of multiple shorter tokens instead of the single larger token that produces the same text. But since, with the default of removing the thinking, this doesn't matter anyway (you lose performance regardless of tokenization when the thinking is removed), it can be argued that it's not such a big issue and this patch is not REALLY needed.

However, DeepSeek models are not the only ones affected. Non-thinking models normally have nothing discarded, so the entire cache can be used at each step of the chat. But if you play with the parameters controlling creativity (temperature, top_k, top_p, min_p) and significantly increase creativity, even non-thinking models can start to generate sequences of tokens that do not retokenize the same way when converted to text and back, so the issue can still be reproduced.

But, again, it can be argued that few people change the creativity defaults by large amounts, so this is not such a big issue. That's why I'm still not sure if this is REALLY needed. If I were sure, I would have pushed more strongly for this to be included in the project.

I'll end with the stats from my latest test (without my patch), running the lmstudio-community/DeepSeek-R1-GGUF:Q8_0 model. I just bought a 768 GB RAM machine to be able to run this model. I had a chat session (with the option to discard the thinking disabled), and the prompt tokenization vs. cache mismatches still caused reprocessing:

- Actual required processing time: 4 + 775 + 1 + 242 = 1022 seconds.
- Additional pause in useful processing caused by the cache mismatch: 75 seconds.
- Overall, without my patch, the interaction was 7.3% longer than it should have been.
- Looking at just the last server reply, the one affected by the issue, the stats are even worse: it took 33% more time than it would have with my patch.

Basically, I sent the server two messages during a chat session and received two replies. My initial prompt was 16 tokens long and took 4 seconds to process. The server produced an 844-token reply in 775 seconds. Then, upon my sending additional input in the chat, which was only about 5 tokens long, the tokenization mismatch made the server discard and reprocess an additional 788 tokens of its previous output as a prompt, which took 76 seconds. With my patch, I wouldn't have had to wait for those 76 seconds; the second response from the server would have started instantly. After those 76 seconds of pause, it generated an additional 242 tokens in 229 seconds.

Sorry if I was slightly incoherent and rambling in this reply, but I was very busy at work and didn't sleep for over 34 hours so far. That being said, I'm not sure if I could have made this reply much better even with proper sleep.

@ExtReMLapin
Contributor

Thanks for the very long answer, I probably didn't deserve it.

> I was very busy at work and didn't sleep for over 34 hours so far.

It's probably not worth it, you have one life, your company has multiple employees, they're not worth it.

Labels: demo (Demonstrate some concept or idea, not intended to be merged), examples, server