Support for KV caching and batched inference #1934

mseeger · 2025-02-06T09:37:15Z

Adds abstraction for key-value caches, implements batched inference.

I am also adding two baseline KV caches, the default one from before (all KV are stored) and a last-recent one.

The abstraction contains methods not used by these baselines, but they are required to implement more advanced KV caches such as Heavy Hitter Oracle (H2O).
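To make the two baselines concrete, here is a minimal sketch, assuming a very reduced per-layer interface. The names `KVCacheSketch`, `DenseCache`, `LastRecentCache`, `append`, and `contents` are hypothetical and are not the PR's actual API; keys and values are stand-in strings rather than tensors.

```python
from abc import ABC, abstractmethod
from collections import deque


class KVCacheSketch(ABC):
    """Hypothetical per-layer cache interface, for illustration only."""

    @abstractmethod
    def append(self, key, value):
        """Store the key and value for the next token position."""

    @abstractmethod
    def contents(self):
        """Return (keys, values) currently held, oldest first."""


class DenseCache(KVCacheSketch):
    """Keeps every key/value pair it has seen (the previous default behavior)."""

    def __init__(self):
        self._keys, self._values = [], []

    def append(self, key, value):
        self._keys.append(key)
        self._values.append(value)

    def contents(self):
        return list(self._keys), list(self._values)


class LastRecentCache(KVCacheSketch):
    """Keeps only the most recent `cache_length` pairs, evicting the oldest."""

    def __init__(self, cache_length):
        # A deque with maxlen drops the oldest entry once it is full.
        self._keys = deque(maxlen=cache_length)
        self._values = deque(maxlen=cache_length)

    def append(self, key, value):
        self._keys.append(key)
        self._values.append(value)

    def contents(self):
        return list(self._keys), list(self._values)


dense, recent = DenseCache(), LastRecentCache(cache_length=3)
for pos in range(5):
    dense.append(f"k{pos}", f"v{pos}")
    recent.append(f"k{pos}", f"v{pos}")
```

After five appends, the dense cache holds all five positions, while the last-recent cache holds only positions 2 to 4.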

I have implemented some of these, but I may not be allowed to contribute them here (working for a company). I'll see what I can do.

t-vi · 2025-02-06T10:47:06Z

Hey, great work @mseeger .

Can we decouple things a lot, though?

Some initial thoughts:

I would prefer if we kept the KVCache initialization as in the current version (i.e. that you initialize the model, then potentially adjust the max seq len and then initialize the KVCache) in this PR. Adding this to the init parameters seems orthogonal to the other changes.
We do have batched generation today. Can we please split the changes to batched generation from the KVCache improvements? We probably don't want to do batching via lists of tensors. I'm currently looking at passing in "packed" input/input_pos sequences but these changes. Changing the existing tests should be a bit of a red flag, as changing the API will break things for existing users (we can do this if we need to, but TBH I am not convinced this is the case).
In general, can we be very conservative with adding arguments? For optional arguments, we should look into making them keyword-only unless there is a good reason not to.
We do keep control flow simple. self._default_kv_cache = False is not a good idea.
I'm not sure I understand the both_in_parallel. Maybe the right time to add it and the associated refactors is when they are used?
I'm generally a bit wary of the number of data structures and cases that are being passed around here; those add a lot of complexity. To my mind, this likely means that the right abstraction has not yet been found. Maybe integrating KVCache and SDPA more could be a thing, but I am not sure.
In general, we do not want to do the cache setup during the forward. Please keep the initialization separate. I think we are rather seeing movement towards less of a distinction between pre-fill and next token, so this seems a bit in the wrong direction.
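As an illustration of the keyword-only suggestion above, a minimal sketch: the function `generate` and its parameters are hypothetical stand-ins, not the project's actual signature. The bare `*` forces every optional argument after it to be passed by name.

```python
def generate(prompt, max_new_tokens, *, temperature=1.0, top_k=None):
    """Toy stand-in for a generation function; it only echoes its settings."""
    return {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "top_k": top_k,
    }


# Optional arguments must be named at the call site.
settings = generate("Hello", 8, temperature=0.7)

# Passing an optional argument positionally is rejected by Python itself.
try:
    generate("Hello", 8, 0.7)
except TypeError as err:
    positional_error = str(err)
```

New optional arguments can then be added later without silently changing the meaning of existing positional calls.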

Again, super good stuff in the PR! I think there are a few things to split out and consider individually and then maybe we can have a video call about the core KVCache things, wdyt?

Thanks for the initiative for better KV caching!

mseeger · 2025-02-06T15:02:53Z

Hello, sure we can have a call. I am in the Central European (Germany) time zone.

mseeger · 2025-02-06T15:05:33Z

My impression was that batched generation is not really there. But if it is, I am not asking to change it.

One thing is important though. KV caches really work by filling positions sequentially. So, if you have filled positions 0:(T-1), you need to continue with T, or with T:(T+k). The current API of just passing some position indexes is really not going to work.
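The sequential constraint can be sketched as a cache that rejects any write that does not start at its next free position. This is a minimal sketch under assumptions: the class `SequentialCache` and its `extend` method are hypothetical, and strings stand in for key and value tensors.

```python
class SequentialCache:
    """Toy cache that only accepts writes at the next unfilled position."""

    def __init__(self, max_length):
        self.max_length = max_length
        self.keys = []
        self.values = []

    @property
    def next_position(self):
        # Positions 0..next_position-1 are already filled.
        return len(self.keys)

    def extend(self, start, keys, values):
        if len(keys) != len(values):
            raise ValueError("keys and values must have the same length")
        if start != self.next_position:
            raise ValueError(f"expected start={self.next_position}, got {start}")
        if start + len(keys) > self.max_length:
            raise ValueError("cache is full")
        self.keys.extend(keys)
        self.values.extend(values)


cache = SequentialCache(max_length=8)
cache.extend(0, ["k0", "k1", "k2", "k3"], ["v0", "v1", "v2", "v3"])  # fills 0..3
cache.extend(4, ["k4", "k5"], ["v4", "v5"])  # continues at 4, fills 4..5

try:
    cache.extend(7, ["k7"], ["v7"])  # would skip position 6
    gap_rejected = False
except ValueError:
    gap_rejected = True
```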

mseeger · 2025-02-06T15:08:45Z

Also, the implementation right now allows you to send in KV cache objects from the start. If you do not do that, it will create them by default. This is done by set_kv_cache. If you also do not do that, it is done in the first forward with for_prefill=True.

Note that prefill here means that I can do a single pass, and the cache can take it all, without having to evict anything. It does not mean that this will encode even the shortest prompt in the batch. If prompts are longer than the max prefill length, you need to do it sequentially in chunks.
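The chunking described here can be sketched for a single sequence as follows. The helper `plan_chunks` and its parameter names are hypothetical; it only computes position ranges and does not run a model.

```python
def plan_chunks(prompt_length, max_prefill_length, chunk_size):
    """Split a prompt into one prefill range plus sequential follow-up ranges.

    Returns a list of half-open (start, end) position ranges.
    """
    if prompt_length <= 0 or max_prefill_length <= 0 or chunk_size <= 0:
        raise ValueError("all lengths must be positive")
    # The prefill takes as much as the cache can hold without evicting.
    prefill_end = min(prompt_length, max_prefill_length)
    chunks = [(0, prefill_end)]
    # The remainder is processed sequentially, chunk by chunk.
    start = prefill_end
    while start < prompt_length:
        end = min(start + chunk_size, prompt_length)
        chunks.append((start, end))
        start = end
    return chunks


long_prompt = plan_chunks(prompt_length=10, max_prefill_length=4, chunk_size=3)
short_prompt = plan_chunks(prompt_length=3, max_prefill_length=4, chunk_size=3)
```

A prompt that fits into the prefill yields a single range; a longer one yields the prefill range followed by contiguous chunks.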

Maybe there is an easier way, we can discuss.

mseeger · 2025-02-06T15:10:01Z

It is annoying I cannot show you the KV cache code I have. But in a talk, I could explain why a few things are the way they are. Of course, I am not on top of other constraints you guys have.

mseeger · 2025-02-06T15:25:30Z

You may ask why KVCache.prefill? The main reason is that you want to use SDPA whenever you can, but SDPA cannot return the attention weights, which some KV cache algorithms (H2O) need in order to decide what to evict next.
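To illustrate what is lost with the fused path: PyTorch's `torch.nn.functional.scaled_dot_product_attention` returns only the attention output, whereas an explicit computation exposes the weights as well. Below is a minimal pure-Python sketch for a single query vector; the function `attention_with_weights` is hypothetical and is not code from the PR.

```python
import math


def attention_with_weights(query, keys, values):
    """Explicit softmax attention for one query; returns (output, weights)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Subtract the maximum score for numerical stability before exponentiating.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * value[j] for w, value in zip(weights, values)) for j in range(dim)]
    return output, weights


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention_with_weights(query, keys, values)
```

A cache policy that scores positions by attention mass would consume `weights`; a fused kernel provides only `output`.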

We can do things so the very first call to the model, with input_pos=0, is doing this. So, instead of

model(x, for_prefill=True)

you'd call

model(x, input_pos=0)

This I could do. That would indeed be a little simpler.

mseeger · 2025-02-06T19:58:33Z

@t-vi Let me know what the next steps here should be. If I understand correctly, I could:

Get rid of for_prefill parameter, and use input_pos=0 instead
Don't create default KV cache in forward and rather fail the call if input_pos is used, s.t. user needs to call set_kv_cache
You don't seem to approve of passing the KV caches at construction (if user does not want to use default ones). Would you rather use set_kv_cache for that?

t-vi · 2025-02-09T09:50:37Z

Hi, so I think we should try to break things down.

We could either start with the core caching itself and try to see how to integrate it with minimal changes or see what is the deal with batching and prefill first.
I sent an email to your Gmail address to find a good time to discuss.

mseeger · 2025-02-21T07:51:53Z

Hello @t-vi , let me try to break things down. Changes are these:

1. KVCache and its implementations. This replaces the default cache, which just stores everything. No behavior changes.
2. Caches for each layer can be passed when the model is created. Before, there was set_kv_cache, which creates the default caches. If nothing is done at all, default caches are created when first needed. This is a change: before, it would raise an exception.
3. Refactoring of generation code: this works for batch generation now, and single sequence generation is a special case. Inside, this properly supports large prompts by splitting generation into prefill (as large as caches allow), and then sequential blocks of desired length.

mseeger · 2025-02-21T07:54:18Z

If I understand you correctly, you complain about 2., especially the automatic creation of default caches when nothing is done, and the change of __init__ of GPT. This, I can work on. I could do the following:

Allow passing KV caches per layer in set_kv_cache (or have another method?)
Create default KV caches by calling set_kv_cache. If this is not done, calling forward for inference fails, so no cache is created automatically

Would that be what you prefer?
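The proposal above can be sketched as follows. This is a toy under assumptions: `ToyGPT` is hypothetical, its `set_kv_cache` signature is invented for illustration and does not match the library's existing method, and plain lists stand in for per-layer caches.

```python
class ToyGPT:
    """Toy model: caches are set explicitly, never created inside forward."""

    def __init__(self, n_layer):
        self.n_layer = n_layer
        self.kv_caches = None  # nothing exists until set_kv_cache is called

    def set_kv_cache(self, kv_caches=None):
        # Without arguments, create default caches; otherwise use the given ones.
        if kv_caches is None:
            kv_caches = [[] for _ in range(self.n_layer)]
        if len(kv_caches) != self.n_layer:
            raise ValueError("need exactly one cache per layer")
        self.kv_caches = list(kv_caches)

    def forward(self, tokens, input_pos=None):
        if input_pos is None:
            return len(tokens)  # no caching requested
        if self.kv_caches is None:
            raise RuntimeError("call set_kv_cache() before inference with input_pos")
        for cache in self.kv_caches:
            cache.extend(tokens)
        return len(self.kv_caches[0])


model = ToyGPT(n_layer=2)
try:
    model.forward([1, 2, 3], input_pos=0)
    failed_without_cache = False
except RuntimeError:
    failed_without_cache = True

model.set_kv_cache()  # default caches
cached = model.forward([1, 2, 3], input_pos=0)

custom = ToyGPT(n_layer=2)
custom.set_kv_cache([[], []])  # user-supplied caches, one per layer
```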

mseeger · 2025-02-21T07:56:07Z

As for 1. and 3., in the end, they go together, but I can try to split them into two. I'd first do 1., keeping the generation code in place, which would however not work for batches and not support the sequential processing of prompts properly.

First doing 3. is not really sensible, because it requires things from 1.

What do you think?

mseeger · 2025-02-21T07:57:03Z

Note that with DeepSeek (I am involved in trying to bring this to Hugging Face), there is a lot of movement now not to ignore KV caching in the future. They even released a paper on how they can train with large contexts.

mseeger requested review from lantiga and t-vi as code owners February 6, 2025 09:37

mseeger force-pushed the kvcache3 branch from ff817a9 to e27a445 Compare February 6, 2025 15:17

Support for KV caching and batched inference

30fbada

mseeger force-pushed the kvcache3 branch from e27a445 to 30fbada Compare February 8, 2025 17:24

mseeger added 2 commits February 14, 2025 10:48

Added test

7953793

Set enable_gqa flag in scaled_dot_product_attention

d2e9e45
