Support for KV caching and batched inference #1934
base: main
Conversation
Hey, great work @mseeger. Can we decouple things a lot, though? Some initial thoughts:
Again, super good stuff in the PR! I think there are a few things to split out and consider individually, and then maybe we can have a video call about the core KVCache things, wdyt? Thanks for the initiative for better KV caching!
Hello, sure, we can have a call. I am in the Central European (Germany) time zone.
My impression was that batched generation is not really there. But if it is, I am not asking to change it. One thing is important, though: KV caches really work by filling positions sequentially. So, if you have filled positions up to some index, the next call has to continue right at the following position.
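A minimal sketch of that invariant, assuming nothing about the PR's actual classes (the `ToyCache` name, shapes, and `write` method here are purely illustrative):

```python
import torch

class ToyCache:
    """Illustration only (not the PR's API): keys and values can only be written
    at the next unfilled position, so token positions must be processed in order."""

    def __init__(self, batch: int, heads: int, max_len: int, head_dim: int):
        self.k = torch.zeros(batch, heads, max_len, head_dim)
        self.v = torch.zeros(batch, heads, max_len, head_dim)
        self.next_pos = 0  # first position that has not been written yet

    def write(self, start_pos: int, k_new: torch.Tensor, v_new: torch.Tensor):
        if start_pos != self.next_pos:
            raise ValueError(
                f"cache is filled up to position {self.next_pos}; "
                f"writes must continue there, got start_pos={start_pos}"
            )
        num = k_new.shape[2]  # k_new: (batch, heads, num_new, head_dim)
        self.k[:, :, start_pos:start_pos + num] = k_new
        self.v[:, :, start_pos:start_pos + num] = v_new
        self.next_pos += num
```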
Also, the implementation right now allows you to send in KV cache objects from the start. If you do not do that, it will create them by default. Note that prefill here means that I can do a single pass and the cache can take it all, without having to evict anything. It does not mean that this will encode even the shortest prompt in the batch. If prompts are longer than the max prefill length, you need to do it sequentially in chunks. Maybe there is an easier way; we can discuss.
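For the case where prompts exceed the max prefill length, the chunked processing could look roughly like this; `model`, its `(chunk, input_pos=...)` call signature, and `max_prefill_length` are assumptions for illustration, not the PR's actual interface:

```python
import torch

def prefill_in_chunks(model, prompt_ids: torch.Tensor, max_prefill_length: int) -> torch.Tensor:
    """Run a (batch, seq_len) prompt through the model in chunks no longer than
    max_prefill_length, so the KV caches are filled strictly left to right.
    Returns the logits produced by the last chunk."""
    seq_len = prompt_ids.shape[1]
    logits = None
    for start in range(0, seq_len, max_prefill_length):
        chunk = prompt_ids[:, start:start + max_prefill_length]
        # positions covered by this chunk; the model is assumed to append the
        # chunk's keys/values to its caches at exactly these positions
        input_pos = torch.arange(start, start + chunk.shape[1])
        logits = model(chunk, input_pos=input_pos)
    return logits
```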
It is annoying that I cannot show you the KV cache code I have. But in a talk, I could explain why a few things are the way they are. Of course, I am not on top of the other constraints you guys have.
You may ask why. We can do things so that on the very first call to the model, you'd call
This I could do. That would indeed be a little simpler.
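If the suggestion is read as creating the caches lazily inside the model on the first inference call, rather than having the caller construct them up front, a rough sketch of that pattern could look as follows (all names here are hypothetical, not the PR's API):

```python
import torch
import torch.nn as nn

class LazyCacheModel(nn.Module):
    """Hypothetical pattern only: KV caches are created inside the model on the
    very first inference call, instead of being constructed by the caller."""

    def __init__(self, n_layer: int, cache_factory):
        super().__init__()
        self.n_layer = n_layer
        self.cache_factory = cache_factory  # callable returning one cache per layer
        self.kv_caches = None

    def forward(self, idx: torch.Tensor, input_pos=None) -> torch.Tensor:
        if input_pos is not None and self.kv_caches is None:
            # first inference call: build one cache per transformer block
            self.kv_caches = [self.cache_factory() for _ in range(self.n_layer)]
        # ... the transformer blocks would read from / write to self.kv_caches here ...
        return idx  # placeholder output so the sketch is self-contained
```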
@t-vi Let me know what the next steps here should be. If I understand correctly, I could:
Hi, so I think we should try to break things down. We could either start with the core caching itself and try to see how to integrate it with minimal changes, or see what the deal with batching and prefill is first.
Hello @t-vi, let me try to break things down. The changes are these:
If I understand you correctly, you complain about 2., especially the automatic creation of a default cache when nothing is passed in, and the change of
Would that be what you prefer?
As for 1. and 3., in the end they go together, but I can try to split them into two. I'd first do 1., keeping the generation code in place, which would however not work for batches and would not properly support the sequential processing of prompts. Doing 3. first is not really sensible, because it requires things from 1. What do you think?
Note that with DeepSeek (I am involved in trying to bring this to Hugging Face), there is now a lot of momentum not to ignore KV caching in the future. They even released a paper on how they can train with large contexts.
Adds an abstraction for key-value caches and implements batched inference.
I am also adding two baseline KV caches: the default one from before (all keys and values are stored) and a last-recent one.
The abstraction contains methods not used by these baselines, but they are required to implement more advanced KV caches such as Heavy Hitter Oracle (H2O).
I have implemented some of these, but I may not be allowed to contribute them here (I work for a company). I'll see what I can do.
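To make the shape of such an abstraction concrete, here is a minimal sketch of a cache interface with a store-everything baseline and a last-recent (sliding window) variant; the class and method names are illustrative only, not the PR's actual API:

```python
import torch

class KVCache:
    """Interface sketch: a per-layer cache that is filled sequentially."""

    def forward(self, k_new: torch.Tensor, v_new: torch.Tensor):
        """Store keys/values for the next positions and return the keys/values the
        new tokens should attend to. Shapes: (batch, heads, num_new, head_dim)."""
        raise NotImplementedError


class DenseKVCache(KVCache):
    """Baseline: all keys and values are kept (the previous default behaviour)."""

    def __init__(self, batch: int, heads: int, max_len: int, head_dim: int):
        self.k = torch.zeros(batch, heads, max_len, head_dim)
        self.v = torch.zeros(batch, heads, max_len, head_dim)
        self.next_pos = 0

    def forward(self, k_new, v_new):
        num = k_new.shape[2]
        # assumes the caller never writes past max_len
        self.k[:, :, self.next_pos:self.next_pos + num] = k_new
        self.v[:, :, self.next_pos:self.next_pos + num] = v_new
        self.next_pos += num
        return self.k[:, :, :self.next_pos], self.v[:, :, :self.next_pos]


class LastRecentKVCache(KVCache):
    """Keeps only the most recent `window` positions (a simple eviction policy)."""

    def __init__(self, window: int):
        self.window = window
        self.k = None
        self.v = None

    def forward(self, k_new, v_new):
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        # evict everything older than the window
        self.k = self.k[:, :, -self.window:]
        self.v = self.v[:, :, -self.window:]
        return self.k, self.v
```

More advanced policies such as H2O would need extra hooks beyond `forward`, e.g. access to attention weights to decide which positions to evict, which is why the abstraction can contain methods the two baselines do not use.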