[RFC] Automatic Prefix Caching #2614
Comments
Are you looking for any contributions to deliver this feature, @zhuohan123?
Yes, indeed. If anyone is interested, please let me know!
@zhuohan123 Please let me know how I can help deliver this feature.
I am very interested. I will reach out to you on Discord (and will look for you at the meetup on Wednesday).
I am working on a PR for this.
Would this cache be reset for each batch, or could it also be used to persistently cache across different batches/runs? I'm interested in low-latency applications, and being able to incrementally append input while retaining the KV cache would be very useful.
@atyshka This feature will automatically enable reuse of the cache across different batches and runs.
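(For anyone wanting to try this once it lands: a minimal usage sketch, assuming the `enable_prefix_caching` engine argument this feature exposes; the flag name and model are placeholders to check against your vLLM version.)

```python
from vllm import LLM, SamplingParams

# Assumption: prefix caching is toggled via the enable_prefix_caching engine
# argument; verify the flag name against the vLLM version you are running.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", enable_prefix_caching=True)

shared_prefix = "You are a helpful assistant. Here is the document: <long document>\n\n"
params = SamplingParams(max_tokens=64)

# The second call reuses the KV cache built for the shared prefix by the first
# call, even though the two prompts arrive in different batches/runs.
print(llm.generate([shared_prefix + "Question 1: ..."], params))
print(llm.generate([shared_prefix + "Question 2: ..."], params))
```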
Thank you @zhuohan123 for your analysis. We can rewrite it as [...]
I made my analysis of attention sink in #1304 (comment), and the main challenge is the necessity to move the KV cache if you remove a token in the middle.
@zhuohan123 we're wrapping up our PR for P0 of this RFC, and we had a question about the prefix length that will be encoded in each block and used as a tie breaker for eviction.
It should be the last one. The goal is to identify the leaf nodes in the prefix tree, and the prefix length here should be the same as the tree depth.
Are there any experiment results showing how prefix caching could improve latency and throughput?
Since many people are going to read this, I want to add a few clarifications here. I do not see any items listed here that cannot be easily implemented in the tree structure.
In SGLang, we use TokenAttention (block_size/page_size = 1), which also simplifies many things. With optimized kernels from FlashInfer, it achieves the same performance as larger block sizes.
In LMDeploy TurboMind, we've implemented the Automatic Prefix Cache by incorporating both HashTable and RadixTree methods. The overall implementation is very straightforward, requiring no modifications to the kernel while achieving good compatibility with the existing framework features. Currently, since TurboMind does not support Sliding Window and LoRA, we have no plans for a second-phase optimization at this time. InternLM/lmdeploy#1450
@gawainx Big speedup in my testing with vLLM on 2x4090s running a 70B model. You just need to make sure it's compatible with your load balancer (if any) so requests get preferentially routed to the correct machine.

RE @zhyncs' comment above, I'm seeing a massive speedup in my testing of LMDeploy, in part because it allows 8-bit and 4-bit KV caching in conjunction with prefix caching. Currently vLLM does not allow a quantized cache in conjunction with prefix caching (nor does it allow chunked prefill with prefix caching, as an aside), so the cache can only store about eight 2000-token prefixes, vs. about 32 in LMDeploy (with a 4-bit cache).

With LMDeploy, for a 2000-token prompt + 100 tokens of output, if the prompt is cached, and limiting concurrency so that the maximum time-to-first-token is 3 seconds for all requests, end-to-end generated tokens/sec summed across all concurrent requests goes from ~10 to ~300 (!!) for a 70B Llama 2 model. For comparison, vLLM goes from ~10 to ~100 under the same assumptions. In a more realistic scenario (i.e., where not all prompts are cached), I'm expecting closer to a 4x speedup from LMDeploy, which is amazing. Definitely worth testing for your use case! I think all LMDeploy needs is to fix [...]
Is it hard to implement FP8 KV caching in vLLM? Can't they just take LMDeploy's implementation?
This PR enables automatic prefix caching on Intel Gaudi HPUs. Please refer to this [RFC](vllm-project#2614) for detailed information about prefix caching.
This RFC discusses our plan for implementing automatic prefix caching in vLLM.
High-level idea
We observe that every block in the KV cache can be uniquely identified by the tokens within the block together with all of the prefix tokens that precede it, i.e. by hash(prefix tokens, tokens in this block).
With this, we can add another indirection in vLLM's KV cache management: logical KV blocks map to their hash values, and the hash values map to physical KV blocks.
Then, all the sharing in vLLM, including sharing prefixes, can be achieved by having logical blocks point to the physical block with the same hash value. Automatic prefix caching can be achieved by not immediately freeing blocks whose reference count drops to zero, keeping them around as cache entries instead. Specifically, this design enables us to manage the KV blocks like ordinary caches in operating systems.
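To make the indirection concrete, here is a minimal sketch under the assumptions above; `hash_block`, `PhysicalBlock`, and `HashedBlockTable` are hypothetical names for illustration, not vLLM's actual classes:

```python
from dataclasses import dataclass


def hash_block(prefix_tokens: tuple[int, ...], block_tokens: tuple[int, ...]) -> int:
    # A block is uniquely identified by all tokens before it plus its own tokens.
    return hash((prefix_tokens, block_tokens))


@dataclass
class PhysicalBlock:
    block_id: int
    ref_count: int = 0


class HashedBlockTable:
    """Hypothetical sketch of logical blocks -> hash table -> physical blocks."""

    def __init__(self) -> None:
        self.cached: dict[int, PhysicalBlock] = {}
        self.next_id = 0

    def get_or_allocate(self, prefix_tokens: list[int], block_tokens: list[int]) -> PhysicalBlock:
        h = hash_block(tuple(prefix_tokens), tuple(block_tokens))
        block = self.cached.get(h)
        if block is None:
            # Cache miss: allocate a fresh physical block and register it by hash.
            block = PhysicalBlock(block_id=self.next_id)
            self.next_id += 1
            self.cached[h] = block
        # Either way, the logical block simply points at the physical block for this
        # hash, so sequences sharing a prefix automatically share its KV blocks.
        block.ref_count += 1
        return block
```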
We can maintain the following information in every block:
- Block's hash
- Reference count
- Last accessed time
- Total access count
- The length of the prefix of the block (equal to its depth in the implicit prefix tree)

Then, for example, the following cache eviction policy will give the same policy as in RadixAttention:
1. Only evict blocks with ref count == 0.
2. Among those, evict the least recently accessed block first (LRU).
3. If multiple blocks share the same last access time, evict the block with the longest prefix first, so that leaf blocks of the prefix tree are evicted before their ancestors.
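As a sketch of that ordering (hypothetical field names, not the actual vLLM scheduler code), the policy can be expressed as a sort key over the per-block metadata listed above:

```python
from dataclasses import dataclass


@dataclass
class BlockMetadata:
    block_hash: int
    ref_count: int
    last_accessed: float  # timestamp of the last access
    prefix_length: int    # depth of the block in the implicit prefix tree


def eviction_key(block: BlockMetadata) -> tuple[float, int]:
    # Least recently used first; ties broken by evicting the longest prefix
    # (i.e. the deepest leaf of the prefix tree) first.
    return (block.last_accessed, -block.prefix_length)


def pick_victim(blocks: list[BlockMetadata]) -> BlockMetadata | None:
    # Only blocks that no running sequence references are evictable.
    evictable = [b for b in blocks if b.ref_count == 0]
    return min(evictable, key=eviction_key) if evictable else None
```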
Major benefits of this design over a KV block Trie
Notes
Deliverables
P0
P1
P2