Implement prefix caching #95
Conversation
Code Metrics Report
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines    Blanks  Comments      Code  Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        56     18710      1364       763     16583        1012
───────────────────────────────────────────────────────────────────────────────
Total                       56     18710      1364       763     16583        1012
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 15,515
Estimated Schedule Effort 10.692170 months
Estimated People Required 4.283430
───────────────────────────────────────────────────────────────────────────────
Processed 629442 bytes, 0.629 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
@lucasavila00, what do you think about this? It is ready for merging, but I wanted to get your opinion.
This looks amazing, thank you! I read the code and, with my limited understanding, saw no issue. I'll clone and run it.
I added a few dbg! calls and I can see that in Chat a second message doesn't use the KV cache of the first. I see in the PR body that this is the case, too. Ideally this would be supported.
When doing constrained generation of, say, Markdown, one needs to send the request with the … Like in this image from the SGLang blog: every time it jumps forward, it doesn't re-compute the cached data.
Besides that, even when the cache was hit, I could not tell a performance difference on my system 🤔
Yes, it doesn't cache the entire KV cache of the previous request, only the prompt's. Given sampling, I don't think that would be a good idea, but please let me know. I have fixed the lack of performance improvement; it is now much faster!
That would be good enough. In my example it would re-calculate only a handful of tokens per request. My understanding was that it wouldn't use any part of the cache if the prompt is not entirely cached: "Further step: if a prompt exists in the prefix cache manager which contains more tokens after the input prompt (that is, we can truncate the KV cache), we can still skip the prefill step." I was wondering if it wouldn't be possible to also have it the other way around: if there is a cache for the beginning of the prompt, re-use that part too.
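For illustration, here is a minimal sketch of the longest-common-prefix lookup that "re-use the beginning of the prompt" would require. The names (`CachedSequence`, `find_longest_prefix`) are hypothetical and not taken from this PR; it only shows the idea under that assumption.

```rust
/// Hypothetical cached entry: the token ids that were prefilled.
/// (In the real code this would also hold the per-layer KV cache tensors.)
struct CachedSequence {
    tokens: Vec<u32>,
}

/// Return the cached entry sharing the longest common prefix with `prompt`,
/// plus how many prompt tokens that prefix covers. The caller would then
/// only need to prefill the uncached suffix `prompt[matched..]`.
fn find_longest_prefix<'a>(
    cache: &'a [CachedSequence],
    prompt: &[u32],
) -> Option<(&'a CachedSequence, usize)> {
    cache
        .iter()
        .map(|entry| {
            // Count how many leading tokens match between the cached sequence
            // and the incoming prompt.
            let matched = entry
                .tokens
                .iter()
                .zip(prompt.iter())
                .take_while(|(a, b)| a == b)
                .count();
            (entry, matched)
        })
        .filter(|(_, matched)| *matched > 0)
        .max_by_key(|(_, matched)| *matched)
}
```

With such a lookup, a request whose prompt extends a previously cached prompt would only need to prefill the uncached suffix rather than the whole prompt.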
Yes, it is for me too.
Looks good.
Refs #91.
This PR implements prefix caching, a concept introduced in RadixAttention. It works by caching the KV cache that results from running the model over the prompt (the "prefill" step), which is expensive. We implement a PrefixCacheManager which stores input prompt tokens together with their prefilled KV caches and avoids OOM by evicting caches.
Caching
Caches are stored in a HashMap<token ids, kv cache>. The caching step is very simple and does not copy the KV cache, which is left on the device; only the KV cache Tensors are cloned.
Evicting
Loading
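As a rough sketch of the manager described above: the placeholder `Tensor` type (standing in for the real device tensors), the method names (`add`, `evict_to_capacity`, `get`), and the eviction policy are illustrative assumptions, not necessarily what this PR's PrefixCacheManager does.

```rust
use std::collections::HashMap;

/// Stand-in for the real device tensor type. Cloning a tensor handle does not
/// copy the underlying device buffer, which is what makes caching inexpensive.
#[derive(Clone)]
struct Tensor;

/// One (key, value) cache pair per layer.
type KvCache = Vec<(Tensor, Tensor)>;

/// Hedged sketch of a prefix cache manager keyed by prompt token ids.
struct PrefixCacheManager {
    caches: HashMap<Vec<u32>, KvCache>,
    capacity: usize,
}

impl PrefixCacheManager {
    fn new(capacity: usize) -> Self {
        Self { caches: HashMap::new(), capacity }
    }

    /// Caching: store the prefilled KV cache for this prompt. Only the tensor
    /// handles are cloned; the data stays on the device.
    fn add(&mut self, prompt: Vec<u32>, kv: &KvCache) {
        self.caches.insert(prompt, kv.clone());
        self.evict_to_capacity();
    }

    /// Evicting: drop entries until we are back under capacity. Here the
    /// victim is arbitrary; a real implementation would pick by recency or size.
    fn evict_to_capacity(&mut self) {
        while self.caches.len() > self.capacity {
            let victim = match self.caches.keys().next() {
                Some(k) => k.clone(),
                None => break,
            };
            self.caches.remove(&victim);
        }
    }

    /// Loading: if this exact prompt was prefilled before, reuse its KV cache
    /// and skip the prefill step entirely.
    fn get(&self, prompt: &[u32]) -> Option<&KvCache> {
        self.caches.get(prompt)
    }
}
```

The key property is that neither caching nor loading copies tensor data: cloning a tensor handle is cheap, so hitting the cache costs far less than re-running the prefill.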