Implement prefix caching #95

Merged: EricLBuehler merged 10 commits into master from prefix_caching on Apr 11, 2024
Conversation

EricLBuehler (Owner) commented on Apr 8, 2024

Refs #91.

This PR implements prefix caching, a concept introduced with RadixAttention. Running the model over the prompt (the "prefill" step) is expensive, so we cache the KV cache that this step produces. We implement a PrefixCacheManager which stores input prompt tokens alongside their prefilled KV caches and which avoids OOM by evicting caches.

Caching

  • The prefix cache manager holds all the KV caches from the prompts of previous sequences. Internally, this is a HashMap<token ids, kv cache>. The caching step is very simple and does not copy any KV data: the KV cache stays on the device, and only the KV cache Tensors (i.e., their handles) are cloned.
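A minimal sketch of that map, using hypothetical PrefixCacheManager and KvCache types (names and signatures here are illustrative, not the exact API in this PR):

```rust
use std::collections::HashMap;

/// Placeholder for the per-layer (key, value) cache tensors. In the real
/// implementation these are device tensors; cloning them copies only the
/// tensor handles, so the KV data itself never leaves the device.
#[derive(Clone)]
struct KvCache;

/// Hypothetical prefix cache manager: maps a prompt's token ids to the
/// KV cache produced by prefilling that prompt.
struct PrefixCacheManager {
    caches: HashMap<Vec<u32>, KvCache>,
}

impl PrefixCacheManager {
    fn new() -> Self {
        Self {
            caches: HashMap::new(),
        }
    }

    /// Caching step: store the prefilled KV cache keyed by the prompt tokens.
    /// `kv_cache.clone()` clones handles only; no device memory is copied.
    fn add(&mut self, prompt_tokens: Vec<u32>, kv_cache: &KvCache) {
        self.caches.insert(prompt_tokens, kv_cache.clone());
    }
}
```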

Evicting

  • When the scheduler runs, it checks whether there is enough device memory to allocate the KV cache for the next token. If there is not, the prefix cache manager evicts some KV caches from device memory to the CPU (see the sketch after the Loading section below).

Loading

  • When a sequence is received, it will check with the prefix cache manager to see if it can skip the prefill step based on its prompt.
  • Further step: if a prompt exists in the prefix cache manager which contains more tokens after the input prompt (that is, we can truncate the KV cache), we can still skip the prefill step.
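Continuing the hypothetical sketch above, the eviction and loading paths might look roughly like this (the to_cpu/to_device helpers and the take-the-first-n eviction policy are assumptions for illustration, not this PR's actual code):

```rust
impl KvCache {
    /// Assumed helpers: in the real code these would move the cached tensors
    /// between device and host memory (e.g. via a to_device-style call).
    fn to_cpu(&self) -> KvCache {
        KvCache
    }
    fn to_device(&self) -> KvCache {
        KvCache
    }
}

impl PrefixCacheManager {
    /// Evicting: when the scheduler reports that there is not enough device
    /// memory for the next token's KV cache, move some cached prefixes to the
    /// CPU. Which entries to evict (LRU, largest-first, ...) is a policy choice.
    fn evict_n_to_cpu(&mut self, n: usize) {
        let keys: Vec<Vec<u32>> = self.caches.keys().take(n).cloned().collect();
        for key in keys {
            if let Some(kv) = self.caches.get_mut(&key) {
                *kv = kv.to_cpu();
            }
        }
    }

    /// Loading: when a new sequence arrives, check whether its prompt was
    /// already prefilled. On a hit, return the cached KV cache (moved back to
    /// the device) so the prefill step can be skipped entirely.
    fn lookup(&self, prompt_tokens: &[u32]) -> Option<KvCache> {
        self.caches.get(prompt_tokens).map(|kv| kv.to_device())
    }
}
```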

EricLBuehler added the new feature, optimization, backend, and processing labels on Apr 8, 2024
EricLBuehler changed the title from "Implement simple prefix cache manager" to "Implement prefix caching" on Apr 8, 2024
github-actions bot commented Apr 8, 2024

Code Metrics Report
  ───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        56     18710     1364       763    16583       1012
───────────────────────────────────────────────────────────────────────────────
Total                       56     18710     1364       763    16583       1012
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 15,515
Estimated Schedule Effort 10.692170 months
Estimated People Required 4.283430
───────────────────────────────────────────────────────────────────────────────
Processed 629442 bytes, 0.629 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
  

EricLBuehler (Owner, Author) commented on Apr 10, 2024

@lucasavila00, what do you think about this? It is ready for merging, but I wanted to get your opinion.

lucasavila00 (Contributor) commented:

> @lucasavila00, what do you think about this? It is ready for merging, but I wanted to get your opinion.

This looks amazing, thank you!

I read the code and, to my limited understanding, saw no issues.

I'll clone and run it.

lucasavila00 (Contributor) commented on Apr 10, 2024

I added a few dbg! calls and can see that in Chat a second message doesn't reuse the KV cache of the first. I see from the PR body that this is expected.

Ideally this would be supported. When doing constrained generation of, say, Markdown, one needs to send the request with the # that marks a heading, let the model complete the heading, then re-send the prompt + heading so the model generates a list, and so on.

Like in this image from the SGLang blog:

[image: constrained-generation example from the SGLang blog]

Every time it jumps forward, it doesn't re-compute the cached data.

Besides that, even when the cache was hit, I could not tell a performance difference on my system 🤔

EricLBuehler (Owner, Author) commented:
Yes, it doesn't cache the entire KV cache of the previous request, only the prompt. Given sampling, I don't think that would be a good idea, but please let me know.

I have fixed the lack of performance improvement, it is now much faster!

lucasavila00 (Contributor) commented on Apr 11, 2024

> Yes, it doesn't cache the entire KV cache of the previous request, only the prompt. Given sampling, I don't think that would be a good idea, but please let me know.

That would be good enough. In my example it'd re-calculate a handful of tokens every request. My understanding was that it wouldn't use any part of the cache if the prompt is not entirely cached.

"Further step: if a prompt exists in the prefix cache manager which contains more tokens after the input prompt (that is, we can truncate the KV cache), we can still skip the prefill step." I was wondering if it wouldn't be possible to also have it the other way around. If there is a cache for the beginning of the prompt, re-use that part too.

> I have fixed the lack of performance improvement, it is now much faster!

Yes, it is for me too.

EricLBuehler (Owner, Author) left a comment:

Looks good.

EricLBuehler merged commit 6d7466b into master on Apr 11, 2024
11 checks passed
EricLBuehler deleted the prefix_caching branch on April 11, 2024 at 11:16
EricLBuehler restored the prefix_caching branch on April 15, 2024 at 10:08
EricLBuehler deleted the prefix_caching branch on April 15, 2024 at 10:08