Implement prefix caching #95
Conversation
Code Metrics Report
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines    Blanks  Comments      Code  Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        56     18710      1364       763     16583        1012
───────────────────────────────────────────────────────────────────────────────
Total                       56     18710      1364       763     16583        1012
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 15,515
Estimated Schedule Effort 10.692170 months
Estimated People Required 4.283430
───────────────────────────────────────────────────────────────────────────────
Processed 629442 bytes, 0.629 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
@lucasavila00, what do you think about this? It is ready for merging, but I wanted to get your opinion.
This looks amazing, thank you! I read the code and, with my limited understanding, saw no issue. I'll clone and run it.
I added a few dbg! calls and I can see that in Chat a second message doesn't use the KV cache of the first. I see in the PR body that this is the case, too. Ideally this would be supported.
When doing constrained generation of, say, Markdown, one needs to send the request with the … Like in this image from the SGLang blog: every time it jumps forward, it doesn't re-compute the cached data.
Besides that, even when the cache was hit, I could not tell a performance difference on my system 🤔
Yes, it doesn't cache the entire KV cache of the previous request, only the prompt's. Given sampling, I don't think that would be a good idea, but please let me know. I have fixed the lack of performance improvement; it is now much faster!
That would be good enough. In my example it would re-calculate only a handful of tokens per request. My understanding was that it wouldn't use any part of the cache if the prompt is not entirely cached: "Further step: if a prompt exists in the prefix cache manager which contains more tokens after the input prompt (that is, we can truncate the KV cache), we can still skip the prefill step." I was wondering if it wouldn't be possible to also have it the other way around: if there is a cache for the beginning of the prompt, re-use that part too.
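For illustration, here is a minimal sketch of the longest-common-prefix lookup that "re-use the beginning of the prompt" would require. The names (`CachedSequence`, `find_longest_prefix`) are hypothetical and not taken from this PR; it only shows the idea under that assumption.

```rust
/// Hypothetical cached entry: the token ids that were prefilled.
/// (In the real code this would also hold the per-layer KV cache tensors.)
struct CachedSequence {
    tokens: Vec<u32>,
}

/// Return the cached entry sharing the longest common prefix with `prompt`,
/// plus how many prompt tokens that prefix covers. The caller would then
/// only need to prefill the uncached suffix `prompt[matched..]`.
fn find_longest_prefix<'a>(
    cache: &'a [CachedSequence],
    prompt: &[u32],
) -> Option<(&'a CachedSequence, usize)> {
    cache
        .iter()
        .map(|entry| {
            // Count how many leading tokens match between the cached sequence
            // and the incoming prompt.
            let matched = entry
                .tokens
                .iter()
                .zip(prompt.iter())
                .take_while(|(a, b)| a == b)
                .count();
            (entry, matched)
        })
        .filter(|(_, matched)| *matched > 0)
        .max_by_key(|(_, matched)| *matched)
}
```

With such a lookup, a request whose prompt extends a previously cached prompt would only need to prefill the uncached suffix rather than the whole prompt.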
Yes, it is for me too.
Looks good.
Refs #91.
This PR implements prefix caching, a concept introduced in RadixAttention. It works by caching the KV cache that results from running the model over the prompt (the "prefill" step), which is expensive. We implement a PrefixCacheManager which stores input prompt tokens together with their prefilled KV caches and avoids OOM by evicting caches.
Caching
Caches are stored in a HashMap<token ids, kv cache>. The caching step is very simple and does not copy the KV cache, which is left on the device; only the KV cache Tensors are cloned.
Evicting
Loading
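As a rough sketch of the manager described above: the placeholder `Tensor` type (standing in for the real device tensors), the method names (`add`, `evict_to_capacity`, `get`), and the eviction policy are illustrative assumptions, not necessarily what this PR's PrefixCacheManager does.

```rust
use std::collections::HashMap;

/// Stand-in for the real device tensor type. Cloning a tensor handle does not
/// copy the underlying device buffer, which is what makes caching inexpensive.
#[derive(Clone)]
struct Tensor;

/// One (key, value) cache pair per layer.
type KvCache = Vec<(Tensor, Tensor)>;

/// Hedged sketch of a prefix cache manager keyed by prompt token ids.
struct PrefixCacheManager {
    caches: HashMap<Vec<u32>, KvCache>,
    capacity: usize,
}

impl PrefixCacheManager {
    fn new(capacity: usize) -> Self {
        Self { caches: HashMap::new(), capacity }
    }

    /// Caching: store the prefilled KV cache for this prompt. Only the tensor
    /// handles are cloned; the data stays on the device.
    fn add(&mut self, prompt: Vec<u32>, kv: &KvCache) {
        self.caches.insert(prompt, kv.clone());
        self.evict_to_capacity();
    }

    /// Evicting: drop entries until we are back under capacity. Here the
    /// victim is arbitrary; a real implementation would pick by recency or size.
    fn evict_to_capacity(&mut self) {
        while self.caches.len() > self.capacity {
            let victim = match self.caches.keys().next() {
                Some(k) => k.clone(),
                None => break,
            };
            self.caches.remove(&victim);
        }
    }

    /// Loading: if this exact prompt was prefilled before, reuse its KV cache
    /// and skip the prefill step entirely.
    fn get(&self, prompt: &[u32]) -> Option<&KvCache> {
        self.caches.get(prompt)
    }
}
```

The key property is that neither caching nor loading copies tensor data: cloning a tensor handle is cheap, so hitting the cache costs far less than re-running the prefill.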