
[RFC] Automatic Prefix Caching #2614

Closed
zhuohan123 opened this issue Jan 26, 2024 · 15 comments · Fixed by #2762
Labels
enhancement (New feature or request), RFC

Comments

@zhuohan123
Member

zhuohan123 commented Jan 26, 2024

This RFC discusses our plan for implementing automatic prefix caching in vLLM.

High-level idea

We observe that every block in the KV cache can be uniquely identified by

hash(prefix tokens, tokens in this block)

With this, we can add another indirection in vLLM's KV cache management:

Logical block table --> hash table --> physical block table.

Then, all the sharing in vLLM, including prefix sharing, can be achieved by having logical blocks point to the physical block with the same hash value. Automatic prefix caching follows from not immediately freeing blocks whose reference count drops to zero. In effect, this design lets us manage the KV blocks like an ordinary cache in an operating system.
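
To make the indirection concrete, here is a minimal sketch (not vLLM's actual code; `PhysicalBlock`, `block_hash`, `hash_to_block`, and `lookup_or_allocate` are illustrative names): every block is keyed by the hash of all tokens up to and including the tokens it holds, and logical blocks resolve to physical blocks through a shared hash table.

```python
# Minimal sketch, not vLLM's implementation: key every KV block by the hash of
# all tokens up to and including the tokens it holds, and resolve logical blocks
# to physical blocks through a shared hash table.

class PhysicalBlock:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.ref_count = 0

def block_hash(prefix_tokens: tuple, block_tokens: tuple) -> int:
    # Any collision-resistant hash over the full token prefix would do.
    return hash((prefix_tokens, block_tokens))

# Global hash table: block hash -> physical block (the extra level of indirection).
hash_to_block: dict = {}

def lookup_or_allocate(prefix_tokens, block_tokens, allocate_fn):
    """Return a shared physical block for this content, allocating on a miss."""
    h = block_hash(tuple(prefix_tokens), tuple(block_tokens))
    block = hash_to_block.get(h)
    if block is None:          # cache miss: allocate a fresh physical block
        block = allocate_fn()
        hash_to_block[h] = block
    block.ref_count += 1       # the requesting sequence now shares this block
    return block
```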

We can maintain the following information in every block:

  • Block's hash
  • Reference count
  • Last accessed time
  • Total access count
  • The prefix length of this block

Then, for example, the following cache eviction policy gives the same behavior as RadixAttention (a sketch follows the list below):

  1. Check the reference count first. Only evict blocks with ref count == 0.
  2. Then check the last accessed time. Prefer to free older blocks, following LRU.
  3. If the last accessed times are the same, check the prefix length. Free the block with the longer prefix first.
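
As a rough illustration of that ordering (field names such as `last_accessed` and `prefix_length` are assumptions, not vLLM's actual attributes), the three rules reduce to a single selection over evictable blocks:

```python
# Sketch of the eviction ordering above; field names are illustrative.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CachedBlock:
    block_hash: int
    ref_count: int
    last_accessed: float   # timestamp of the most recent access
    num_accesses: int
    prefix_length: int     # tokens up to and including this block

def pick_block_to_evict(blocks: List[CachedBlock]) -> Optional[CachedBlock]:
    # Rule 1: only blocks that no running sequence references are evictable.
    candidates = [b for b in blocks if b.ref_count == 0]
    if not candidates:
        return None
    # Rule 2: evict the least recently used block first (LRU).
    # Rule 3: on ties, evict the block with the longer prefix (deeper leaf) first.
    return min(candidates, key=lambda b: (b.last_accessed, -b.prefix_length))
```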

Major benefits of this design over a KV block trie

  • Sometimes, caching is not limited to prefix caching:
    • With Mistral's sliding window attention, we only need to cache the last tokens in the sliding window.
    • With attention sinks, we need to cache the first few tokens and the latest tokens.
  • Maintaining a hash table is simpler than maintaining a tree.
  • Extensible to more advanced caching policies (the one above is just an example).

Notes

  • An arbitrary caching policy may free a block in the middle of a prefix. Then we need an attention kernel that can compute attention on sequences like the following: “ooooxxxxxooooxxxoooxxxooooxxx”, where attention must be computed for all “x” tokens (a tiny illustration follows this list). This kernel can be implemented later and is not required for the first version.
  • We only cache complete blocks; partial blocks are kept out of the hash table.
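
Purely to show the shape of that input (no kernel code here), the positions such a kernel would need to compute can be read off a mask like the one in the note:

```python
# Illustration only: extract the positions that still need attention computed
# ("x") from a cached/evicted pattern like the one in the note above.
pattern = "ooooxxxxxooooxxxoooxxxooooxxx"
to_compute = [i for i, c in enumerate(pattern) if c == "x"]
print(to_compute)   # query positions a gather-style kernel would handle
```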

Deliverables

P0

  • Make every complete KV block cacheable. Do not immediately free KV blocks with ref count 0.
  • Implement the cache eviction policy above, behind a well-abstracted eviction policy class.
    • The cache eviction policy class should take a sequence of token IDs (or block hashes) and return a list of blocks. Some of the blocks may already be in the cache, and some may be new blocks just freed up by the eviction policy.
  • Refactor the current block table to use hashes:
    • Add the attributes above to every block object in vLLM.
    • For every request, keep the list of block objects as its block table.
    • [new] Implement global tables of blocks (see the sketch after this list):
      • Two tables: a complete block table and a partial block table.
      • When an incomplete block becomes complete, we need to merge it with an existing complete block or promote it to a new complete block.
  • The preemption policy is kept the same as before.
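
Continuing the sketch style above (names like `complete_blocks`, `partial_blocks`, and `promote` are hypothetical, not the eventual PR's API), the two global tables and the promotion step could look like:

```python
# Hypothetical sketch of the two global tables and block promotion from P0.

class Block:
    def __init__(self):
        self.ref_count = 1
        self.token_ids: list = []   # tokens stored in this block

complete_blocks: dict = {}   # full-prefix hash -> complete (cacheable) block
partial_blocks: dict = {}    # per-request block id -> partial block (not cacheable)

def promote(block_id: int, full_hash: int) -> Block:
    """Called once a partial block fills up and its full-prefix hash is known."""
    block = partial_blocks.pop(block_id)
    existing = complete_blocks.get(full_hash)
    if existing is not None:
        # Merge: another request already cached identical content under this hash,
        # so the sequence re-points to it and the duplicate copy can be freed.
        existing.ref_count += block.ref_count
        return existing
    # Otherwise promote this block into the complete-block table as a new entry.
    complete_blocks[full_hash] = block
    return block
```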

P1

  • Make sure the policy works for sliding windows.
  • Faster hash function

P2

  • Kernel optimization for "ooxxxoooxxxoooxxx" case.
  • Better preemption strategy for OOM cases.
  • Support block swapping.
zhuohan123 added the RFC and enhancement (New feature or request) labels on Jan 26, 2024
@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Jan 26, 2024

Are you looking for any contributions to deliver this feature, @zhuohan123?

simon-mo pinned this issue on Jan 26, 2024
@zhuohan123
Member Author

Are you looking for any contributions to deliver this feature, @zhuohan123?

Yes, indeed. If anyone is interested, please let me know!

@zcnrex
Contributor

zcnrex commented Jan 29, 2024

@zhuohan123 Please let me know how I can help deliver this feature

@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Jan 29, 2024

Are you looking for any contributions to deliver this feature, @zhuohan123?

Yes, indeed. If anyone is interested, please let me know!

I am very interested. I will reach out to you on Discord (and will look for you at the meetup on Wednesday).

@robertgshaw2-neuralmagic
Collaborator

I am working on a PR for this

@atyshka

atyshka commented Jan 31, 2024

Would this cache be reset for each batch, or could it also be used to persistently cache across different batches/runs? I'm interested in low-latency applications, and being able to incrementally append input while retaining the KV cache would be very useful.

@robertgshaw2-neuralmagic
Collaborator

@atyshka this feature will automatically enable reuse of the cache across different batches and runs.

@sh1ng
Contributor

sh1ng commented Feb 9, 2024

Thank you @zhuohan123 for your analysis.
Correct me if I'm wrong, but I don't see much difference between a hash table and a tree if we consider KV blocks as leaves. IIUC, that is what you are proposing with

hash(prefix tokens, tokens in this block)

We can rewrite it as

hash(parent_block_hash, tokens in this block)
parent_block_hash = 0 for the first block
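
A small sketch of that chained formulation (illustrative only, not code from either project): each block's hash folds in its parent's hash, so two sequences that share a prefix get identical hashes for the shared blocks, and the flat hash table recovers the same structure as a prefix tree.

```python
# Illustrative only: chained block hashing, where each block's key is derived
# from its parent block's key plus the tokens inside the block.
def chained_block_hashes(token_ids: list, block_size: int) -> list:
    hashes = []
    parent_hash = 0    # 0 for the first block, as described above
    for start in range(0, len(token_ids), block_size):
        block_tokens = tuple(token_ids[start:start + block_size])
        parent_hash = hash((parent_hash, block_tokens))
        hashes.append(parent_hash)
    return hashes

# Sequences sharing a prefix produce identical hashes for the shared blocks.
a = chained_block_hashes(list(range(40)), block_size=16)
b = chained_block_hashes(list(range(32)) + [99] * 8, block_size=16)
assert a[:2] == b[:2]   # first two blocks (tokens 0..31) are shared
```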

I did my own analysis of attention sinks in #1304 (comment), and the main challenge is the need to move the KV cache if you remove a token in the middle.
We can also reserve a few blocks for system/reusable prompts. The attention sink may use a block filled solely with \n tokens.

@SageMoore
Contributor

@zhuohan123 we're wrapping up our PR for P0 of this RFC, and we had a question about the prefix length that will be encoded in each block and used as a tie-breaker for eviction.
Should this prefix length be:

  • The full prefix length of the Sequence.
  • The number of prefix tokens in the current block, i.e., a maximum value of block.block_size.
  • The number of prefix tokens up to and including the current block? So each subsequent prefix block would have a higher value for block.prefix_length.

@zhuohan123
Member Author

@zhuohan123 we're wrapping up our PR for P0 of this RFC, and we had a question about the prefix length that will be encoded in each block and used as a tie-breaker for eviction.

Should this prefix length be:

  • The full prefix length of the Sequence.

  • The number of prefix tokens in the current block, i.e., a maximum value of block.block_size.

  • The number of prefix tokens up to and including the current block? So each subsequent prefix block would have a higher value for block.prefix_length.

It should be the last one. The goal is to identify the leaf nodes in the prefix tree. And the prefix length here should be the same as the tree depth.
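
A tiny worked example of that interpretation (a block size of 16 is assumed here purely for illustration):

```python
# Assumed block size of 16, purely for illustration: prefix_length counts all
# tokens up to and including the block, so deeper blocks carry larger values.
block_size = 16
num_complete_blocks = 3   # e.g. a 48-token cached prefix

prefix_lengths = [(i + 1) * block_size for i in range(num_complete_blocks)]
print(prefix_lengths)     # [16, 32, 48] -- the last block is the deepest (leaf) node
```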

@gawainx

gawainx commented Mar 26, 2024

Are there any experimental results showing how prefix caching improves latency and throughput?

@merrymercy
Contributor

merrymercy commented May 16, 2024

Since many people are going to read this, I want to add a few clarifications here.

I do not see any items listed here that cannot be easily implemented in the tree structure.

  • For sliding window/attention sink, you can keep the metadata in the tree for matching and drop the GPU tensors for KV cache. You need to save the metadata anyway.
  • For multiple LoRA adapters, insert the ID of the LoRA adapter before the first token. Then it is done.
  • For image/video tokens, use the hash of the image/video as the key. It is already implemented in SGLang.
  • For advanced scheduling/eviction policy, having a tree structure can actually give you more information. You may want to evict a whole subtree or prioritize requests under a subtree. Having access to the tree structure makes the advanced scheduling policy possible.
  • With a radix tree, you can make matching very efficient; see Optimize radix tree matching (sgl-project/sglang#364).

In SGLang, we use TokenAttention (block_size/page_size = 1), which also simplifies many things. With optimized FlashInfer kernels, it achieves the same performance as larger block sizes.

@zhyncs
Contributor

zhyncs commented May 17, 2024

Since many people are going to read this, I want to add a few clarifications here.

I do not see any items listed here that cannot be easily implemented in the tree structure.

  • For sliding window/attention sink, you can keep the metadata in the tree for matching and drop the GPU tensors for KV cache. You need to save the metadata anyway.
  • For multiple LoRA adapters, insert the ID of the LoRA adapter before the first token. Then it is done.
  • For image/video tokens, use the hash of the image/video as the key. It is already implemented in SGLang.
  • For advanced scheduling/eviction policy, having a tree structure can actually give you more information. You may want to evict a whole subtree or prioritize requests under a subtree. Having access to the tree structure makes the advanced scheduling policy possible.
  • With a radix tree, you can make matching very efficient; see Optimize radix tree matching (sgl-project/sglang#364).

In SGLang, we use TokenAttention (block_size/page_size = 1), which also simplifies many things. With optimized FlashInfer kernels, it achieves the same performance as larger block sizes.

In LMDeploy TurboMind, we've implemented the Automatic Prefix Cache by incorporating both HashTable and RadixTree methods. The overall implementation is very straightforward, requiring no modifications to the kernel while achieving good compatibility with the existing framework features. Currently, since TurboMind does not support Sliding Window and LoRA, we have no plans for a second-phase optimization at this time. InternLM/lmdeploy#1450

@josephrocca

josephrocca commented Jun 7, 2024

Are there any experimental results showing how prefix caching improves latency and throughput?

@gawainx Big speedup in my testing with vLLM on 2x4090s running a 70B model. You just need to make sure it's compatible with your load balancer (if any) so requests get preferentially routed to the correct machine.

RE @zhyncs' comment above, I'm seeing a massive speedup in my testing of LMDeploy, in part because it allows 8-bit and 4-bit KV caching in conjunction with prefix caching. Currently vLLM does not allow quantized cache in conjunction with prefix caching (nor does it allow chunked prefill with prefix caching, as an aside), so the cache can only store about eight 2000-token prefixes, vs about 32 in LMDeploy (with 4-bit cache).

With LMDeploy, for a 2000-token prompt + 100 tokens of output, if the prompt is cached, and limiting concurrency so that the maximum time-to-first-token is 3 seconds for all requests, end-to-end generated tokens/sec summed across all concurrent requests goes from ~10 to ~300 (!!) for a 70B Llama 2 model. For comparison, vLLM goes from ~10 to ~100 under the same assumptions. In a more realistic scenario (i.e. where not all prompts are cached), I'm expecting closer to a 4x speedup from LMDeploy, which is amazing. Definitely worth testing for your use case! I think all LMDeploy needs is to fix the stop param and add vLLM's include_stop_str_in_output, and it will be ready to deploy for most use cases.

@AnaRhisT94

Are there any experimental results showing how prefix caching improves latency and throughput?

@gawainx Big speedup in my testing with vLLM on 2x4090s running a 70B model. You just need to make sure it's compatible with your load balancer (if any) so requests get preferentially routed to the correct machine.

RE @zhyncs' comment above, I'm seeing a massive speedup in my testing of LMDeploy, in part because it allows 8-bit and 4-bit KV caching in conjunction with prefix caching. Currently vLLM does not allow quantized cache in conjunction with prefix caching (nor does it allow chunked prefill with prefix caching, as an aside), so the cache can only store about eight 2000-token prefixes, vs about 32 in LMDeploy (with 4-bit cache).

With LMDeploy, for a 2000-token prompt + 100 tokens of output, if the prompt is cached, and limiting concurrency so that the maximum time-to-first-token is 3 seconds for all requests, end-to-end generated tokens/sec summed across all concurrent requests goes from ~10 to ~300 (!!) for a 70B Llama 2 model. For comparison, vLLM goes from ~10 to ~100 under the same assumptions. In a more realistic scenario (i.e. where not all prompts are cached), I'm expecting closer to a 4x speedup from LMDeploy, which is amazing. Definitely worth testing for your use case! I think all LMDeploy needs is to fix the stop param and add vLLM's include_stop_str_in_output, and it will be ready to deploy for most use cases.

Is it hard to implement FP8 KV caching in vLLM? Can't they just take LMDeploy's implementation?

kzawora-intel pushed a commit to HabanaAI/vllm-fork that referenced this issue Oct 29, 2024
This PR enables automatic prefix caching on Intel Gaudi HPUs. Please refer to this [RFC](vllm-project#2614) for detailed information about prefix caching.