Add Automatic Prefix Caching #2762
Conversation
vllm/worker/model_runner.py (outdated diff)
    prefix_block_tables.append(prefix.get_block_numbers())
else:
    prefix_block_tables.append([])
@SageMoore Was thinking more about this. Is there a separate spot in the code where we propagate the information about which prefixes were found in the hash table?
In the manual prefix caching case, the prefix is passed by the user, and we set the context_lens (i.e. the length of the prefix) based on it. This metadata is then used by the model to run forward only on the new input_tokens and to use the kv_caches of the cached prefix during attention (e.g. by calling context_attention_fwd):
vllm/model_executor/layers/attention.py, line 165 at 063d2fb: context_attention_fwd(...)
I wasn't sure if there is a separate spot in the code where this logic sits.
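(For readers following the thread, here is a rough sketch of the manual-prefix-caching flow being described. This is illustrative pseudocode only; prefill_with_cached_prefix, context_len, and cached_kv are invented names, not vLLM's actual interfaces.)

def prefill_with_cached_prefix(model, prompt_token_ids, context_len, cached_kv):
    """Prefill when the first `context_len` tokens already have cached KVs.

    Illustrative sketch only; not vLLM's real call signatures.
    """
    # Only the tokens after the cached prefix need a forward pass.
    new_token_ids = prompt_token_ids[context_len:]
    # Attention still reads the cached prefix KVs plus the KVs produced for
    # the new tokens (the role context_attention_fwd plays in the manual
    # prefix-caching path).
    return model.forward(input_ids=new_token_ids, past_key_values=cached_kv)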
This should be addressed now, thanks for catching.
Related PR here: #2511
LGTM! Thanks for the great contribution! I will submit some small style fixes in a separate PR.
I have a probably dumb question:
I've been using vLLM with the belief that it already does automatic prefix caching between input prompts, but from your CR it apparently doesn't. What does the paper actually suggest then?
In the paper, they were talking about beam search or generating multiple samples for the same prompt, i.e. KV sharing within a single request. This diff caches KVs automatically and shares them ACROSS REQUESTS.
def hash_of_block(self, logical_idx: int) -> int:
    # Compute the number of tokens in the sequence
    num_tokens = self.num_hashed_tokens_of_block(logical_idx)
    return hash(tuple(self.data.get_token_ids()[0:num_tokens]))
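For illustration, here is a toy example of how this hashing behaves (ToySeq and BLOCK_SIZE are made up for the example, not taken from the diff):

BLOCK_SIZE = 4  # tokens per logical block (toy value)

class ToySeq:
    """Minimal stand-in for Sequence, just to show the prefix-hash behavior."""

    def __init__(self, token_ids):
        self.token_ids = token_ids

    def num_hashed_tokens_of_block(self, logical_idx: int) -> int:
        # All tokens up to and including block `logical_idx`.
        return (logical_idx + 1) * BLOCK_SIZE

    def hash_of_block(self, logical_idx: int) -> int:
        num_tokens = self.num_hashed_tokens_of_block(logical_idx)
        return hash(tuple(self.token_ids[:num_tokens]))

a = ToySeq([1, 2, 3, 4, 5, 6, 7, 8])
b = ToySeq([1, 2, 3, 4, 9, 9, 9, 9])
assert a.hash_of_block(0) == b.hash_of_block(0)  # identical first block -> same hash
assert a.hash_of_block(1) != b.hash_of_block(1)  # prefixes diverge -> different hash

Because each block's hash covers the entire prefix up to and including that block, the same token ids appearing at a different position produce a different hash.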
Could you please explain why the hash of a block does not take its position into account? Multiple prompts assembled from the same input-id blocks have different meanings, because positional embeddings are applied.
QQ: Is there a scheduling policy based on longest-prefix match, just like SGLang? Thanks!
@SageMoore will this also cache multi-turn/conversation queries? Assuming the first request is A and that vllm generates B, if the second request is A+B+C, will A+B already be cached?
resolves #2614
The goal of this diff is to allow for automatic prefix caching. This is done by adding an additional level of indirection between logical and physical blocks, which allows identical logical blocks to map to the same physical block.
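Roughly, the extra indirection looks like this (an illustrative sketch; the names below are simplified stand-ins, not the classes in this diff):

from dataclasses import dataclass

@dataclass
class PhysicalBlock:
    block_number: int
    ref_count: int = 0

# hash of block contents -> shared physical block
cached_blocks: dict[int, PhysicalBlock] = {}

def map_logical_block(block_hash: int, next_free_number: int) -> PhysicalBlock:
    """Resolve a logical block to a physical one; identical hashes share."""
    if block_hash not in cached_blocks:
        cached_blocks[block_hash] = PhysicalBlock(block_number=next_free_number)
    block = cached_blocks[block_hash]
    block.ref_count += 1
    return block

# Two sequences whose first logical block has the same contents end up on
# the same physical block, so its KV cache is computed only once.
seq_a_block = map_logical_block(block_hash=12345, next_free_number=0)
seq_b_block = map_logical_block(block_hash=12345, next_free_number=1)
assert seq_a_block is seq_b_block and seq_a_block.ref_count == 2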
This diff replaces the existing manual prefix caching mechanism added in #1669
Before: (diagram omitted)
After: (diagram omitted)
The BlockAllocator class now contains a hash table that maps to full PhysicalTokenBlocks which have already been computed and can be read by multiple Sequences in parallel. This table is consulted when PhysicalTokenBlocks are allocated: if the caller provides a hash value and that value is in the table, allocate will return the cached block instead of making a new one. If no hash value is passed into allocate, a "unique" block is generated using a timestamp as the hash value. This is primarily used to allocate new partial blocks.
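A simplified sketch of that allocation path (not the real BlockAllocator; SketchBlockAllocator and its internals are invented for illustration):

import time
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Block:
    block_number: int
    ref_count: int = 0

class SketchBlockAllocator:
    """Illustrative hash-keyed allocator, not the class from this diff."""

    def __init__(self, num_blocks: int):
        self.free_block_numbers = list(range(num_blocks))
        self.cached_blocks: Dict[int, Block] = {}  # content hash -> block

    def allocate(self, block_hash: Optional[int] = None) -> Block:
        if block_hash is None:
            # Partial block: use a timestamp-derived hash so it is treated
            # as unique and never collides with cached full blocks.
            block_hash = hash(time.monotonic_ns())
        if block_hash in self.cached_blocks:
            # Cache hit: return the already-computed block.
            block = self.cached_blocks[block_hash]
            block.ref_count += 1
            return block
        if not self.free_block_numbers:
            # Eviction (sketched further below) would free a block here.
            raise RuntimeError("out of blocks")
        block = Block(self.free_block_numbers.pop(), ref_count=1)
        self.cached_blocks[block_hash] = block
        return block

Calling allocate(some_hash) twice with the same hash returns the same block with ref_count == 2 in this sketch, while allocate() with no hash always produces a fresh block.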
The hash value passed into BlockAllocator is computed by the Sequence class. The Sequence's hash method takes in a logical block index and uses all tokens leading up to and including that block to compute a unique hash.
This caching system does not currently work for partial blocks, but there is a mechanism inside the BlockSpaceManager class that will "promote" a partial block to a cacheable full block when the partial block fills up. At this point the BlockSpaceManager will use the sequence to compute the hash, and that hash will be added to the BlockAllocator's table, making the block usable by other sequences.
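In sketch form, the promotion step amounts to something like this (maybe_promote_to_cached and the parameter names are invented for illustration):

BLOCK_SIZE = 16  # tokens per block (illustrative value)

def maybe_promote_to_cached(cached_blocks: dict, seq, logical_idx: int,
                            block, num_tokens_in_block: int) -> None:
    """Once a partial block fills up, register it under its prefix hash so
    other sequences can reuse it (simplified sketch)."""
    if num_tokens_in_block == BLOCK_SIZE:
        full_hash = seq.hash_of_block(logical_idx)  # hash of the whole prefix
        # From here on, allocating with `full_hash` returns this same
        # physical block instead of recomputing its KV cache.
        cached_blocks[full_hash] = block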
There is an eviction system as well to manage PhysicalTokenBlocks coming in and out of the cache. Eviction is triggered on allocation when there are no more available PhysicalTokenBlocks. Only PhysicalTokenBlocks with a ref count of 0 are eligible for eviction. PhysicalTokenBlocks with a ref count of 0 can be "brought back" since they are not removed from the hash table until they are evicted.
The eviction policy has two "levels" to it. The first level is Least Recently Used. A timestamp is maintained inside of each PhysicalTokenBlock that denotes when that block was last used. The eviction function simply finds the oldest one and removes it from the cache. In the case where there are multiple PhysicalTokenBlocks that have the same last accessed time, the eviction function falls back to looking at the number of prefix tokens in that block. The PhysicalTokenBlock with the highest number of prefix tokens will be evicted first. If there are multiple blocks with the same number of prefix tokens, one is arbitrarily chosen.
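Put together, the eviction choice can be summarized like this (a sketch assuming each cached block tracks last_accessed and num_hashed_tokens; the real evictor code may differ):

from dataclasses import dataclass
from typing import List

@dataclass
class CachedBlock:
    block_number: int
    ref_count: int
    last_accessed: float    # timestamp of last use
    num_hashed_tokens: int  # prefix tokens covered by this block

def pick_eviction_victim(blocks: List[CachedBlock]) -> CachedBlock:
    """LRU first; ties broken by evicting the block covering the MOST
    prefix tokens; any remaining ties are broken arbitrarily."""
    candidates = [b for b in blocks if b.ref_count == 0]
    if not candidates:
        raise RuntimeError("No evictable blocks: all are still referenced.")
    candidates.sort(key=lambda b: (b.last_accessed, -b.num_hashed_tokens))
    return candidates[0]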