[V1][Spec Decode] KV cache slots for eagle heads #16370
Conversation
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
vllm/v1/core/sched/scheduler.py
Outdated
    num_new_tokens,
    num_spec_tokens=self.num_spec_tokens)
My understanding is that num_new_tokens is needed for the verification of the spec token ids from the previous step by the target model, and num_spec_tokens is the number of spec tokens that the draft model is supposed to generate at the end of this step.
Based on that, if num_new_tokens is 8 and num_spec_tokens is 4, we can end up allocating 1 block (16 tokens) such that the block holds both the target model's and the draft model's KV cache?
My understanding is similar. My interpretation is that it temporarily acquires an extra num_spec_tokens slots for the draft tokens, and it won't accumulate that size in the next iteration.
If the same blocks are shared by the target and draft models, will that not be an issue? The KV caches of the target and draft model are adjacent in the logical mapping of the block tables, so the draft model will attend to the target's KV cache?
Emmm, I think it should not cause a problem as long as the actual starting KV cache slot for the draft model is marked somehow?
Thanks for the discussion here!
- num_new_tokens is for verification, num_spec_tokens is for the proposing heads. "Based on that, if num_new_tokens is 8 and num_spec_tokens is 4, we can end up allocating 1 block (16 tokens) such that the block holds both the target model's and the draft model's KV cache?" --> yes, exactly.
- KV cache corruption: currently, the KV cache is allocated independently per layer but shares the same slot mapping. We can think of the KV cache as a map: {layer0_kv: [], layer1_kv: [], ..., layerk_kv: [], eagle_layer_kv: []}. During each generation step, using the example above, we first verify tokens, which writes KV to layer0_kv...layerk_kv with slot mapping [0, 1, 2, 3, 4, 5, 6, 7]; it does not write to the draft KV. If, say, only 2 tokens are accepted, then 3 tokens are generated. In the proposing phase, we send those three tokens to the eagle proposer with slot mapping [0, 1, 2], which populates the KV cache for the generated tokens and also proposes the next token. We allocate 12 slots (8 + 4) in total because it is possible that all tokens (with slot ids 0-7) are accepted; in that case, the proposing tokens need to write to the KV cache at ids [8, 9, 10, 11].
Let me know if there is any confusion here!
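To spell out the example, here is a minimal standalone sketch (hypothetical numbers and variable names, not the actual vLLM code):

# Hypothetical walk-through of the slot accounting described above,
# assuming 8 tokens to verify and 4 draft (lookahead) tokens.
num_new_tokens = 8    # tokens the target model must verify this step
num_spec_tokens = 4   # draft tokens the eagle head will propose this step

# Reserve slots for the worst case: all 8 tokens accepted, so the eagle
# head writes its KV at slots 8..11.
allocated_slots = list(range(num_new_tokens + num_spec_tokens))  # [0, ..., 11]

# Verification pass: target layers write KV at the first 8 slots.
verify_slot_mapping = allocated_slots[:num_new_tokens]           # [0, ..., 7]

# Suppose only 2 spec tokens are accepted; 3 tokens move forward
# (2 accepted + 1 newly sampled), so the proposer writes at slots [0, 1, 2].
num_accepted = 2
propose_slot_mapping = allocated_slots[:num_accepted + 1]        # [0, 1, 2]

print(verify_slot_mapping)
print(propose_slot_mapping)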
I think it makes sense now. The block_table where the allocated slots get saved is shared across all layers, and eagle is just a layer on top of the target model's layers. When we add blocks for num_new_tokens + num_spec_tokens, the target model will use just the num_new_tokens slots, but in the case when all the drafts are accepted, the draft layer will use the num_new_tokens + num_spec_tokens slots.
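For reference, the block arithmetic for this example, assuming a block size of 16 as in the discussion (cdiv here is a hypothetical ceiling-division helper, not imported from vLLM):

from math import ceil

def cdiv(a: int, b: int) -> int:
    return ceil(a / b)

block_size = 16       # assumed page size from the discussion
num_new_tokens = 8
num_spec_tokens = 4

# 8 verification slots + 4 lookahead slots fit in a single 16-token block,
# so that block ends up holding KV for both the target layers and the eagle layer.
assert cdiv(num_new_tokens + num_spec_tokens, block_size) == 1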
vllm/v1/core/kv_cache_manager.py
Outdated
      num_tokens: int,
-     new_computed_blocks: Optional[list[KVCacheBlock]] = None
+     new_computed_blocks: Optional[list[KVCacheBlock]] = None,
+     num_spec_tokens: int = 0,
I have two points to discuss:
- Should we use "num_lookahead_tokens" to reduce confusion? After all, these slots are for the proposed tokens that will be verified in the next step.
- Should we consider these slots together with the preallocated blocks? Specifically, if the preallocated blocks can already cover the spec tokens, then we don't need to allocate additional slots?
I have seen this term lookahead_tokens before. Can you share why this is more general than spec_tokens? Is it because it can also mean jump tokens?
No, jump tokens should be in new_tokens. I just feel num_spec_tokens is confusing because it actually means the spec tokens we're going to propose by the end of this step. However, we also have spec_tokens in Request, but those spec_tokens were generated by the last step for verification.
+1 to @comaniac. I have the same two questions.
- I am good with num_lookahead_tokens, will change it here.
- Yeah sure, we can do it in a more conservative way: preallocated_blocks -= num_lookahead_tokens // block_size
preallocated_blocks -= num_lookahead_tokens // block_size
We might have to revert this when the number of draft tokens becomes large, especially with tree attention, since then the number of draft tokens ~= the number of preallocated tokens, which would lead to frequent block allocations.
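A rough illustration of that concern, assuming hypothetical defaults of block_size=16 and 4 preallocated blocks (not necessarily the real configuration):

block_size = 16              # assumed defaults, for illustration only
num_preallocate_blocks = 4   # i.e. 64 preallocated token slots

def adjusted_preallocate(num_lookahead_tokens: int) -> int:
    # the conservative adjustment suggested above
    return max(0, num_preallocate_blocks - num_lookahead_tokens // block_size)

print(adjusted_preallocate(4))    # 4: a small K leaves the preallocation untouched
print(adjusted_preallocate(64))   # 0: a large draft tree consumes the whole budget,
                                  #    so every step needs fresh block allocations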
    num_required_blocks = cdiv(
        num_computed_tokens + num_tokens + num_spec_tokens,
        self.block_size)
@luyuzhe111 @wwl2755 - Moving the discussion of why this PR is expected to improve the AL.
I have a hypothesis. Without this PR, the queries in the draft can go out of bounds in the block_table and pick up an incorrect address and value, which would corrupt the answer. block_table is used in the FA CUDA kernels, and maybe we don't check for illegal memory address access there.
Let's say the page size is 16. This corruption will arise when we have < K slots left in the last block. The preallocate-block computation (extra 4 blocks) won't trigger in this case since the last block is not full. As K increases, the chance of this increases, so K=4 has a higher chance of hitting it than K=2, which is reflected here.
But block_table is also gathered here to form the slot_mapping for queries, so an out-of-bounds index should have given an error, which it did not when using bs=1 with MTBench, so I am not sure if the above hypothesis is correct.
Lmk what you guys think.
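To make the hypothesis concrete, a small standalone sketch (hypothetical numbers and helper function, not vLLM code):

block_size = 16   # page size assumed in the hypothesis
K = 4             # draft tokens proposed per step

def overflowing_draft_slots(seq_len: int) -> int:
    # Slots already used in the last allocated block, assuming no extra block
    # was allocated for the drafts (the pre-PR behaviour being discussed).
    used_in_last_block = seq_len % block_size or block_size
    free_in_last_block = block_size - used_in_last_block
    return max(0, K - free_in_last_block)

print(overflowing_draft_slots(20))  # 0: 12 free slots, all drafts fit
print(overflowing_draft_slots(30))  # 2: only 2 free slots, so 2 draft writes fall
                                    #    outside the allocated blocks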
@WoosukKwon @LiuXiaoxuanPKU - can you also share your insight as to why this PR is expected to increase AL?
QQ: Is the statement "this PR can increase AL" already benchmarked, or is it set up as a goal of this PR?
From a high level: without this PR, the current scheduler does not actually allocate slots for the proposed tokens; it only allocates slots for verification. Therefore, it's not guaranteed that the KV cache of the proposing heads is not contaminated.
@LiuXiaoxuanPKU can you help us understand at a bit deeper level which code line would be at fault?
My understanding is that if the scheduler doesn't allocate slots for the proposed tokens, then torch should have thrown some error here when the newly proposed tokens become the query? However, it didn't happen in our MTBench benchmark, so probably there is no corruption without this PR?
Thanks for asking! It will not trigger an error here because block_table is always a tensor of shape [batch_size, max_num_blocks_per_request]; if those blocks are not allocated, the default values in the block table are 0.
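A standalone illustration of that behavior (hypothetical shapes and block ids, not the actual vLLM kernels):

import torch

block_size = 16
max_num_blocks_per_req = 8

# Only the first 2 logical blocks were actually allocated; the rest is padding (0).
block_table = torch.zeros(1, max_num_blocks_per_req, dtype=torch.int64)
block_table[0, :2] = torch.tensor([7, 3])   # hypothetical physical block ids

# A draft token whose position falls into the 3rd (unallocated) logical block:
position = 2 * block_size + 1
block_idx = position // block_size          # 2, never allocated
slot = block_table[0, block_idx] * block_size + position % block_size

# No IndexError is raised: the padding value 0 silently maps the write to
# physical block 0, which may belong to another request.
print(block_idx, slot.item())               # 2 1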
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Otherwise LGTM. Also it would be good to have a simple unit test to evaluate the allocated slots.
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
    num_tokens=3,
    num_lookahead_tokens=4,
)
assert len(blocks) == 2
It seems a little counterintuitive: when num_lookahead_tokens increases (which means we may need more slots), the number of allocated blocks decreases.
In test case 2, the num_lookahead_tokens do not use the slots of the preallocated tokens. Is there any particular reason why the design is like this?
Yeah, agreed. I feel it's a corner case: num_lookahead_tokens does not use slots from preallocate_tokens because we calculate at the block level, and 3 // 4 = 0 (lookahead tokens borrow 0 blocks from the preallocate tokens).
- when num_lookahead_tokens = 1, 2, 3: len(blocks) = 3
- when num_lookahead_tokens = 4: len(blocks) = 2
- when num_lookahead_tokens = 5, 6, 7, ...: len(blocks) = ceil((num_lookahead_tokens + 4) / 4) = 3 or bigger
number of required blocks = number of blocks required by lookahead slots + number of blocks required by computed tokens + number of preallocate blocks
number of preallocate blocks = max(0, Constant - number of blocks required by lookahead slots)
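Written out as a small sketch, following the cdiv form in the diff above, with hypothetical block_size=16 and 4 preallocate blocks (not the actual test parameters):

from math import ceil

def cdiv(a: int, b: int) -> int:
    return ceil(a / b)

BLOCK_SIZE = 16              # hypothetical values, for illustration only
NUM_PREALLOCATE_BLOCKS = 4   # the "Constant" above

def num_blocks_to_allocate(num_computed_tokens, num_tokens, num_lookahead_tokens):
    # blocks needed to hold computed + new + lookahead tokens
    blocks_for_tokens = cdiv(
        num_computed_tokens + num_tokens + num_lookahead_tokens, BLOCK_SIZE)
    # lookahead tokens borrow whole blocks from the preallocation budget
    preallocate = max(
        0, NUM_PREALLOCATE_BLOCKS - num_lookahead_tokens // BLOCK_SIZE)
    return blocks_for_tokens + preallocate

print(num_blocks_to_allocate(0, 8, 4))    # cdiv(12, 16) + max(0, 4 - 0) = 1 + 4 = 5
print(num_blocks_to_allocate(0, 8, 32))   # cdiv(40, 16) + max(0, 4 - 2) = 3 + 2 = 5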
I think with num_lookahead_tokens = 1, 2, 3, len(blocks) should still be 2?
IIUC, the num_lookahead_tokens can use up the space for preallocate_tokens. So, basically, before actually preallocating, we just need to check whether the last preallocate_blocks are already taken up by lookahead_tokens.
Maybe the pseudocode could be something like:
num_required_blocks_before_lookahead = cdiv(
    num_computed_tokens + num_tokens,
    self.block_size)
num_required_blocks = cdiv(
    num_computed_tokens + num_tokens + num_lookahead_tokens,
    self.block_size)
num_required_blocks_used_by_lookahead = (
    num_required_blocks - num_required_blocks_before_lookahead)
num_preallocate_blocks = max(
    0, self.num_preallocate_blocks - num_required_blocks_used_by_lookahead)
# if we find the preallocated blocks have been used up by lookahead,
# we don't need to further allocate them.
I benchmarked the AL using the same setup for K=2 and K=4 (result tables not captured here).
It was expected that this PR would close the gap between vLLM and the official EAGLE AL, but it seems the gap is still there. Please share your thoughts on this. cc: @LiuXiaoxuanPKU @luyuzhe111 @wwl2755 @WoosukKwon
Then my suspicion is that there are some implementation flaws? I will take a closer look starting from the existing tests, including the proposing, sampling, and rejection.
Hey @ekagra-ranjan @wwl2755, I actually ran the same AL benchmark the other day on MT Bench with max number of generated tokens = 256 (results not captured here).
One thing I observed when I ran the benchmark in this PR (#16367) locally was that the results were not consistent across runs. Sometimes K=4 gives 2.09, sometimes 2.10. Everything is default: temperature is 0, mt_bench is used, max_tokens is 256.
So I'm thinking this 0.01 difference may not be caused by this PR. That is to say, there should be gaps somewhere else. Example command:
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> Signed-off-by: Yang Wang <elainewy@meta.com>
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Task 2 of #15901
The current change only touches the KV cache manager; the scheduler only changes the way it calls allocate_slots. I have not tested this PR yet, but I find it a bit hard to test; comments are appreciated. cc @WoosukKwon