
Conversation

@LiuXiaoxuanPKU (Collaborator) commented Apr 10, 2025

Task 2 of #15901

This change only touches the KV cache manager; the scheduler only changes the way it calls allocate_slots.

I have not tested this PR yet, and I find it a bit hard to test; comments are appreciated. cc @WoosukKwon

Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Apr 10, 2025
Comment on lines 198 to 199
num_new_tokens,
num_spec_tokens=self.num_spec_tokens)
@ekagra-ranjan (Contributor) commented Apr 10, 2025

My understanding is that num_new_tokens is needed for verification by the target model of the spec token ids from the previous step, and num_spec_tokens is the number of spec tokens that the draft model is supposed to generate at the end of this step.

Based on that, if num_new_tokens is 8 and num_spec_tokens is 4, can we end up allocating 1 block (16 tokens) such that the block is shared by both the target model's and the draft model's KV cache?

Contributor:

My understanding is similar. My interpretation is that it temporarily acquires num_spec_tokens extra slots for the draft tokens, and the size won't keep accumulating in the next iteration.

@ekagra-ranjan (Contributor) commented Apr 10, 2025

If the same blocks are shared by the target and draft models, won't that be an issue? Since the KV cache of the target and draft models is adjacent in the logical mapping of the block tables, the draft model would attend to the target's KV cache.

Contributor:

Hmm, I think it should not cause a problem as long as the actual starting KV cache slot for the draft model is marked somehow?

Collaborator (author):

Thanks for the discussion here!

  1. num_new_tokens is for verification; num_spec_tokens is for the proposing heads. "Based on that, if num_new_tokens is 8 and num_spec_tokens is 4, can we end up allocating 1 block (16 tokens) such that the block is shared by both the target model's and the draft model's KV cache?" --> yes, exactly.
  2. KV cache corruption: currently, the KV cache of each layer is allocated independently, but all layers share the same slot mapping. We can think of the KV cache as a map: {layer0_kv: [], layer1_kv: [], ..., layerk_kv: [], eagle_layer_kv: []}. During each generation step, using the example above, we first verify tokens, which writes KV to layer0_kv...layerk_kv with slot mapping [0,1,2,3,4,5,6,7]; it does not write to the draft KV. Say only 2 tokens are accepted, so 3 tokens are generated (the accepted drafts plus one new token). In the proposing phase, we send those three tokens to the EAGLE proposer with slot mapping [0,1,2], which populates the draft KV cache for the generated tokens and also proposes the next tokens. We allocate 12 slots (8 + 4) in total because it is possible that all tokens (slot ids 0-7) are accepted, in which case the proposing step needs to write to the KV cache with slot ids [8,9,10,11] (see the sketch below).

Let me know if there is any confusion here!
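
To make the slot accounting concrete, here is a minimal sketch (plain Python, not vLLM code; the constant names are illustrative and block_size = 16 is assumed):

NUM_NEW_TOKENS = 8      # tokens the target model verifies this step
NUM_SPEC_TOKENS = 4     # tokens the EAGLE head may propose at the end of this step
BLOCK_SIZE = 16         # assumed block size

# Allocate enough slots for verification plus, in the worst case where every
# draft token is accepted, the draft head's own KV writes.
total_slots = NUM_NEW_TOKENS + NUM_SPEC_TOKENS       # 12
verify_slot_mapping = list(range(NUM_NEW_TOKENS))    # [0..7], written to target layers only

num_accepted = 2                                     # example verification outcome
num_generated = num_accepted + 1                     # accepted drafts + 1 new token
propose_slot_mapping = list(range(num_generated))    # [0, 1, 2], written to the EAGLE layer only

# Only if all 8 verified tokens were accepted would proposing need slots [8..11].
assert total_slots <= BLOCK_SIZE                     # here everything fits in one 16-token block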

@ekagra-ranjan (Contributor) commented Apr 10, 2025

I think it makes sense now. The block_table where the allocated slots get saved is shared across all layers, and EAGLE is just a layer on top of the target model's layers. When we add blocks for num_new_tokens + num_spec_tokens, the target model uses just the num_new_tokens slots, but in the case where all the drafts are accepted, the draft layer uses up to the full num_new_tokens + num_spec_tokens slots.

num_tokens: int,
new_computed_blocks: Optional[list[KVCacheBlock]] = None
new_computed_blocks: Optional[list[KVCacheBlock]] = None,
num_spec_tokens: int = 0,
Collaborator:

I have two points to discuss:

  1. Should we use "num_lookahead_tokens" to reduce confusion? After all, these slots are for the proposed tokens that will be verified in the next step.
  2. Should we consider these slots together with the preallocated blocks? Specifically, if the preallocated blocks can already cover the spec tokens, then we don't need to allocate additional slots.

Contributor:

I have seen the term lookahead_tokens before. Can you share why it is more general than spec_tokens? Is it because it can also mean jump tokens?

Collaborator:

No, jump tokens should be in new_tokens. I just feel num_spec_tokens is confusing because here it means the spec tokens we're going to propose by the end of this step, while we also have spec_tokens in Request, and those spec_tokens were generated in the last step for verification.

Collaborator:

+1 to @comaniac. I have the same two questions.

Collaborator (author):

  1. I am good with num_lookahead_tokens; will change it here.
  2. Yeah, sure, we can do it in a more conservative way (see the quick check below):
    preallocated_blocks -= num_lookahead_tokens // block_size
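
As a quick numeric check of that adjustment (a standalone sketch; block_size = 16 is assumed here for illustration):

block_size = 16  # assumed for illustration
for num_lookahead_tokens in (4, 8, 16, 64):
    borrowed = num_lookahead_tokens // block_size
    print(f"{num_lookahead_tokens} lookahead tokens borrow {borrowed} preallocated block(s)")
# Small draft budgets borrow nothing; large ones (e.g. with tree attention) start
# eating into the preallocated blocks, which is the concern raised in the next comment.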

Contributor:

preallocated_blocks -= num_lookahead_tokens // block_size

We might have to revert this when the number of draft tokens becomes large, especially with tree attention, since then the number of draft tokens ~= the number of preallocated tokens, which would lead to frequent block allocations.

Comment on lines 218 to 220
num_required_blocks = cdiv(
num_computed_tokens + num_tokens + num_spec_tokens,
self.block_size)
@ekagra-ranjan (Contributor) commented Apr 11, 2025

@luyuzhe111 @wwl2755 - moving the discussion here of why this PR is expected to improve the AL.

I have a hypothesis. Without this PR, the queries in the draft can go out of bounds in the block_table and pick up an incorrect address and value, which will corrupt the answer. block_table is used in the FA CUDA kernels, and maybe we don't check for illegal memory accesses there.

Let's say the page size is 16. This corruption would arise when there are < K slots left in the last block. The preallocate-block computation (extra 4 blocks) won't trigger in this case since the last block is not full. As K increases, the chance of this happening increases, so K=4 has a higher chance of hitting it than K=2, which is reflected here.

But then the block_table is also gathered here to form the slot_mapping for the queries, so an out-of-bounds index should have raised an error, which it did not when using bs=1 with MTBench, so I am not sure the above hypothesis is correct.

Lmk what you guys think.

Contributor:

@WoosukKwon @LiuXiaoxuanPKU - can you also share your insight as to why this PR is expected to increase AL?

@wwl2755 (Contributor) commented Apr 11, 2025

QQ: has the statement "this PR can increase AL" already been benchmarked, or is it set up as a goal of this PR?

@LiuXiaoxuanPKU (Collaborator, author) commented Apr 11, 2025

At a high level, without this PR the current scheduler does not actually allocate slots for the proposed tokens; it only allocates slots for verification. Therefore, it is not guaranteed that the KV cache of the proposing heads is not contaminated.

@ekagra-ranjan (Contributor) commented Apr 11, 2025

@LiuXiaoxuanPKU can you help us understand at a slightly deeper level, e.g. which code line would be at fault?

My understanding is that if the scheduler doesn't allocate slots for the proposed tokens, then torch should have thrown some error here when the newly proposed tokens become the query? However, that didn't happen in our MTBench benchmark, so probably there is no corruption without this PR?

Collaborator (author):

Thanks for asking! It will not trigger an error here because block_table is always a tensor of shape [batch_size, max_num_blocks_per_request]; if blocks are not allocated, their entries in the block table default to 0.
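
A minimal sketch (made-up shapes, not vLLM's actual tensors) of why the lookup stays silent: because the block table is a dense, zero-initialized tensor, gathering block ids for slots that were never allocated returns block 0 instead of raising an error:

import torch

batch_size, max_num_blocks_per_req, block_size = 2, 8, 16   # assumed shapes
block_table = torch.zeros(batch_size, max_num_blocks_per_req, dtype=torch.int64)
block_table[0, :2] = torch.tensor([7, 9])      # only two blocks actually allocated (32 slots)

token_positions = torch.tensor([33, 35])       # positions beyond the allocated 32 slots
block_ids = block_table[0, token_positions // block_size]   # tensor([0, 0]), no error raised
slot_mapping = block_ids * block_size + token_positions % block_size
print(slot_mapping)                            # tensor([1, 3]): writes silently land in block 0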

Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
@LiuXiaoxuanPKU added the ready (ONLY add when PR is ready to merge/full CI is needed) label Apr 12, 2025
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
@comaniac (Collaborator) left a comment

Otherwise LGTM. Also it would be good to have a simple unit test to evaluate the allocated slots.

Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
num_tokens=3,
num_lookahead_tokens=4,
)
assert len(blocks) == 2
Contributor:

It seems a little counterintuitive: when num_lookahead_tokens increases (which means we may need more slots), the number of allocated blocks decreases.

In test case 2, the num_lookahead_tokens does not use the slots for preallocate_tokens. Is there a particular reason the design is like this?

@LiuXiaoxuanPKU (Collaborator, author) commented Apr 13, 2025

Yeah, agreed. I feel it's a corner case: num_lookahead_tokens does not use slots for preallocate_tokens because we calculate at the block level, and 3 // 4 = 0 (the lookahead tokens borrow 0 blocks from the preallocate tokens).

when num_lookahead_tokens = 1, 2, 3: len(blocks) = 3
when num_lookahead_tokens = 4: len(blocks) = 2
when num_lookahead_tokens = 5, 6, 7, ...: len(blocks) = ceil((num_lookahead_tokens + 4) / 4) = 3 or bigger

number of required blocks = number of blocks required by lookahead slots + number of blocks required by computed tokens + number of preallocate blocks
number of preallocate blocks = max(0, Constant - number of blocks required by lookahead slots)

Contributor:

I think with num_lookahead_tokens = 1, 2, 3, len(blocks) should still be 2?

IIUC, num_lookahead_tokens can use up the space for preallocate_tokens. So, basically, before actually preallocating, we just need to check whether the last preallocate_blocks are already taken up by lookahead_tokens.

Maybe the pseudocode could be something like:

num_required_blocks_before_lookahead = cdiv(
    num_computed_tokens + num_tokens, self.block_size)
num_required_blocks = cdiv(
    num_computed_tokens + num_tokens + num_lookahead_tokens, self.block_size)
num_required_blocks_used_by_lookahead = (
    num_required_blocks - num_required_blocks_before_lookahead)

# If we find the preallocated blocks have been used up by lookahead tokens,
# we don't need to allocate them further.
num_preallocate_blocks = max(
    0, self.num_preallocate_blocks - num_required_blocks_used_by_lookahead)
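
For reference, a self-contained (non-vLLM) rendering of that pseudocode; block_size = 4 and num_preallocate_blocks = 1 are assumed values chosen only to make the borrowing behavior easy to poke at:

from math import ceil

def cdiv(a: int, b: int) -> int:
    return ceil(a / b)

def blocks_to_allocate(num_computed_tokens, num_tokens, num_lookahead_tokens,
                       block_size=4, num_preallocate_blocks=1):
    # Blocks needed with and without the lookahead slots.
    before_lookahead = cdiv(num_computed_tokens + num_tokens, block_size)
    required = cdiv(num_computed_tokens + num_tokens + num_lookahead_tokens, block_size)
    used_by_lookahead = required - before_lookahead
    # Lookahead slots borrow from the preallocation budget.
    preallocate = max(0, num_preallocate_blocks - used_by_lookahead)
    return required + preallocate

for k in range(1, 8):
    print(f"num_lookahead_tokens={k}: blocks={blocks_to_allocate(0, 3, k)}")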


@LiuXiaoxuanPKU LiuXiaoxuanPKU merged commit f49e5af into vllm-project:main Apr 13, 2025
42 checks passed
@ekagra-ranjan (Contributor) commented Apr 14, 2025

I benchmarked the AL using the same setup (NOTE: yuhuili/EAGLE-LLaMA3-Instruct-8B and lmsys/sglang-EAGLE-LLaMA3-Instruct-8B have identical AL). Combining this with the results from the official EAGLE repo by @luyuzhe111, we get:

K=2:

  • without this PR: 1.89
  • with this PR: 1.90
  • EAGLE official repo: 2.0

K=4:

  • without this PR: 2.08
  • with this PR: 2.09
  • EAGLE official repo: 2.25

It was expected that this PR would close the gap between vLLM's and the official EAGLE's AL, but it seems the gap is still there. Please share your thoughts on this. cc: @LiuXiaoxuanPKU @luyuzhe111 @wwl2755 @WoosukKwon

@wwl2755 (Contributor) commented Apr 16, 2025

It was expected that this PR would close the gap bw vLLM and official EAGLE AL but it seems the gap is still there. Please share your thoughts on this. cc: @LiuXiaoxuanPKU @luyuzhe111 @wwl2755 @WoosukKwon

Then my suspicion is that there are some implementation flaws? I will take a closer look, starting from the existing tests, including the proposing, sampling, and rejection.

@luyuzhe111 (Contributor) commented:

Hey @ekagra-ranjan @wwl2755,

I actually ran the same AL benchmark the other day for meta-llama/Llama-3.1-8B-Instruct and yuhuili/EAGLE-LLaMA3.1-Instruct-8B instead of Llama 3.0 8B. The results are shown below. Basically, the gap between the EAGLE repo and vLLM v1 is quite small now. I haven't figured out why the gap is more pronounced for the Llama 3.0 8B models, though.

On MT Bench

When the max number of generated tokens = 256:

Number of Speculated Tokens    1     2     3     4     5
EAGLE Repo                     1.71  2.13  2.32  2.42  2.48
vLLM v0                        1.70  2.06  2.20  2.29  2.32
vLLM v1                        1.71  2.09  2.30  2.38  2.43

@wwl2755 (Contributor) commented Apr 17, 2025

One thing I observed when I ran the benchmark from this PR (#16367) locally was that the results were not consistent between runs: sometimes K=4 gives 2.09, sometimes 2.10. Everything is default; temperature is 0, mt_bench is used, max_tokens is 256.

K=2:

  • without this PR: 1.89
  • with this PR: 1.90
  • EAGLE official repo: 2.0

K=4:

  • without this PR: 2.08
  • with this PR: 2.09
  • EAGLE official repo: 2.25

So I'm thinking this 0.01 difference may not be caused by this PR. That is to say, there should be gaps somewhere else.

Example command: VLLM_USE_V1=1 python examples/offline_inference/eagle.py --dataset /home/cc/vllm_benchmark_datatsets/question.jsonl --num_spec_tokens 4

cc: @ekagra-ranjan @luyuzhe111

yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Signed-off-by: Yang Wang <elainewy@meta.com>
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>