Vllm v1 eagle proposer #15346
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀
    self, *, target_model_input_ids: Tensor,
    target_model_positions: Tensor, target_model_hidden_states: Tensor,
    target_model_seq_lens: list[int],
    sampled_token_ids: list[list[int]],
    next_prompt_token_ids: list[list[int]], is_prefill: list[bool],
    num_draft_tokens_to_propose: int,
    attention_metadata: FlashAttentionMetadata) -> list[SamplerOutput]:
nit: please append a `,` after the last input parameter and re-run the formatter, so that each input parameter takes its own line. I really recommend this because otherwise adding/removing an input parameter can change the formatting again.
Suggested change:

    self,
    *,
    target_model_input_ids: Tensor,
    target_model_positions: Tensor,
    target_model_hidden_states: Tensor,
    target_model_seq_lens: list[int],
    sampled_token_ids: list[list[int]],
    next_prompt_token_ids: list[list[int]],
    is_prefill: list[bool],
    num_draft_tokens_to_propose: int,
    attention_metadata: FlashAttentionMetadata,
) -> list[SamplerOutput]:
    Generates speculative draft token IDs using the Eagle model.

    This function aligns the Eagle model's KV cache with the target
    model’s output before generating speculative tokens. It first
Suggested change:
- model’s output before generating speculative tokens. It first
+ model's output before generating speculative tokens. It first
LiuXiaoxuanPKU left a comment
Left some early partial comments, will finish one pass tomorrow.
    """
    self._vllm_config = vllm_config
    self._model = model
    self._sampling_metadata = sampling_metadata
Why is this a field of the Eagle proposer? Shouldn't it instead be passed in at every proposing step?
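To illustrate the alternative the reviewer seems to suggest, here is a hypothetical sketch (class and method names are illustrative, not the PR's actual API) where `sampling_metadata` is supplied per `propose()` call rather than cached on the proposer at construction time:

```python
# Hypothetical sketch: sampling_metadata can change between generation
# steps, so it is passed into each propose() call instead of being
# stored as a field in __init__.
class EagleProposer:
    def __init__(self, vllm_config, model):
        self._vllm_config = vllm_config
        self._model = model

    def propose(self, sampling_metadata, **inputs):
        # sampling_metadata is supplied per step, not read from self
        return self._run(sampling_metadata, inputs)

    def _run(self, sampling_metadata, inputs):
        # placeholder for the real drafting logic
        return sampling_metadata, inputs
```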
    Tokens: [T12, T13, T14, T15, T22, T23, T24, T32]
    Positions: [0, 1, 2, 3, 9, 10, 11, 44]
    Previous Hidden States: [H11, H12, H13, H14, H21, H22, H23, H31]
    Sampled Tokens: [[T16], [T25], [T33']]
Minor: why is this example different from the input example above? Maybe just use the same example?
It is the same example IIUC. The above one shows all the information including inputs and outputs, while this part only shows the inputs.
    Note that for S1, we drop T11 (position 0). For S2 and S3,
    T21 and T31 are skipped since they were processed earlier.
    Eagle positions are always one less than the target model
    due to dropping the first token.
Suggested change:
- due to dropping the first token.
+ due to dropping the first token. For example, T12 has position 1 when running the target model, while its position is 0 when running the Eagle head.
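The position shift being discussed can be sketched in a few lines of Python (illustrative only; the function name and input layout are assumptions, not the PR's code):

```python
# Illustrative sketch: the Eagle head drops each sequence's first token,
# so every remaining token's position is one less than in the target model.
def eagle_inputs(target_tokens, target_positions):
    # target_tokens / target_positions are per-sequence lists
    # from the target model's forward pass.
    tokens = [seq[1:] for seq in target_tokens]               # drop first token
    positions = [[p - 1 for p in seq[1:]] for seq in target_positions]
    return tokens, positions
```

For the S1 example above, T12 (target position 1) becomes the first Eagle token at position 0.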
    model.
    target_model_hidden_states: Hidden states from the target model.
    target_model_seq_lens: Sequence lengths in the target model.
    sampled_token_ids: Previously sampled/accepted tokens from the
Suggested change:
- sampled_token_ids: Previously sampled/accepted tokens from the
+ sampled_token_ids: Generated tokens from the previous generation step.
    next_prompt_token_ids: The next prompt token for a sequence if it
    is a partial prefill sequence and empty otherwise.
    is_prefill: Boolean flags indicating prefill sequences.
    num_draft_tokens_to_propose: Number of speculative tokens to
A shorter name? num_spec_tokens?
    # Determine expected sequence lengths in the Eagle model:
    # - For prefill sequences, lengths remain unchanged.
    # - For decoding sequences, lengths match the number of
    #   accepted tokens.
What is expected sequence length? A bit more context?
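Reading the comment literally, the semantics seem to be: prefill sequences keep their target-model length, while decode sequences shrink to the number of tokens accepted in the last step. A sketch under that assumption (function name and argument layout are hypothetical, not the PR's code):

```python
# Assumed semantics of "expected sequence length" in the Eagle model:
# prefill -> same length as in the target model;
# decode  -> number of accepted tokens from the previous step.
def expected_eagle_seq_lens(target_seq_lens, sampled_token_ids, is_prefill):
    lens = []
    for tgt_len, sampled, prefill in zip(
            target_seq_lens, sampled_token_ids, is_prefill):
        lens.append(tgt_len if prefill else len(sampled))
    return lens
```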
    Tokens: [T11, T12, T13, T14, T21, T22, T23, T31, T32, T33]
    Positions: [0, 1, 2, 3, 9, 10, 11, 44, 45, 46]
    Hidden States: [H11, H12, H13, H14, H21, H22, H23, H31, H32, H33]
    Sampled Tokens: [[T15], [], [T32]]
I actually don't get this. Where is T33 from?
Do you mean this?
Input Tokens: [[T11, T12, T13, T14], [T21, T22, T23], [T31, T32 (draft), T33 (draft)]]
Sampled Tokens: [[T15], [], [T32' (recovered token)]]
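For context on how a "recovered token" like T32' arises, here is a minimal greedy-verification sketch (the PR may use probabilistic rejection sampling instead; names are illustrative): the target model verifies each draft token, and on the first mismatch it emits its own token in place of the rejected draft.

```python
# Greedy-matching sketch of draft verification in speculative decoding.
# draft_tokens:  tokens proposed by the draft (Eagle) model.
# target_tokens: the target model's token at each draft position, plus
#                one extra "bonus" token if every draft is accepted.
def split_accepted(draft_tokens, target_tokens):
    accepted = []
    for i, d in enumerate(draft_tokens):
        if d == target_tokens[i]:
            accepted.append(d)                 # draft matches: accept it
        else:
            accepted.append(target_tokens[i])  # mismatch: recovered token
            return accepted                    # everything after is discarded
    accepted.append(target_tokens[-1])         # all accepted: bonus token
    return accepted
```

In the example above, S3's draft T33 is rejected, so the step's output is the recovered token T32' alone.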
    target_model_positions: Tensor, target_model_hidden_states: Tensor,
    target_model_seq_lens: list[int],
    sampled_token_ids: list[list[int]],
    next_prompt_token_ids: list[list[int]], is_prefill: list[bool],
What does is_prefill mean?
As you know, there's no "prefill" in V1.
how to load model by TP for Eagle
No description provided.