Vllm v1 eagle proposer #15346
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀
    self, *, target_model_input_ids: Tensor,
    target_model_positions: Tensor, target_model_hidden_states: Tensor,
    target_model_seq_lens: list[int],
    sampled_token_ids: list[list[int]],
    next_prompt_token_ids: list[list[int]], is_prefill: list[bool],
    num_draft_tokens_to_propose: int,
    attention_metadata: FlashAttentionMetadata) -> list[SamplerOutput]:
nit: please append a `,` after the last input parameter and re-run the formatter, so that each input parameter takes its own line. I really recommend this because otherwise adding/removing an input parameter can change the formatting again.
Suggested change:

    self,
    *,
    target_model_input_ids: Tensor,
    target_model_positions: Tensor,
    target_model_hidden_states: Tensor,
    target_model_seq_lens: list[int],
    sampled_token_ids: list[list[int]],
    next_prompt_token_ids: list[list[int]],
    is_prefill: list[bool],
    num_draft_tokens_to_propose: int,
    attention_metadata: FlashAttentionMetadata,
) -> list[SamplerOutput]:
    Generates speculative draft token IDs using the Eagle model.

    This function aligns the Eagle model's KV cache with the target
    model’s output before generating speculative tokens. It first
Suggested change:
- model’s output before generating speculative tokens. It first
+ model's output before generating speculative tokens. It first
LiuXiaoxuanPKU left a comment
Left some early partial comments, will finish one pass tomorrow.
    """
    self._vllm_config = vllm_config
    self._model = model
    self._sampling_metadata = sampling_metadata
Why is this a field of the Eagle proposer? Shouldn't it instead be passed in at every proposing step?
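To illustrate the alternative the reviewer seems to suggest, here is a hypothetical sketch (class and method names are illustrative, not the PR's actual API) where `sampling_metadata` is supplied per `propose()` call rather than cached on the proposer at construction time:

```python
# Hypothetical sketch: sampling_metadata can change between generation
# steps, so it is passed into each propose() call instead of being
# stored as a field in __init__.
class EagleProposer:
    def __init__(self, vllm_config, model):
        self._vllm_config = vllm_config
        self._model = model

    def propose(self, sampling_metadata, **inputs):
        # sampling_metadata is supplied per step, not read from self
        return self._run(sampling_metadata, inputs)

    def _run(self, sampling_metadata, inputs):
        # placeholder for the real drafting logic
        return sampling_metadata, inputs
```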
    Tokens: [T12, T13, T14, T15, T22, T23, T24, T32]
    Positions: [0, 1, 2, 3, 9, 10, 11, 44]
    Previous Hidden States: [H11, H12, H13, H14, H21, H22, H23, H31]
    Sampled Tokens: [[T16], [T25], [T33']]
Minor: why is this example different from the input example above? Maybe just use the same example?
It is the same example IIUC. The above one shows all the information including inputs and outputs, while this part only shows the inputs.
    Note that for S1, we drop T11 (position 0). For S2 and S3,
    T21 and T31 are skipped since they were processed earlier.
    Eagle positions are always one less than the target model
    due to dropping the first token.
Suggested change:
- due to dropping the first token.
+ due to dropping the first token. For example, T12 has position 1 when running the target model, while its position is 0 when running the Eagle head.
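The position shift being discussed can be sketched in a few lines of Python (illustrative only; the function name and input layout are assumptions, not the PR's code):

```python
# Illustrative sketch: the Eagle head drops each sequence's first token,
# so every remaining token's position is one less than in the target model.
def eagle_inputs(target_tokens, target_positions):
    # target_tokens / target_positions are per-sequence lists
    # from the target model's forward pass.
    tokens = [seq[1:] for seq in target_tokens]               # drop first token
    positions = [[p - 1 for p in seq[1:]] for seq in target_positions]
    return tokens, positions
```

For the S1 example above, T12 (target position 1) becomes the first Eagle token at position 0.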
    model.
    target_model_hidden_states: Hidden states from the target model.
    target_model_seq_lens: Sequence lengths in the target model.
    sampled_token_ids: Previously sampled/accepted tokens from the
Suggested change:
- sampled_token_ids: Previously sampled/accepted tokens from the
+ sampled_token_ids: Generated tokens from the previous generation step.
    next_prompt_token_ids: The next prompt token for a sequence if it
    is a partial prefill sequence and empty otherwise.
    is_prefill: Boolean flags indicating prefill sequences.
    num_draft_tokens_to_propose: Number of speculative tokens to
A shorter name? num_spec_tokens?
    # Determine expected sequence lengths in the Eagle model:
    # - For prefill sequences, lengths remain unchanged.
    # - For decoding sequences, lengths match the number of
    #   accepted tokens.
What is expected sequence length? A bit more context?
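Reading the comment literally, the semantics seem to be: prefill sequences keep their target-model length, while decode sequences shrink to the number of tokens accepted in the last step. A sketch under that assumption (function name and argument layout are hypothetical, not the PR's code):

```python
# Assumed semantics of "expected sequence length" in the Eagle model:
# prefill -> same length as in the target model;
# decode  -> number of accepted tokens from the previous step.
def expected_eagle_seq_lens(target_seq_lens, sampled_token_ids, is_prefill):
    lens = []
    for tgt_len, sampled, prefill in zip(
            target_seq_lens, sampled_token_ids, is_prefill):
        lens.append(tgt_len if prefill else len(sampled))
    return lens
```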
    Tokens: [T11, T12, T13, T14, T21, T22, T23, T31, T32, T33]
    Positions: [0, 1, 2, 3, 9, 10, 11, 44, 45, 46]
    Hidden States: [H11, H12, H13, H14, H21, H22, H23, H31, H32, H33]
    Sampled Tokens: [[T15], [], [T32]]
I actually don't get this. Where is T33 from?
Do you mean this?
Input Tokens: [[T11, T12, T13, T14], [T21, T22, T23], [T31, T32 (draft), T33 (draft)]]
Sampled Tokens: [[T15], [], [T32' (recovered token)]]
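For context on how a "recovered token" like T32' arises, here is a minimal greedy-verification sketch (the PR may use probabilistic rejection sampling instead; names are illustrative): the target model verifies each draft token, and on the first mismatch it emits its own token in place of the rejected draft.

```python
# Greedy-matching sketch of draft verification in speculative decoding.
# draft_tokens:  tokens proposed by the draft (Eagle) model.
# target_tokens: the target model's token at each draft position, plus
#                one extra "bonus" token if every draft is accepted.
def split_accepted(draft_tokens, target_tokens):
    accepted = []
    for i, d in enumerate(draft_tokens):
        if d == target_tokens[i]:
            accepted.append(d)                 # draft matches: accept it
        else:
            accepted.append(target_tokens[i])  # mismatch: recovered token
            return accepted                    # everything after is discarded
    accepted.append(target_tokens[-1])         # all accepted: bonus token
    return accepted
```

In the example above, S3's draft T33 is rejected, so the step's output is the recovered token T32' alone.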
    target_model_positions: Tensor, target_model_hidden_states: Tensor,
    target_model_seq_lens: list[int],
    sampled_token_ids: list[list[int]],
    next_prompt_token_ids: list[list[int]], is_prefill: list[bool],
What does is_prefill mean?
As you know, there's no "prefill" in V1.
how to load model by TP for Eagle
No description provided.