[WIP][V1][Spec Decode] EAGLE tree-attention #17560
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
```python
PADDING_SLOT_ID = -1

class TreeArray:
```
Expanding the tree in real time is going to be very costly. A dynamic tree in actual production could be less effective.
Good point! That's why I pre-allocate `max_nodes` in advance. The difference from chain drafting is that the tree is larger and more tokens are passed to each forward pass. The benefit can be a longer acceptance length, which could reduce the number of forward passes in the target model.
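(For illustration, a minimal sketch of the pre-allocation idea being described here — the class name, fields, and layout below are assumptions for this comment, not the PR's actual `TreeArray`:)

```python
import torch

class TreeArraySketch:
    """Hypothetical sketch of a draft-token tree with pre-allocated storage.

    All tensors are sized for max_nodes up front, so "expanding" the tree
    during drafting only writes into existing slots instead of growing
    Python lists or reallocating tensors each step.
    """

    def __init__(self, max_nodes: int, device: str = "cpu"):
        self.max_nodes = max_nodes
        # Token id held by each node; -1 marks an unused slot.
        self.token_ids = torch.full((max_nodes,), -1, dtype=torch.long, device=device)
        # Parent node index; -1 for the root.
        self.parents = torch.full((max_nodes,), -1, dtype=torch.long, device=device)
        # Depth of each node, later usable to derive positions.
        self.depths = torch.zeros(max_nodes, dtype=torch.long, device=device)
        self.num_nodes = 0

    def add_children(self, parent: int, child_tokens: torch.Tensor) -> torch.Tensor:
        """Attach top-k children under `parent`; returns their node indices."""
        k = child_tokens.numel()
        assert self.num_nodes + k <= self.max_nodes, "tree is full"
        idx = torch.arange(self.num_nodes, self.num_nodes + k)
        self.token_ids[idx] = child_tokens
        self.parents[idx] = parent
        self.depths[idx] = (self.depths[parent] + 1) if parent >= 0 else 0
        self.num_nodes += k
        return idx
```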
I understand that. We also need logits-shifting logic for sampling, and a dynamic tree in actual use might be less efficient at drafting. Maybe we could start with support for a static tree.
Yes, the sampling logic should change as well; it is not included yet. You can see a tracker of the progress here: https://docs.google.com/document/d/1mMoSicPPMMzaE_T5Zk2SnTderw1OXRUs2T16JxfVGCQ/edit?usp=sharing
And IMO, going from a static tree to a dynamic tree won't introduce much difference (select-all vs. top-k expansion and rerank logic). The major differences come from the tree structure itself, compared with the chain draft. But I'm open to the community's opinions on which we should target first.
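(To make the static-tree option concrete — the branching factors and the `draft_step` interface below are made up for illustration, not from this PR:)

```python
import torch

# Hypothetical static tree shape: number of children per node at each depth.
# Depth 0 keeps the top-4 drafts, each expands to 2, each of those to 1.
BRANCHING = [4, 2, 1]

def draft_static_tree(draft_step, root_state):
    """One top-k expansion per depth; the tree shape is fixed up front.

    `draft_step(state) -> (logits, next_state)` is a stand-in for a single
    EAGLE forward pass, not vLLM's real interface.
    """
    frontier = [root_state]
    levels = []  # drafted token ids, one tensor per depth
    for k in BRANCHING:
        next_frontier, level_tokens = [], []
        for state in frontier:
            logits, next_state = draft_step(state)
            top_tokens = torch.topk(logits, k=k, dim=-1).indices
            level_tokens.append(top_tokens)
            # Toy simplification: children share the parent's next state;
            # a real drafter would feed each child token back in.
            next_frontier.extend([next_state] * k)
        frontier = next_frontier
        levels.append(torch.cat(level_tokens))
    return levels
```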
```python
with set_forward_context(attn_metadata,
                         self.vllm_config,
                         num_tokens=input_batch_size):
    last_hidden_states, output_hidden_states = self.model(
```
Curious: does the RoPE kernel already take into account that positions can be customized?
Thanks for bringing this up! IIUC, in rotary_embedding.py we can pass an offsets argument to the forward function.
We will have to customize the logic since different paths are mixed together; I would categorize this under "Attention metadata & attention mask" in the tracker. For now, it is only a placeholder.
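(A small illustration of what "customized positions" means for a tree — the helper below is hypothetical; the real wiring would go through rotary_embedding.py, using the positions/offsets it already accepts:)

```python
import torch

def tree_rope_positions(prefix_len: int, depths: torch.Tensor) -> torch.Tensor:
    """RoPE position for each flattened tree node (hypothetical helper).

    Siblings at the same depth share a position even though they occupy
    different slots in the flattened draft-token batch.
    """
    return prefix_len + depths

# Example: root (depth 0), two children (depth 1), one grandchild (depth 2),
# appended after a 10-token prefix.
depths = torch.tensor([0, 1, 1, 2])
positions = tree_rope_positions(10, depths)  # tensor([10, 11, 11, 12])

# These positions (or equivalently per-token offsets) would then be handed
# to the rotary embedding, e.g. rotary_emb(positions, query, key).
```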
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
```python
with set_forward_context(tree_per_layer_attn_metadata,
                         self.vllm_config,
                         num_tokens=input_batch_size):
    last_hidden_states, output_hidden_states = self.model(
```
Closing because tree attention was supported in #20401.

As mentioned in #15901, we currently only support top-1 selection from the candidates produced by the EAGLE model (we call this chain-draft). Both EAGLE and EAGLE-2 claim that selecting the top-k tokens from each forward pass improves the acceptance rate, so we want to support that as well (we call it tree-draft).
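To make the chain-draft vs. tree-draft distinction concrete (a toy example, not code from this PR):

```python
import torch

logits = torch.randn(32000)  # one draft step's logits over the vocabulary

# Chain draft: only the single best candidate survives each forward pass.
chain_token = torch.argmax(logits)             # 1 continuation

# Tree draft: the top-k candidates all survive and are expanded as
# separate branches, trading extra draft tokens for a better chance
# that the target model accepts a longer prefix.
tree_tokens = torch.topk(logits, k=4).indices  # 4 continuations
```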
As this would be a big change, I would like to work on it as a WIP PR, and I would appreciate any comments/suggestions/discussion during implementation.
Design Doc: https://docs.google.com/document/d/1mMoSicPPMMzaE_T5Zk2SnTderw1OXRUs2T16JxfVGCQ/edit?usp=sharing
cc: @LiuXiaoxuanPKU @WoosukKwon