[Speculators][Speculative Decoding] Support gpt-oss eagle3 on blackwell #23596
Conversation
Performance benchmark w/ eagle3
```diff
@@ -204,6 +204,11 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
     nn.Module.__init__(self)
     self.config = vllm_config. \
         speculative_config.draft_model_config.hf_config
+    # Ensure draft_vocab_size is set
```
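As a rough illustration of the guard being discussed here (a sketch, not the PR's verbatim code; the attribute handling is an assumption), the idea is to fall back to the full vocabulary size when the draft config omits `draft_vocab_size`:

```python
# Sketch only: some eagle3 draft configs do not define draft_vocab_size,
# so default it to the target vocab size to avoid attribute errors later.
if getattr(self.config, "draft_vocab_size", None) is None:
    self.config.draft_vocab_size = self.config.vocab_size
```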
Could you check if this logic is also present in llama_eagle.py?
Added the same logic to llama_eagle.py.
vllm/v1/spec_decode/eagle.py
Outdated
```diff
@@ -888,6 +880,30 @@ def validate_same_kv_cache_group(self,
         ])
     ) == 1, "All eagle layers should belong to the same kv cache group"

+    def _get_attention_metadata_builder(self, ubatch_id):
```
Why is this logic needed?
Some models like gpt-oss have multiple attention backends (sliding window attention and full attention), so we want to make sure the draft model uses the correct attention metadata builder. Otherwise there will be accuracy issues.
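A hedged sketch of what such a lookup can look like; the `attn_groups` layout and the layer-name matching below are assumptions inferred from this thread, not the PR's verbatim code:

```python
def _get_attention_metadata_builder(self, ubatch_id: int):
    """Return the metadata builder for the attention group that contains
    the draft model's layers (sketch; attribute names are assumed)."""
    for group in self.runner.attn_groups[0]:
        # gpt-oss has separate groups for sliding-window and full
        # attention; match on layer names to pick the right one.
        if any(name in group.layer_names for name in self.attn_layer_names):
            return group.metadata_builders[ubatch_id]
    raise ValueError("No attention group matches the draft model's layers")
```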
vllm/v1/spec_decode/eagle.py
Outdated
```diff
-        attn_metadata_builder = \
-            self.runner.attn_groups[0][0].metadata_builders[ubatch_id]
-        attn_metadata = attn_metadata_builder.build_for_drafting(
+        builder = self._get_attention_metadata_builder(ubatch_id)
```
I think the result of this function can be cached.
Done, cached the attention metadata builders, because different ubatch_ids may resolve to different builders.
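A sketch of the caching shape described here, keyed by ubatch_id since different ubatches can resolve to different builders; the cache attribute and the helper name are illustrative, not the PR's exact code:

```python
# Sketch: assumes self._metadata_builder_cache = {} is set in __init__.
def _get_attention_metadata_builder(self, ubatch_id: int):
    builder = self._metadata_builder_cache.get(ubatch_id)
    if builder is None:
        # _find_metadata_builder is a hypothetical stand-in for the
        # group lookup shown earlier in this thread.
        builder = self._find_metadata_builder(ubatch_id)
        self._metadata_builder_cache[ubatch_id] = builder
    return builder
```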
This PR has been split into two PRs:
Purpose
Support eagle3 for gpt-oss on Blackwell with the FlashInfer trtllm-gen attention backend.
This PR uses some content from #25196; hold off until #25196 is merged.
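For context, exercising eagle3 speculative decoding in vLLM typically looks like the sketch below; the draft model path is a placeholder and the token count is illustrative, not something this PR prescribes:

```python
from vllm import LLM, SamplingParams

# Sketch: enable eagle3 drafting for gpt-oss. Substitute a real eagle3
# draft checkpoint for the placeholder path.
llm = LLM(
    model="openai/gpt-oss-120b",
    speculative_config={
        "method": "eagle3",
        "model": "<path-to-eagle3-draft>",
        "num_speculative_tokens": 3,
    },
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32)))
```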
Test Plan
lm_eval
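The PR does not record the exact command; a typical lm_eval run against a vLLM-backed model looks like this sketch (the task choice and model args are assumptions, not from the PR):

```python
import lm_eval

# Sketch: evaluate via lm-evaluation-harness's vLLM backend. Task and
# model args are illustrative only.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=openai/gpt-oss-120b",
    tasks=["gsm8k"],
)
print(results["results"])
```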
Test Result
w/ eagle3
w/o eagle3
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.