
Conversation

@jiahanc jiahanc commented Aug 25, 2025

Purpose

Support eagle3 for gpt-oss on Blackwell with the FlashInfer trtllm-gen attention backend.
This PR reuses some content from #25196; hold off on merging until #25196 is merged.
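
For reference, a minimal sketch of how this feature would be exercised offline (not part of this PR's test plan; the draft checkpoint path and `num_speculative_tokens` value below are placeholders, not a real checkpoint):

```python
# Hedged sketch: running gpt-oss with an eagle3 draft model through vLLM's offline API.
# "path/to/gpt-oss-eagle3-draft" and num_speculative_tokens=3 are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",
    speculative_config={
        "method": "eagle3",
        "model": "path/to/gpt-oss-eagle3-draft",  # hypothetical draft checkpoint
        "num_speculative_tokens": 3,
    },
)

out = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```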

Test Plan

lm_eval

lm_eval --model local-completions --tasks gsm8k --model_args model=openai/gpt-oss-120b,base_url=http://0.0.0.0:30000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.8

Test Result

w/ eagle3

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8286|±  |0.0116|
|     |       |strict-match    |     5|exact_match|↑  |0.6212|±  |0.0149|

w/o eagle3

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8277|±  |0.0116|
|     |       |strict-match    |     5|exact_match|↑  |0.6155|±  |0.0150|

(Optional) Documentation Update



@mergify mergify bot added llama Related to Llama models new-model Requests to new models qwen Related to Qwen models gpt-oss Related to GPT-OSS models speculative-decoding v1 labels Aug 25, 2025
@jiahanc jiahanc changed the title [Feat][Spec Dec] Support gpt-oss eagle3 on blackwell [Speculators][Speculative Decoding] Support gpt-oss eagle3 on blackwell Aug 25, 2025
mergify bot commented Aug 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 26, 2025
@jiahanc jiahanc force-pushed the jiahanc/gpt-oss-eagle3 branch from 281a96a to bace05e Compare August 26, 2025 18:24
@mergify mergify bot removed the needs-rebase label Aug 26, 2025
mergify bot commented Aug 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 28, 2025
@jiahanc jiahanc force-pushed the jiahanc/gpt-oss-eagle3 branch from bca2b10 to f334373 Compare September 3, 2025 22:22
@mergify mergify bot removed the needs-rebase label Sep 3, 2025
mergify bot commented Sep 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot commented Sep 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@jiahanc jiahanc force-pushed the jiahanc/gpt-oss-eagle3 branch 2 times, most recently from 7d5e472 to d957ce9 Compare September 18, 2025 21:49
jiahanc commented Sep 18, 2025

Performance benchmark with `vllm bench serve` on the spec_bench dataset (concurrency 1, 100 requests). Seeing a ~1.5x speedup.
w/o eagle3

============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             1         
Benchmark duration (s):                  415.52    
Total input tokens:                      34733     
Total generated tokens:                  102400    
Request throughput (req/s):              0.24      
Output token throughput (tok/s):         246.44    
Peak output token throughput (tok/s):    251.00    
Peak concurrent requests:                2.00      
Total Token throughput (tok/s):          330.03    
---------------Time to First Token----------------
Mean TTFT (ms):                          23.89     
Median TTFT (ms):                        19.70     
P99 TTFT (ms):                           47.92     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.04      
Median TPOT (ms):                        4.03      
P99 TPOT (ms):                           4.23      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.04      
Median ITL (ms):                         4.04      
P99 ITL (ms):                            4.40      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          4155.07   
Median E2EL (ms):                        4141.16   
P99 E2EL (ms):                           4348.93   
==================================================

w/ eagle3

============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             1         
Benchmark duration (s):                  268.29    
Total input tokens:                      34733     
Total generated tokens:                  102400    
Request throughput (req/s):              0.37      
Output token throughput (tok/s):         381.67    
Total Token throughput (tok/s):          511.13    
---------------Time to First Token----------------
Mean TTFT (ms):                          28.04     
Median TTFT (ms):                        21.66     
P99 TTFT (ms):                           52.68     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.60      
Median TPOT (ms):                        2.59      
P99 TPOT (ms):                           3.34      
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.35      
Median ITL (ms):                         6.35      
P99 ITL (ms):                            6.64      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          2682.78   
Median E2EL (ms):                        2698.46   
P99 E2EL (ms):                           3439.56   
==================================================

@@ -204,6 +204,11 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        nn.Module.__init__(self)
        self.config = vllm_config. \
            speculative_config.draft_model_config.hf_config
        # Ensure draft_vocab_size is set
Collaborator:

Could you check if this logic is also present in llama_eagle.py?

Contributor Author (@jiahanc):

Added the same logic to llama_eagle.py.
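
For context, a rough sketch of what the "ensure draft_vocab_size is set" fallback looks like conceptually (an assumption based on the surrounding `__init__`, not the literal PR diff):

```python
# Conceptual sketch: if the draft config omits draft_vocab_size, fall back to the
# target vocab size so the draft-to-target token-id mapping buffers are sized
# consistently with the target model.
if getattr(self.config, "draft_vocab_size", None) is None:
    self.config.draft_vocab_size = self.config.vocab_size
```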

@@ -888,6 +880,30 @@ def validate_same_kv_cache_group(self,
            ])
        ) == 1, "All eagle layers should belong to the same kv cache group"

    def _get_attention_metadata_builder(self, ubatch_id):
Collaborator:

Why is this logic needed?

Contributor Author (@jiahanc) Sep 18, 2025:

Some models, like gpt-oss, have multiple attention backends (sliding-window attention and full attention), so we want to make sure the draft model uses the correct attention metadata builder. Otherwise there will be accuracy issues.
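
A rough sketch of the selection idea (attribute names such as `layer_names` and `self.attn_layer_names` are assumptions for illustration, not the PR's exact code):

```python
def _get_attention_metadata_builder(self, ubatch_id: int):
    # gpt-oss registers both sliding-window and full-attention backends, so the
    # drafter must pick the metadata builder from the attention group that
    # actually owns the eagle3 draft layers, instead of always taking
    # attn_groups[0][0].
    for group in self.runner.attn_groups[0]:
        if any(name in group.layer_names for name in self.attn_layer_names):
            return group.metadata_builders[ubatch_id]
    # Fall back to the first group if no draft layer matched.
    return self.runner.attn_groups[0][0].metadata_builders[ubatch_id]
```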

mergify bot commented Sep 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 19, 2025
@jiahanc jiahanc force-pushed the jiahanc/gpt-oss-eagle3 branch from d061669 to 74bd334 Compare September 19, 2025 17:14
@mergify mergify bot removed the needs-rebase label Sep 19, 2025
attn_metadata_builder = \
self.runner.attn_groups[0][0].metadata_builders[ubatch_id]
attn_metadata = attn_metadata_builder.build_for_drafting(
builder = self._get_attention_metadata_builder(ubatch_id)
Collaborator:

I think the result of this function can be cached.

Contributor Author (@jiahanc):

Done. Cached the attention metadata builders, since different ubatch_ids may have different builders.
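
Roughly, the caching looks like this (names such as `_cached_metadata_builders` and `_resolve_builder` are illustrative, not the exact code):

```python
def _get_attention_metadata_builder(self, ubatch_id: int):
    # Resolve the builder once per ubatch_id and reuse it on later drafting steps;
    # different ubatch_ids may map to different builders.
    if ubatch_id not in self._cached_metadata_builders:  # dict initialized in __init__
        self._cached_metadata_builders[ubatch_id] = self._resolve_builder(ubatch_id)
    return self._cached_metadata_builders[ubatch_id]
```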

mergify bot commented Sep 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 22, 2025
jiahanc commented Sep 26, 2025

This PR has been split into 2 PRs:

  1. The main change moved to [Speculators][Speculative Decoding] Fix gpt-oss eagle3 accuracy issue #25406
  2. The remaining changes are part of [Spec Decode] Enable FlashInfer Spec Decoding #25196

Closing this one.

@jiahanc jiahanc closed this Sep 26, 2025
@github-project-automation github-project-automation bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Sep 26, 2025

Labels

ci/build frontend gpt-oss Related to GPT-OSS models llama Related to Llama models needs-rebase new-model Requests to new models performance Performance-related issues qwen Related to Qwen models rocm Related to AMD ROCm speculative-decoding v1

Projects

Status: Done


3 participants