
Conversation

@jiahanc jiahanc commented Aug 25, 2025

Purpose

Support eagle3 for gpt-oss on Blackwell with the FlashInfer trtllm-gen attention backend.
This PR reuses some content from #25196; hold off on merging until #25196 is merged.
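
For reference, a minimal sketch of how this feature would be exercised offline (not part of this PR's test plan; the draft checkpoint path and `num_speculative_tokens` value below are placeholders, not a real checkpoint):

```python
# Hedged sketch: running gpt-oss with an eagle3 draft model through vLLM's offline API.
# "path/to/gpt-oss-eagle3-draft" and num_speculative_tokens=3 are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",
    speculative_config={
        "method": "eagle3",
        "model": "path/to/gpt-oss-eagle3-draft",  # hypothetical draft checkpoint
        "num_speculative_tokens": 3,
    },
)

out = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```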

Test Plan

lm_eval

lm_eval --model local-completions --tasks gsm8k --model_args model=openai/gpt-oss-120b,base_url=http://0.0.0.0:30000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.8

Test Result

w/ eagle3

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8286|±  |0.0116|
|     |       |strict-match    |     5|exact_match|↑  |0.6212|±  |0.0149|

w/o eagle3

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8277|±  |0.0116|
|     |       |strict-match    |     5|exact_match|↑  |0.6155|±  |0.0150|

(Optional) Documentation Update



@mergify mergify bot added llama Related to Llama models new-model Requests to new models qwen Related to Qwen models gpt-oss Related to GPT-OSS models speculative-decoding v1 labels Aug 25, 2025
@jiahanc jiahanc changed the title [Feat][Spec Dec] Support gpt-oss eagle3 on blackwell [Speculators][Speculative Decoding] Support gpt-oss eagle3 on blackwell Aug 25, 2025
mergify bot commented Aug 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 26, 2025
@jiahanc jiahanc force-pushed the jiahanc/gpt-oss-eagle3 branch from 281a96a to bace05e Compare August 26, 2025 18:24
@mergify mergify bot removed the needs-rebase label Aug 26, 2025
mergify bot commented Aug 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 28, 2025
@jiahanc jiahanc force-pushed the jiahanc/gpt-oss-eagle3 branch from bca2b10 to f334373 Compare September 3, 2025 22:22
@mergify mergify bot removed the needs-rebase label Sep 3, 2025
mergify bot commented Sep 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot commented Sep 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@jiahanc jiahanc force-pushed the jiahanc/gpt-oss-eagle3 branch 2 times, most recently from 7d5e472 to d957ce9 Compare September 18, 2025 21:49
jiahanc commented Sep 18, 2025

Performance benchmark with `vllm bench serve` on the spec_bench dataset (concurrency 1, 100 requests). Seeing a ~1.5x speedup.
w/o eagle3

============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             1         
Benchmark duration (s):                  415.52    
Total input tokens:                      34733     
Total generated tokens:                  102400    
Request throughput (req/s):              0.24      
Output token throughput (tok/s):         246.44    
Peak output token throughput (tok/s):    251.00    
Peak concurrent requests:                2.00      
Total Token throughput (tok/s):          330.03    
---------------Time to First Token----------------
Mean TTFT (ms):                          23.89     
Median TTFT (ms):                        19.70     
P99 TTFT (ms):                           47.92     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.04      
Median TPOT (ms):                        4.03      
P99 TPOT (ms):                           4.23      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.04      
Median ITL (ms):                         4.04      
P99 ITL (ms):                            4.40      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          4155.07   
Median E2EL (ms):                        4141.16   
P99 E2EL (ms):                           4348.93   
==================================================

w/ eagle3

============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             1         
Benchmark duration (s):                  268.29    
Total input tokens:                      34733     
Total generated tokens:                  102400    
Request throughput (req/s):              0.37      
Output token throughput (tok/s):         381.67    
Total Token throughput (tok/s):          511.13    
---------------Time to First Token----------------
Mean TTFT (ms):                          28.04     
Median TTFT (ms):                        21.66     
P99 TTFT (ms):                           52.68     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.60      
Median TPOT (ms):                        2.59      
P99 TPOT (ms):                           3.34      
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.35      
Median ITL (ms):                         6.35      
P99 ITL (ms):                            6.64      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          2682.78   
Median E2EL (ms):                        2698.46   
P99 E2EL (ms):                           3439.56   
==================================================

@@ -204,6 +204,11 @@ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        nn.Module.__init__(self)
        self.config = vllm_config. \
            speculative_config.draft_model_config.hf_config
        # Ensure draft_vocab_size is set
Collaborator:

Could you check if this logic is also present in llama_eagle.py?

Contributor Author (@jiahanc):

Added the same logic to llama_eagle.py.
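
For context, a rough sketch of what the "ensure draft_vocab_size is set" fallback looks like conceptually (an assumption based on the surrounding `__init__`, not the literal PR diff):

```python
# Conceptual sketch: if the draft config omits draft_vocab_size, fall back to the
# target vocab size so the draft-to-target token-id mapping buffers are sized
# consistently with the target model.
if getattr(self.config, "draft_vocab_size", None) is None:
    self.config.draft_vocab_size = self.config.vocab_size
```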

@@ -888,6 +880,30 @@ def validate_same_kv_cache_group(self,
            ])
        ) == 1, "All eagle layers should belong to the same kv cache group"

    def _get_attention_metadata_builder(self, ubatch_id):
Collaborator:

Why is this logic needed?

Contributor Author (@jiahanc) Sep 18, 2025:

Some models, like gpt-oss, have multiple attention backends (sliding-window attention and full attention), so we want to make sure the draft model uses the correct attention metadata builder. Otherwise there will be accuracy issues.
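
A rough sketch of the selection idea (attribute names such as `layer_names` and `self.attn_layer_names` are assumptions for illustration, not the PR's exact code):

```python
def _get_attention_metadata_builder(self, ubatch_id: int):
    # gpt-oss registers both sliding-window and full-attention backends, so the
    # drafter must pick the metadata builder from the attention group that
    # actually owns the eagle3 draft layers, instead of always taking
    # attn_groups[0][0].
    for group in self.runner.attn_groups[0]:
        if any(name in group.layer_names for name in self.attn_layer_names):
            return group.metadata_builders[ubatch_id]
    # Fall back to the first group if no draft layer matched.
    return self.runner.attn_groups[0][0].metadata_builders[ubatch_id]
```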

mergify bot commented Sep 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 19, 2025
@jiahanc jiahanc force-pushed the jiahanc/gpt-oss-eagle3 branch from d061669 to 74bd334 Compare September 19, 2025 17:14
@mergify mergify bot removed the needs-rebase label Sep 19, 2025
attn_metadata_builder = \
self.runner.attn_groups[0][0].metadata_builders[ubatch_id]
attn_metadata = attn_metadata_builder.build_for_drafting(
builder = self._get_attention_metadata_builder(ubatch_id)
Collaborator:

I think the result of this function can be cached.

Contributor Author (@jiahanc):

Done. Cached the attention metadata builders, since different ubatch_ids may have different builders.
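
Roughly, the caching looks like this (names such as `_cached_metadata_builders` and `_resolve_builder` are illustrative, not the exact code):

```python
def _get_attention_metadata_builder(self, ubatch_id: int):
    # Resolve the builder once per ubatch_id and reuse it on later drafting steps;
    # different ubatch_ids may map to different builders.
    if ubatch_id not in self._cached_metadata_builders:  # dict initialized in __init__
        self._cached_metadata_builders[ubatch_id] = self._resolve_builder(ubatch_id)
    return self._cached_metadata_builders[ubatch_id]
```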

mergify bot commented Sep 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 22, 2025
jiahanc commented Sep 26, 2025

This PR has been split into 2 PRs:

  1. The main change moved to [Speculators][Speculative Decoding] Fix gpt-oss eagle3 accuracy issue #25406
  2. The remaining changes are part of [Spec Decode] Enable FlashInfer Spec Decoding #25196

Closing this one.

@jiahanc jiahanc closed this Sep 26, 2025
@github-project-automation github-project-automation bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Sep 26, 2025

Labels

ci/build frontend gpt-oss Related to GPT-OSS models llama Related to Llama models needs-rebase new-model Requests to new models performance Performance-related issues qwen Related to Qwen models rocm Related to AMD ROCm speculative-decoding v1

Projects

Status: Done


3 participants