
Conversation

@benchislett (Collaborator) commented Oct 15, 2025

Purpose

TRTLLM-gen kernels support full CUDA graphs, but they are only used with FlashInfer on Blackwell under certain conditions.
It might not be safe to change FlashInfer's cudagraph_support to UNIFORM_BATCH unconditionally, but we can still set it when we know the TRTLLM-gen backend will be used.
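
Conceptually, the idea is to keep a conservative class-level default and only promote the support level on the builder instance when the TRTLLM-gen path is known to be taken. A minimal sketch of that shape (the enum loosely mirrors AttentionCGSupport; the helper name, constructor arguments, and default level here are illustrative, not the exact vLLM code):

```python
from enum import Enum


class AttentionCGSupport(Enum):
    # Ordered so that .value comparisons express "at least this much support".
    NEVER = 0
    UNIFORM_SINGLE_TOKEN_DECODE = 1
    UNIFORM_BATCH = 2
    ALWAYS = 3


def will_use_trtllm_attention(is_blackwell: bool, head_layout_supported: bool) -> bool:
    # Illustrative stand-in for the real eligibility check
    # (Blackwell GPU, supported head/dtype configuration, etc.).
    return is_blackwell and head_layout_supported


class FlashInferBuilderSketch:
    # Conservative class-level default that is safe for every configuration.
    cudagraph_support: AttentionCGSupport = AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE

    def __init__(self, is_blackwell: bool, head_layout_supported: bool) -> None:
        # Promote on this instance only when the TRTLLM-gen kernels,
        # which support full CUDA graphs, will actually be used.
        if will_use_trtllm_attention(is_blackwell, head_layout_supported):
            self.cudagraph_support = AttentionCGSupport.UNIFORM_BATCH
```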

Also update the docs to reflect the FlashInfer and FlashInferMLA CUDA graph compatibility.

FIX #26856

Test Plan

Ran Llama 3.1 8B-Instruct with EAGLE3 and confirmed that lm_eval gsm8k accuracy is unchanged compared to main and to runs with TRTLLM attention force-disabled. Confirmed via a torch profile that full graphs are now issued for verification when TRTLLM attention is enabled.

Test Result

TRTLLM on:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7726|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7013|±  |0.0126|

TRTLLM off:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7726|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7013|±  |0.0126|

Benchmarks

MT-Bench at concurrency 1 sees a minimal speedup (~2%).

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}' &

vllm bench serve --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --max-concurrency 1 --model meta-llama/Llama-3.1-8B-Instruct --base-url http://0.0.0.0:8049
```

Before:

```
============ Serving Benchmark Result ============
Successful requests:                     80        
Maximum request concurrency:             1         
Benchmark duration (s):                  42.58     
Total input tokens:                      8133      
Total generated tokens:                  16955     
Request throughput (req/s):              1.88      
Output token throughput (tok/s):         398.23    
Peak output token throughput (tok/s):    186.00    
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          589.25    
---------------Time to First Token----------------
Mean TTFT (ms):                          12.11     
Median TTFT (ms):                        11.88     
P99 TTFT (ms):                           14.29     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.45      
Median TPOT (ms):                        2.45      
P99 TPOT (ms):                           3.33      
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.36      
Median ITL (ms):                         5.36      
P99 ITL (ms):                            5.61      
==================================================
```

After:

```
============ Serving Benchmark Result ============
Successful requests:                     80        
Maximum request concurrency:             1         
Benchmark duration (s):                  41.73     
Total input tokens:                      8133      
Total generated tokens:                  16795     
Request throughput (req/s):              1.92      
Output token throughput (tok/s):         402.47    
Peak output token throughput (tok/s):    190.00    
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          597.37    
---------------Time to First Token----------------
Mean TTFT (ms):                          11.86     
Median TTFT (ms):                        11.75     
P99 TTFT (ms):                           14.93     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.43      
Median TPOT (ms):                        2.37      
P99 TPOT (ms):                           3.39      
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.26      
Median ITL (ms):                         5.25      
P99 ITL (ms):                            5.48      
==================================================
```

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett requested a review from mgoin as a code owner, October 15, 2025 19:12
mergify bot commented Oct 15, 2025

Documentation preview: https://vllm--26937.org.readthedocs.build/en/26937/

mergify bot added the documentation and v1 labels, Oct 15, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request enables full CUDA graphs for speculative decoding with FlashInfer when TRT-LLM attention kernels are available, which is a valuable performance enhancement. The implementation correctly updates the cudagraph_support attribute in FlashInferMetadataBuilder at runtime based on whether TRT-LLM attention can be used. The change from a class variable to an instance variable for cudagraph_support is appropriate for this dynamic behavior. The documentation has also been updated to reflect these changes. The logic appears sound and the provided test results indicate that correctness is maintained while enabling this optimization.

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@vadiklyutiy (Contributor)

Regarding the performance improvement: I tried this on Qwen3-Next with 2 prediction tokens. With batch=1 it improves from 92 tok/s to 222 tok/s.

@mgoin (Member) commented Oct 15, 2025

cc @LucasWilkinson @ProExpertProg regarding updating AttentionCGSupport dynamically

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@LucasWilkinson (Collaborator) commented Oct 16, 2025

> cc @LucasWilkinson @ProExpertProg regarding updating AttentionCGSupport dynamically

Dynamically updating it should be fine, since we only check it on instances here:

```python
if builder.cudagraph_support.value < min_cg_support.value:
```

But if we are going to dynamically update it, I think we should make it an instance property instead of a class variable, just to avoid confusion and future bugs.
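
To make the concern concrete: assignment in __init__ shadows the class attribute on that instance only, while writing through the class mutates shared state for every instance and subclass. A toy example (plain strings stand in for the real AttentionCGSupport enum):

```python
class BuilderBase:
    # Class-level default shared by all instances that do not override it.
    cudagraph_support: str = "NEVER"


class FlashInferLikeBuilder(BuilderBase):
    def __init__(self, trtllm_available: bool) -> None:
        # Instance assignment shadows the class default for this object only.
        if trtllm_available:
            self.cudagraph_support = "UNIFORM_BATCH"


a = FlashInferLikeBuilder(trtllm_available=True)
b = FlashInferLikeBuilder(trtllm_available=False)
assert a.cudagraph_support == "UNIFORM_BATCH"
assert b.cudagraph_support == "NEVER"

# Writing through the class instead silently changes every instance that has
# not shadowed the attribute; this is the kind of confusion to avoid.
BuilderBase.cudagraph_support = "ALWAYS"
assert b.cudagraph_support == "ALWAYS"
assert a.cudagraph_support == "UNIFORM_BATCH"  # already shadowed, unaffected
```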

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
…ables

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
mergify bot added the rocm (Related to AMD ROCm) label, Oct 21, 2025
@LucasWilkinson (Collaborator) left a comment

LGTM; there are some nits that should be addressed (specifically, for the CPU backend I think we should still keep reorder_batch_threshold = 1).

It is a bit harder to see where cudagraph_support is set now :/ I guess the alternative would be to use a function, i.e. add a get_cudagraph_support() function in the base class (I think the current implementation is better, but I'm also flip-flopping haha).
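
For reference, that getter-based alternative might look roughly like this (class names and the get_cudagraph_support signature are only a sketch, not the actual vLLM base-class API):

```python
from enum import Enum


class AttentionCGSupport(Enum):
    NEVER = 0
    UNIFORM_BATCH = 2
    ALWAYS = 3


class AttentionMetadataBuilderBase:
    def get_cudagraph_support(self) -> AttentionCGSupport:
        # Conservative default; backends override when they can do better.
        return AttentionCGSupport.NEVER


class FlashInferLikeBuilder(AttentionMetadataBuilderBase):
    def __init__(self, trtllm_available: bool) -> None:
        self._trtllm_available = trtllm_available

    def get_cudagraph_support(self) -> AttentionCGSupport:
        # The decision point lives in one overridable method,
        # so it is easy to find where support is determined.
        if self._trtllm_available:
            return AttentionCGSupport.UNIFORM_BATCH
        return AttentionCGSupport.NEVER
```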


```diff
 class TorchSDPAMetadataBuilderV1(AttentionMetadataBuilder[TorchSDPAMetadata]):
-    reorder_batch_threshold: int = 1
+    reorder_batch_threshold: int
```
Collaborator

nit: I still think this needs to be set?

Collaborator Author

It is set in the constructor: _init_reorder_batch_threshold(1, False)

The type annotation is left to indicate that it will never be None on this class and its subclasses. This is a common pattern in the changes in this PR.
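
In other words: the base class allows the value to be absent, the subclass re-annotates without assigning to narrow the type, and the concrete value is always set in the constructor. A rough sketch (the helper body and its second parameter are simplified stand-ins for the real _init_reorder_batch_threshold):

```python
class AttentionMetadataBuilderBase:
    # In the base class the threshold may legitimately be unset.
    reorder_batch_threshold: int | None = None

    def _init_reorder_batch_threshold(
        self, threshold: int, supports_spec_as_decode: bool
    ) -> None:
        # Simplified stand-in for the real helper, which may adjust the
        # threshold depending on speculative-decode support.
        self.reorder_batch_threshold = threshold


class TorchSDPALikeBuilder(AttentionMetadataBuilderBase):
    # Annotation only, no assignment: this does not create a class-level
    # default, it just tells the type checker that the attribute is always
    # a plain int on this class and its subclasses.
    reorder_batch_threshold: int

    def __init__(self) -> None:
        self._init_reorder_batch_threshold(1, False)
```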

```diff
     AttentionMetadataBuilder[XFormersAttentionMetadata]
 ):
-    reorder_batch_threshold: int = 1
+    reorder_batch_threshold: int
```
Collaborator

nit: is this still needed?

Collaborator Author

yes, as type annotation, see previous comment

```diff
 )

-    reorder_batch_threshold: int = 1
+    reorder_batch_threshold: int
```
Collaborator

nit: is this still needed?

Collaborator Author

yes, as type annotation, see previous comment


Labels

documentation (Improvements or additions to documentation), rocm (Related to AMD ROCm), v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Performance]: FlashInfer attn backend. Use dynamic AttentionCGSupport

4 participants