
Conversation

@mgoin (Member) commented Sep 23, 2025

Purpose

This PR proposes enabling full CUDA graphs by default in vLLM V1. Support for cudagraph modes beyond piecewise-only has landed over the past months (notably #20059), and while the increase in startup time is measurable, we believe the performance gain from full cudagraphs is worth it, especially for low-latency serving of small models or MoEs.

For instance, running Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 on 1xH100 with vLLM for 10 requests (1024 input tokens and 128 output tokens each) yields the following request throughputs:

  • vllm serve (default): 4.07 req/s
  • vllm serve --async-scheduling: 4.29 req/s
  • vllm serve -O.cudagraph_mode=FULL: 5.83 req/s
  • vllm serve --async-scheduling -O.cudagraph_mode=FULL: 6.03 req/s
  • vllm serve -O.cudagraph_mode=FULL_AND_PIECEWISE: 5.98 req/s
  • vllm serve --async-scheduling -O.cudagraph_mode=FULL_AND_PIECEWISE: 6.20 req/s

Benchmark command: vllm bench serve --model Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 --port 8000 --num-prompts 10
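
For reference, here is a minimal offline sketch of the same configuration through the Python API, assuming the compilation_config keyword accepts a dict mirroring the -O CLI shorthand (the prompt and max_tokens below are illustrative, not part of the benchmark):

```python
# Hypothetical offline equivalent of `-O.cudagraph_mode=FULL_AND_PIECEWISE`,
# assuming compilation_config takes a dict with a "cudagraph_mode" key.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    compilation_config={"cudagraph_mode": "FULL_AND_PIECEWISE"},
)
outputs = llm.generate(
    ["Write a quicksort in Python."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```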

====

Startup time impact:

main: python -c "from vllm import LLM;LLM('Qwen/Qwen3-0.6B')"  34.05s user 7.32s system 190% cpu 21.684 total
pr:   python -c "from vllm import LLM;LLM('Qwen/Qwen3-0.6B')"  34.99s user 9.01s system 175% cpu 25.005 total
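
A minimal Python-side sketch of the same measurement, assuming only that constructing LLM performs compilation and graph capture:

```python
# Times end-to-end engine construction, which includes torch.compile and
# CUDA graph capture; run once on main and once on this PR to compare.
import time

from vllm import LLM

start = time.perf_counter()
llm = LLM("Qwen/Qwen3-0.6B")
print(f"Engine init took {time.perf_counter() - start:.1f}s")
```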

Test Plan

Test Result



@mgoin changed the title from "[Perf] Change the default CUDAGraphMode from PIECEWISE TO FULL_AND_PIECEWISE" to "[Perf] Change default CUDAGraphMode from PIECEWISE to FULL_AND_PIECEWISE" Sep 23, 2025
@gemini-code-assist bot left a comment


Code Review

This pull request changes the default CUDA graph mode for the v1 engine from PIECEWISE to FULL_AND_PIECEWISE when using piecewise compilation. This is a performance optimization, as FULL_AND_PIECEWISE can leverage full CUDA graphs for decode steps, which is often more efficient. The change is implemented by updating the default-setting logic in VllmConfig.__post_init__. Correspondingly, docstrings in CompilationConfig have been updated to reflect that FULL_AND_PIECEWISE is now the default mode. The changes are logical, self-contained, and appear to be correct. I have not identified any critical or high-severity issues.
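
For readers following along, a rough sketch of what such default-selection logic amounts to; the enum and helper below are illustrative stand-ins, not the actual diff in VllmConfig.__post_init__:

```python
# Illustrative sketch of the new default-selection behavior; names here are
# assumptions for explanation, not vLLM's internal code.
from enum import Enum
from typing import Optional


class CUDAGraphMode(Enum):
    NONE = "NONE"
    PIECEWISE = "PIECEWISE"
    FULL = "FULL"
    FULL_AND_PIECEWISE = "FULL_AND_PIECEWISE"


def resolve_cudagraph_mode(
    user_mode: Optional[CUDAGraphMode],
    piecewise_compilation_enabled: bool,
) -> CUDAGraphMode:
    """Pick a cudagraph mode when the user did not set one explicitly."""
    if user_mode is not None:
        return user_mode  # an explicit user choice always wins
    if piecewise_compilation_enabled:
        # New default: full graphs for pure-decode batches, with piecewise
        # graphs as the fallback for mixed prefill/decode batches.
        return CUDAGraphMode.FULL_AND_PIECEWISE
    return CUDAGraphMode.NONE


# Example: no explicit mode + piecewise compilation -> FULL_AND_PIECEWISE
assert resolve_cudagraph_mode(None, True) is CUDAGraphMode.FULL_AND_PIECEWISE
```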

@mgoin added the performance and ready labels Sep 23, 2025
@mgoin (Member, Author) commented Sep 23, 2025

Please wait on this: the time to capture the (decode, FULL) cudagraphs is very high, roughly an order of magnitude more than for the (mixed prefill-decode, PIECEWISE) or (mixed prefill-decode, FULL) graphs. We should get this down first.
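
For context, capture time is spent looping over the configured capture sizes and recording one graph per size; a generic PyTorch sketch of that pattern (illustrative only, not vLLM's capture code) looks roughly like this:

```python
# Generic CUDA graph capture loop: one graph per batch size, which is why
# adding a second family of graphs (decode, FULL) adds noticeable capture time.
import torch


def capture_graphs(model, capture_sizes, hidden_size=1024, device="cuda"):
    graphs = {}
    for bs in capture_sizes:
        static_input = torch.zeros(bs, hidden_size, device=device)
        # Warm up on a side stream before capture, as PyTorch recommends.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            model(static_input)
        torch.cuda.current_stream().wait_stream(s)
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            static_output = model(static_input)
        graphs[bs] = (g, static_input, static_output)
    return graphs
```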

@gau-nernst (Contributor) commented:

I was just reading up on this and tried out FULL_AND_PIECEWISE today. And it's already becoming the new default. Awesome and thank you 🙏

@LucasWilkinson (Collaborator) left a comment


Overall LGTM, thanks! Left one nit.

@mgoin merged commit 24fab45 into vllm-project:main Sep 23, 2025
40 checks passed
@mgoin deleted the cudagraph_mode-FULL_AND_PIECEWISE-default branch September 23, 2025 19:29
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
@russellb mentioned this pull request Sep 25, 2025
@fhl2000 mentioned this pull request Sep 27, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
gjc0824 pushed a commit to gjc0824/vllm that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Labels: performance, ready, v1