[Perf] Change default CUDAGraphMode from PIECEWISE to FULL_AND_PIECEWISE #25444
Conversation
Code Review
This pull request changes the default CUDA graph mode for the v1 engine from PIECEWISE to FULL_AND_PIECEWISE when using piecewise compilation. This is a performance optimization, as FULL_AND_PIECEWISE can leverage full CUDA graphs for decode steps, which is often more efficient. The change is implemented by updating the default-setting logic in VllmConfig.__post_init__. Correspondingly, docstrings in CompilationConfig have been updated to reflect that FULL_AND_PIECEWISE is now the default mode. The changes are logical, self-contained, and appear to be correct. I have not identified any critical or high-severity issues.
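To make the described behavior concrete, here is a minimal, self-contained sketch of the new default selection. The enum values mirror vLLM's `CUDAGraphMode`; the helper function and its parameter are illustrative only, not the actual `VllmConfig.__post_init__` code.

```python
from enum import Enum


class CUDAGraphMode(Enum):
    # Values mirror the CUDAGraphMode options referenced in this PR.
    NONE = "NONE"
    PIECEWISE = "PIECEWISE"
    FULL = "FULL"
    FULL_AND_PIECEWISE = "FULL_AND_PIECEWISE"


def default_cudagraph_mode(piecewise_compilation: bool) -> CUDAGraphMode:
    """Illustrative default selection: with piecewise compilation enabled,
    this PR switches the default from PIECEWISE to FULL_AND_PIECEWISE so
    decode steps can run under a full CUDA graph."""
    if piecewise_compilation:
        return CUDAGraphMode.FULL_AND_PIECEWISE  # new default after this PR
    return CUDAGraphMode.NONE  # without piecewise compilation, no graphs by default


print(default_cudagraph_mode(True))  # CUDAGraphMode.FULL_AND_PIECEWISE
```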
Please wait on this, the time to capture the …
I was just reading up on this and tried out …
Overall LGTM, thanks! Left one nit.
Purpose
This PR proposes enabling full cudagraphs by default in vLLM V1. Support for cudagraphs beyond piecewise-only was added over the past months (notably #20059), and while the startup-time increase is measurable, we believe the performance gain from full cudagraphs is worth it, especially for low-latency serving of small models or MoEs.
For instance, running Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 on 1xH100 with vLLM for 10 requests with 1024 input and 128 output tokens yields the following throughputs:
- `vllm serve` (default): 4.07 req/s
- `vllm serve --async-scheduling`: 4.29 req/s
- `vllm serve -O.cudagraph_mode=FULL`: 5.83 req/s
- `vllm serve --async-scheduling -O.cudagraph_mode=FULL`: 6.03 req/s
- `vllm serve -O.cudagraph_mode=FULL_AND_PIECEWISE`: 5.98 req/s
- `vllm serve --async-scheduling -O.cudagraph_mode=FULL_AND_PIECEWISE`: 6.20 req/s

Benchmark command:

```
vllm bench serve --model Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 --port 8000 --num-prompts 10
```
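For deployments where the longer capture time at startup is not worth it, the previous behavior can still be requested explicitly. Below is a hedged sketch using the offline API, assuming `compilation_config` accepts a dict as in recent vLLM releases; the `"cudagraph_mode"` key mirrors the `-O.cudagraph_mode` CLI overrides shown above, and exact spelling may vary by version.

```python
from vllm import LLM

# Illustrative: explicitly opt back into the old PIECEWISE default.
# The compilation_config keyword and "cudagraph_mode" key mirror the
# -O.cudagraph_mode CLI override; check your vLLM version's docs.
llm = LLM(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    compilation_config={"cudagraph_mode": "PIECEWISE"},
)
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```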
Startup time impact:
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.