[CLI env var] Add VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH in env variables #25274
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which runs only a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a new environment variable VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH to allow users to configure the max_num_splits for FlashAttention with CUDA graphs. The changes correctly add the environment variable definition and parsing logic, and integrate it into the attention backend. However, there is a logical flaw in how the environment variable is defined and consumed. The current implementation in vllm/envs.py causes the check in vllm/v1/attention/backends/mla/flashattn_mla.py to always be true, leaving dead code. My review provides suggestions to align the implementation with the existing pattern for optional integer environment variables in the codebase, which will fix the logical issue and improve code clarity and consistency.
@@ -118,6 +118,7 @@
 VLLM_SERVER_DEV_MODE: bool = False
 VLLM_V1_OUTPUT_PROC_CHUNK_SIZE: int = 128
 VLLM_MLA_DISABLE: bool = False
+VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH: int = 16
For consistency with other optional integer environment variables like VLLM_FLASH_ATTN_VERSION, it's better to define this as Optional[int] and handle the default value in the consumer module (flashattn_mla.py). This makes the intent clearer that the variable is optional and has a fallback. This change is related to another suggested change for the lambda function of this environment variable.
Suggested change:
-VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH: int = 16
+VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH: Optional[int] = None
vllm/envs.py (Outdated)
| "VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH": | ||
| lambda: int(os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH", "16")), |
To allow checking if the environment variable was explicitly set, this lambda should return None when the variable is not present. The current implementation always returns an integer, which causes a logical flaw in flashattn_mla.py. Using maybe_convert_int without a default for os.getenv is the standard pattern in this file for optional integer variables like VLLM_FLASH_ATTN_VERSION.
| "VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH": | |
| lambda: int(os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH", "16")), | |
| "VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH": | |
| lambda: maybe_convert_int(os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH")), |
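For reference, here is a minimal sketch of the optional-int helper this suggestion relies on; the actual body of maybe_convert_int in vllm/envs.py may differ slightly, so treat this as an illustration:

```python
import os
from typing import Optional


def maybe_convert_int(value: Optional[str]) -> Optional[int]:
    # Return None when the variable is unset, otherwise parse it as an int.
    if value is None:
        return None
    return int(value)


# Unset variable -> None, so the consumer can distinguish "not set" from "set to 16".
print(maybe_convert_int(os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH")))
```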
if envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH is not None:
    logger.info_once("Getting flash attention max num splits for "
                     "cuda graph from environment variable, value=%s",
                     envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH)
    self.max_num_splits = envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH
else:
    self.max_num_splits = _DEFAULT_MAX_NUM_SPLITS_FOR_CUDA_GRAPH
There is a logical flaw here. The if condition envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH is not None will always evaluate to True because of how the environment variable is defined in vllm/envs.py. The lambda function for it always returns an integer (defaulting to 16 if not set), never None. This makes the else block unreachable (dead code).
To fix this, you should modify vllm/envs.py to follow the pattern of other optional integer environment variables. Specifically:
- Change the type hint to Optional[int] = None.
- Change the lambda to use maybe_convert_int(os.getenv(...)) without a default, so it returns None if the variable is not set.

With those changes in vllm/envs.py, this block of code will work as intended. I've added separate comments in vllm/envs.py with the specific suggestions.
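As a quick sanity check of the fixed behavior, a minimal standalone sketch (resolve_max_num_splits and the default constant are illustrative names, not the actual vLLM code):

```python
from typing import Optional

_DEFAULT_MAX_NUM_SPLITS_FOR_CUDA_GRAPH = 16


def resolve_max_num_splits(env_value: Optional[int]) -> int:
    # env_value mirrors envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH after the fix:
    # None when the variable is unset, an int when the user exported a value.
    if env_value is not None:
        return env_value
    return _DEFAULT_MAX_NUM_SPLITS_FOR_CUDA_GRAPH


assert resolve_max_num_splits(None) == 16  # unset -> module default
assert resolve_max_num_splits(32) == 32    # user override wins
```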
Thanks for the contribution! Could you also update the non-MLA flash attention backend to use this env var? Regarding Gemini's comments, I think you can get rid of
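For context, the analogous read in the non-MLA FlashAttention backend could look roughly like this; the class name below is a placeholder, not the actual vLLM source:

```python
import vllm.envs as envs


class FlashAttentionMetadataBuilderSketch:
    """Illustrative stand-in for the non-MLA backend's metadata builder."""

    def __init__(self) -> None:
        # With the variable defined in vllm/envs.py (default 16), the backend can
        # read it directly; a separate fallback constant becomes unnecessary.
        self.max_num_splits = envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH
```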
Thanks Matt, updated.
LGTM! Thanks for the contribution!
Force-pushed from 2755ba8 to 5257c7b.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 5257c7b to 7a418c7.
Is the API process count and rank related to this PR? Seems like a bad merge.
is this still needed?
is this still needed?
Oh, seems like the merged PR https://github.com/vllm-project/vllm/pull/23717/files is included in mine somehow. Let me try to fix it.
Force-pushed from 551ed9e to 7a4b528.
Purpose
Add VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH to the environment variables so users have control over the CUDA graph max_num_splits at the CLI level.

When applying #23958, we realized the _DEFAULT_MAX_NUM_SPLITS_FOR_CUDA_GRAPH value is copied from flash_attn (code ref) and is noted there as something to be tuned if needed, so we should surface it to the front end.
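For illustration, one way to exercise the override (the value 32 and the model name are arbitrary examples, not part of this PR) is to export VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH before launching vLLM, or equivalently from Python:

```python
import os

# Hypothetical override: raise the CUDA graph split limit before vLLM reads its env vars.
os.environ["VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH"] = "32"

from vllm import LLM  # imported after setting the variable so the override is picked up

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1")  # example model
outputs = llm.generate(["Hello, my name is"])
```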
Test Plan
Tested based off the Docker image vllm/vllm-openai:v0.10.2 with this PR.
Test Result
Quality check
Command:
Flash Attention Quality Check on Mixtral-8x7B