[CLI env var] Add VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH in env variables #25274
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which runs only a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a new environment variable VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH to allow users to configure the max_num_splits for FlashAttention with CUDA graphs. The changes correctly add the environment variable definition and parsing logic, and integrate it into the attention backend. However, there is a logical flaw in how the environment variable is defined and consumed. The current implementation in vllm/envs.py causes the check in vllm/v1/attention/backends/mla/flashattn_mla.py to always be true, leaving dead code. My review provides suggestions to align the implementation with the existing pattern for optional integer environment variables in the codebase, which will fix the logical issue and improve code clarity and consistency.
@@ -118,6 +118,7 @@
 VLLM_SERVER_DEV_MODE: bool = False
 VLLM_V1_OUTPUT_PROC_CHUNK_SIZE: int = 128
 VLLM_MLA_DISABLE: bool = False
+VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH: int = 16
For consistency with other optional integer environment variables like VLLM_FLASH_ATTN_VERSION, it's better to define this as Optional[int] and handle the default value in the consumer module (flashattn_mla.py). This makes the intent clearer that the variable is optional and has a fallback. This change is related to another suggested change for the lambda function of this environment variable.
Suggested change:
-VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH: int = 16
+VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH: Optional[int] = None
vllm/envs.py (Outdated)
| "VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH": | ||
| lambda: int(os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH", "16")), |
To allow checking if the environment variable was explicitly set, this lambda should return None when the variable is not present. The current implementation always returns an integer, which causes a logical flaw in flashattn_mla.py. Using maybe_convert_int without a default for os.getenv is the standard pattern in this file for optional integer variables like VLLM_FLASH_ATTN_VERSION.
| "VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH": | |
| lambda: int(os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH", "16")), | |
| "VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH": | |
| lambda: maybe_convert_int(os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH")), |
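For reference, here is a minimal sketch of the optional-int helper this suggestion relies on; the actual body of maybe_convert_int in vllm/envs.py may differ slightly, so treat this as an illustration:

```python
import os
from typing import Optional


def maybe_convert_int(value: Optional[str]) -> Optional[int]:
    # Return None when the variable is unset, otherwise parse it as an int.
    if value is None:
        return None
    return int(value)


# Unset variable -> None, so the consumer can distinguish "not set" from "set to 16".
print(maybe_convert_int(os.getenv("VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH")))
```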
if envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH is not None:
    logger.info_once("Getting flash attention max num splits for "
                     "cuda graph from environment variable, value=%s",
                     envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH)
    self.max_num_splits = envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH
else:
    self.max_num_splits = _DEFAULT_MAX_NUM_SPLITS_FOR_CUDA_GRAPH
There is a logical flaw here. The if condition envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH is not None will always evaluate to True because of how the environment variable is defined in vllm/envs.py. The lambda function for it always returns an integer (defaulting to 16 if not set), never None. This makes the else block unreachable (dead code).
To fix this, you should modify vllm/envs.py to follow the pattern of other optional integer environment variables. Specifically:
- Change the type hint to Optional[int] = None.
- Change the lambda to use maybe_convert_int(os.getenv(...)) without a default, so it returns None if the variable is not set.

With those changes in vllm/envs.py, this block of code will work as intended. I've added separate comments in vllm/envs.py with the specific suggestions.
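As a quick sanity check of the fixed behavior, a minimal standalone sketch (resolve_max_num_splits and the default constant are illustrative names, not the actual vLLM code):

```python
from typing import Optional

_DEFAULT_MAX_NUM_SPLITS_FOR_CUDA_GRAPH = 16


def resolve_max_num_splits(env_value: Optional[int]) -> int:
    # env_value mirrors envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH after the fix:
    # None when the variable is unset, an int when the user exported a value.
    if env_value is not None:
        return env_value
    return _DEFAULT_MAX_NUM_SPLITS_FOR_CUDA_GRAPH


assert resolve_max_num_splits(None) == 16  # unset -> module default
assert resolve_max_num_splits(32) == 32    # user override wins
```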
Thanks for the contribution! Could you also update the non-MLA flash attention backend to use this env var? Regarding Gemini's comments, I think you can get rid of
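For context, the analogous read in the non-MLA FlashAttention backend could look roughly like this; the class name below is a placeholder, not the actual vLLM source:

```python
import vllm.envs as envs


class FlashAttentionMetadataBuilderSketch:
    """Illustrative stand-in for the non-MLA backend's metadata builder."""

    def __init__(self) -> None:
        # With the variable defined in vllm/envs.py (default 16), the backend can
        # read it directly; a separate fallback constant becomes unnecessary.
        self.max_num_splits = envs.VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH
```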
Thanks Matt, updated.
LGTM! Thanks for the contribution!
Force-pushed from 2755ba8 to 5257c7b.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 5257c7b to 7a418c7.
Is the API process count and rank related to this PR? Seems like a bad merge.
is this still needed?
is this still needed?
Oh, seems like the merged PR https://github.com/vllm-project/vllm/pull/23717/files is included in mine somehow. Let me try to fix it.
Force-pushed from 551ed9e to 7a4b528.
Purpose
Add VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH to the environment variables so users have control over the CUDA graph max_num_splits at the CLI level.

When applying #23958, we realized the _DEFAULT_MAX_NUM_SPLITS_FOR_CUDA_GRAPH value is copied from flash_attn (code ref) and is noted there as something to be tuned if needed, so we should surface it to the front end.
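For illustration, one way to exercise the override (the value 32 and the model name are arbitrary examples, not part of this PR) is to export VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH before launching vLLM, or equivalently from Python:

```python
import os

# Hypothetical override: raise the CUDA graph split limit before vLLM reads its env vars.
os.environ["VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH"] = "32"

from vllm import LLM  # imported after setting the variable so the override is picked up

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1")  # example model
outputs = llm.generate(["Hello, my name is"])
```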
Test Plan
Tested based off the Docker image vllm/vllm-openai:v0.10.2 with this PR.
Test Result
Quality check
Command:
Flash Attention Quality Check on Mixtral-8x7B