[BugFix] Stopgap - Flashinfer Autotuner + GPT-OSS + DP/TP #27762
Conversation
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Code Review
This pull request addresses a bug where the Flashinfer autotuner fails during engine startup under a specific configuration: data or tensor parallelism combined with a Flashinfer MXFP4 backend in eager mode. The fix correctly identifies this problematic configuration by checking for tensor or data parallelism, the use of any Flashinfer MXFP4 backend, and whether the execution mode is eager. The changes are well implemented, using clearly named boolean variables to improve the readability of the condition. The fix appears correct and complete based on the problem description and test results.
@nvpohanh I verified that with the PR, flashinfer autotuning happens.
@elvischenv has also verified this. Thanks for the quick fix!
@mgoin It would be great if we can include this in v0.11.1 to avoid unexpected performance regression. Thanks!
Purpose
Running the Flashinfer autotuner when data or tensor parallelism is enabled, a Flashinfer MXFP4 backend is selected, and the engine runs in eager mode causes engine startup to fail. This error was initially thought to be related to DP + certain choices for the MXFP4 backend. This PR updates the skip condition to also cover TP and all Flashinfer MXFP4 backends.
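For illustration only, here is a minimal sketch of the kind of skip condition described above. The names (`EngineSetup`, `should_skip_flashinfer_autotune`) are hypothetical stand-ins, not vLLM's actual config objects; the environment variable names are the ones exercised in the test plan below.

```python
import os
from dataclasses import dataclass


# Illustrative stand-in for the relevant bits of vLLM's config; the real
# change lives in vLLM's Flashinfer autotuning setup, not in this snippet.
@dataclass
class EngineSetup:
    tensor_parallel_size: int
    data_parallel_size: int
    enforce_eager: bool


# The three Flashinfer MXFP4 MoE backend flags exercised in the test plan.
_FLASHINFER_MXFP4_FLAGS = (
    "VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8",
    "VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS",
    "VLLM_USE_FLASHINFER_MOE_MXFP4_BF16",
)


def should_skip_flashinfer_autotune(setup: EngineSetup) -> bool:
    """True for the configuration this PR stops autotuning on."""
    is_tp_or_dp = setup.tensor_parallel_size > 1 or setup.data_parallel_size > 1
    uses_flashinfer_mxfp4 = any(
        os.environ.get(flag, "0") == "1" for flag in _FLASHINFER_MXFP4_FLAGS
    )
    return is_tp_or_dp and uses_flashinfer_mxfp4 and setup.enforce_eager
```

When the condition holds, autotuning is skipped so engine startup can proceed instead of failing.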
Test Plan and Test Result
B200 DP
VLLM_ALL2ALL_BACKEND="deepep_high_throughput" canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 9010
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 9010 --enforce-eager
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 9010 --enforce-eager
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 9010 --enforce-eager
B200 TP
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 1 --tensor-parallel-size 2 --no-enable-prefix-caching --port 9010 --enforce-eager
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 1 --tensor-parallel-size 2 --no-enable-prefix-caching --port 9010 --enforce-eager
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 1 --tensor-parallel-size 2 --no-enable-prefix-caching --port 9010 --enforce-eager
H100 DP
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 9010 --enforce-eager
H100 TP
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 1 --tensor-parallel-size 2 --no-enable-prefix-caching --port 9010 --enforce-eager