
Conversation

@varun-sundar-rabindranath (Contributor) commented Oct 29, 2025

Purpose

Running the FlashInfer autotuner causes engine startup to fail when all of the following hold:

  • data parallelism or tensor parallelism is in use, and
  • a FlashInfer MXFP4 backend is selected, and
  • the engine is running in eager mode.

The startup fails with:

(EngineCore_DP1 pid=3074289)   File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP1 pid=3074289)     return forward_call(*args, **kwargs)
(EngineCore_DP1 pid=3074289)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP1 pid=3074289)   File "/home/varun-sundar-rabindranath/code/vllm/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 1168, in forward
(EngineCore_DP1 pid=3074289)     fused_out = self._fused_experts(
(EngineCore_DP1 pid=3074289)                 ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP1 pid=3074289)   File "/home/varun-sundar-rabindranath/code/vllm/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 1021, in _fused_experts
(EngineCore_DP1 pid=3074289)     self.fused_experts.apply(
(EngineCore_DP1 pid=3074289)   File "/home/varun-sundar-rabindranath/code/vllm/vllm/model_executor/layers/fused_moe/trtllm_moe.py", line 135, in apply
(EngineCore_DP1 pid=3074289)     trtllm_fp4_block_scale_routed_moe(**kwargs)
(EngineCore_DP1 pid=3074289)   File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/fused_moe/core.py", line 1850, in trtllm_fp4_block_scale_routed_moe
(EngineCore_DP1 pid=3074289)     return get_trtllm_moe_sm100_module().trtllm_fp4_block_scale_moe(
(EngineCore_DP1 pid=3074289)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP1 pid=3074289)   File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/fused_moe/core.py", line 1348, in trtllm_fp4_block_scale_moe_op
(EngineCore_DP1 pid=3074289)     _, tactic = tuner.choose_one(
(EngineCore_DP1 pid=3074289)                 ^^^^^^^^^^^^^^^^^
(EngineCore_DP1 pid=3074289)   File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/autotuner.py", line 457, in choose_one
(EngineCore_DP1 pid=3074289)     profiles = self._generate_optimization_profiles(tuning_config, inputs)
(EngineCore_DP1 pid=3074289)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP1 pid=3074289)   File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/autotuner.py", line 643, in _generate_optimization_profiles
(EngineCore_DP1 pid=3074289)     assert len(opt_shapes) > 0, "Empty tuning buckets are not allowed"
(EngineCore_DP1 pid=3074289)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP1 pid=3074289) AssertionError: Empty tuning buckets are not allowed

This error was initially thought to be limited to DP combined with certain MXFP4 backend choices, but the test matrix below shows it also affects TP. This PR updates the condition under which FlashInfer autotuning is skipped so that it covers the failing configurations above.
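
For a sense of what the updated skip condition looks like, here is a minimal sketch under stated assumptions: the function name, parameters, and structure are hypothetical and chosen only to illustrate the check described in this PR; they are not the actual identifiers in the vLLM change.

# Hedged sketch: hypothetical names, illustrating the skip condition described
# above, not the actual vLLM implementation.
def should_skip_flashinfer_autotune(
    data_parallel_size: int,
    tensor_parallel_size: int,
    uses_flashinfer_mxfp4_backend: bool,
    enforce_eager: bool,
) -> bool:
    """Return True when FlashInfer autotuning should be skipped at engine startup."""
    uses_dp_or_tp = data_parallel_size > 1 or tensor_parallel_size > 1
    return uses_dp_or_tp and uses_flashinfer_mxfp4_backend and enforce_eager

# Example: TP=2 with an MXFP4 backend in eager mode -> the autotuner is skipped.
assert should_skip_flashinfer_autotune(1, 2, True, True)
# A single-GPU run keeps autotuning enabled.
assert not should_skip_flashinfer_autotune(1, 1, True, True)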

Test Plan and Test Results

B200 DP

VLLM_ALL2ALL_BACKEND="deepep_high_throughput" canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 9010

  • PR Pass
  • main Fail

VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 9010 --enforce-eager

  • PR Pass
  • main Pass

VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 9010 --enforce-eager

  • PR Pass
  • main Fail

VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 9010 --enforce-eager

  • PR Pass
  • main Pass

B200 TP

VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 1 --tensor-parallel-size 2 --no-enable-prefix-caching --port 9010 --enforce-eager

  • PR Pass
  • main Fail

VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 1 --tensor-parallel-size 2 --no-enable-prefix-caching --port 9010 --enforce-eager

  • PR Pass
  • main Fail

VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 1 --tensor-parallel-size 2 --no-enable-prefix-caching --port 9010 --enforce-eager

  • PR Pass
  • main Fail

H100 DP

VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 9010 --enforce-eager

  • PR Pass
  • main Pass

H100 TP

VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 canhazgpu run -g2 -- vllm serve openai/gpt-oss-20b --data-parallel-size 1 --tensor-parallel-size 2 --no-enable-prefix-caching --port 9010 --enforce-eager

  • PR Pass
  • main Fail

Varun Sundar Rabindranath added 2 commits October 29, 2025 10:56
@gemini-code-assist bot left a comment

Code Review

This pull request addresses a bug where the FlashInfer autotuner fails during engine startup under specific conditions: data or tensor parallelism combined with a FlashInfer MXFP4 backend in eager mode. The fix identifies this problematic configuration by checking for tensor or data parallelism, the use of any FlashInfer MXFP4 backend, and whether execution is in eager mode. The changes are well implemented, using clearly named boolean variables to keep the condition readable. The fix appears correct and complete based on the problem description and test results.

@varun-sundar-rabindranath (Contributor, Author)

@nvpohanh I verified that with the PR,

VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1   vllm serve openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel   --no-enable-prefix-caching  --port 9010

FlashInfer autotuning happens.

@varun-sundar-rabindranath (Contributor, Author)

cc @zyongye @mgoin PTAL! Thanks.

@mgoin added the bug and ready labels on Oct 30, 2025
@github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements on Oct 30, 2025
@nvpohanh (Contributor)

@elvischenv has also verified this. Thanks for the quick fix!

@nvpohanh (Contributor)

@mgoin It would be great if we could include this in v0.11.1 to avoid an unexpected performance regression. Thanks!

@mgoin (Member) commented Oct 30, 2025

@vllm-bot merged commit e5e076c into vllm-project:main on Oct 30, 2025
46 of 48 checks passed

Labels

bug (Something isn't working), gpt-oss (Related to GPT-OSS models), ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done

4 participants