
Conversation

@zyongye (Member) commented Aug 26, 2025

Rebased and cleaned-up version of #22907, since @varun-sundar-rabindranath is out.

Model: 120B

Reasoning Effort    GPQA        AIME25
Low                 0.6540404   0.5458
Mid                 0.71906     0.7791
High                0.79356

Benchmark on the random dataset for DEP 4 (DP=4 with expert parallelism)

python benchmark_serving.py --model "openai/gpt-oss-120b" --dataset-name random --ignore-eos --num-prompts 2048 --random-input-len 1000 --random-output-len 1000 --port 8000 --backend vllm
============ Serving Benchmark Result ============
Successful requests:                     2048      
Benchmark duration (s):                  78.73     
Total input tokens:                      2046237   
Total generated tokens:                  2048000   
Request throughput (req/s):              26.01     
Output token throughput (tok/s):         26012.98  
Total Token throughput (tok/s):          52003.57  
---------------Time to First Token----------------
Mean TTFT (ms):                          12332.05  
Median TTFT (ms):                        12241.46  
P99 TTFT (ms):                           21703.15  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.66     
Median TPOT (ms):                        65.87     
P99 TPOT (ms):                           74.82     
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.79     
Median ITL (ms):                         54.67     
P99 ITL (ms):                            303.78    
==================================================

How to run

VLLM_ALL2ALL_BACKEND="deepep_high_throughput" vllm serve /data/xmo/yongye/models/gpt-oss-120b-hf \
--data-parallel-size 4 \
--enable-expert-parallel
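
Once the server is up, a quick sanity check can be run against the OpenAI-compatible endpoint. This is a minimal sketch, assuming the default port 8000 and that the model name below matches whatever name/path was passed to vllm serve:

# Minimal smoke test against the server started above (assumptions: port 8000,
# served model name "openai/gpt-oss-120b"; adjust to the name/path you used).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)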

Requires this PR from flashinfer, as well as the kernel fix, to work correctly. (cc @IwakuraRein)

Edit:
This PR works with top-of-tree (ToT) FlashInfer.

Varun Sundar Rabindranath and others added 5 commits August 25, 2025 18:51
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>

Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@mergify bot added the gpt-oss (Related to GPT-OSS models) label Aug 26, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds Data Parallelism and Expert Parallelism support for GPT-OSS models, particularly with the deepep-ht communication kernel and mxfp4 quantization. The changes introduce a new kernel wrapper, TrtLlmGenExperts, to leverage flashinfer's trtllm_fp4_block_scale_routed_moe kernel. While the overall structure for integrating this new path seems correct, I've identified two critical issues in the new TrtLlmGenExperts implementation that will prevent expert parallelism from functioning correctly. These issues need to be addressed to ensure the correctness and functionality of the feature.

        return True

    def supports_expert_map(self) -> bool:
        return False
Contributor

critical

The supports_expert_map method should return True. Currently, it returns False, which will cause a ValueError in FusedMoEModularKernel when expert parallelism is enabled (ep_size > 1), as an expert_map will be provided. This prevents the expert parallelism feature from working.

Suggested change:
-        return False
+        return True

Contributor

This should be True.

Comment on lines 86 to 185
    def apply(
        self,
        output: torch.Tensor,
        hidden_states: torch.Tensor,
        w1: torch.Tensor,
        w2: torch.Tensor,
        topk_weights: torch.Tensor,
        topk_ids: torch.Tensor,
        activation: str,
        global_num_experts: int,
        expert_map: Optional[torch.Tensor],
        w1_scale: Optional[torch.Tensor],
        w2_scale: Optional[torch.Tensor],
        w1_zp: Optional[torch.Tensor],
        w2_zp: Optional[torch.Tensor],
        a1q_scale: Optional[torch.Tensor],
        a2_scale: Optional[torch.Tensor],
        workspace13: torch.Tensor,
        workspace2: torch.Tensor,
        expert_tokens_meta: Optional[mk.ExpertTokensMetadata],
        apply_router_weight_on_input: bool,
    ):
        topk = topk_ids.size(-1)
        local_num_experts = w1.size(0)
        intermediate_size = w2.size(1)
        # First global expert ID owned by this EP rank.
        local_expert_offset = self.moe.ep_rank * local_num_experts

        x_quant = hidden_states
        x_scale = a1q_scale
        if x_scale is not None:
            x_scale = x_scale.view(torch.float8_e4m3fn).reshape(
                *x_quant.shape[:-1], -1)

        # Pack routing info for the kernel: expert ID in the high 16 bits,
        # bf16 routing weight (reinterpreted as int16) in the low 16 bits.
        packed_tensor = (topk_ids.to(torch.int32) << 16) | topk_weights.to(
            torch.bfloat16).view(torch.int16)

        assert w1_scale is not None
        assert w2_scale is not None
        kwargs = {
            "topk_ids": packed_tensor,
            "routing_bias": None,
            "hidden_states": x_quant,
            "hidden_states_scale": x_scale,
            "gemm1_weights": w1,
            "gemm1_weights_scale": w1_scale,
            "gemm1_bias": self.w13_bias,
            "gemm1_alpha": self.gemm1_alpha,
            "gemm1_beta": self.gemm1_beta,
            "gemm1_clamp_limit": self.gemm1_clamp_limit,
            "gemm2_weights": w2,
            "gemm2_weights_scale": w2_scale,
            "gemm2_bias": self.w2_bias,
            "output1_scale_scalar": None,
            "output1_scale_gate_scalar": None,
            "output2_scale_scalar": None,
            "num_experts": global_num_experts,
            "top_k": topk,
            "n_group": None,
            "topk_group": None,
            "intermediate_size": intermediate_size,
            "local_expert_offset": local_expert_offset,
            "local_num_experts": local_num_experts,
            "routed_scaling_factor": None,
            "tile_tokens_dim": self._get_tile_tokens_dim(
                x_quant, topk, local_num_experts),
            "routing_method_type": 1,
            "do_finalize": True,
            "output": output,
        }

        from flashinfer import trtllm_fp4_block_scale_routed_moe
        trtllm_fp4_block_scale_routed_moe(**kwargs)
        return output
Contributor

critical

There appears to be a mismatch in the expert ID format. The topk_ids received by this method are local expert IDs, as they are processed by a prepare_finalize implementation (e.g., DeepEPHTPrepareAndFinalize) which maps global IDs to local ones. However, the trtllm_fp4_block_scale_routed_moe kernel seems to expect global expert IDs, especially given the presence of the local_expert_offset parameter, which is typically used to identify the range of global expert IDs managed by the current rank. Passing local IDs to a kernel expecting global IDs will result in incorrect expert selection and computation. Please ensure the kernel receives expert IDs in the expected format.
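
If the bot's reading is correct, the adjustment would look roughly like the sketch below: shift the local expert IDs back into the global ID space before packing them for the kernel. This is a hedged illustration, not the merged fix; the helper name and the assumption that invalid slots are marked with -1 are mine.

import torch

def pack_routing(topk_ids: torch.Tensor, topk_weights: torch.Tensor,
                 local_expert_offset: int) -> torch.Tensor:
    # Map local expert IDs back to global ones; leave invalid slots
    # (assumed to be -1) untouched.
    ids = topk_ids.to(torch.int32)
    ids = torch.where(ids >= 0, ids + local_expert_offset, ids)
    # Same packing as in the PR: expert ID in the high 16 bits, bf16 routing
    # weight reinterpreted as int16 in the low 16 bits.
    return (ids << 16) | topk_weights.to(torch.bfloat16).view(torch.int16)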

@zyongye changed the title from "DP/EP Support for GPT-OSS with deepep-ht comm kernel" to "DP/EP Support for GPT-OSS with deepep-ht comm kernel on SM100" on Aug 26, 2025
@zyongye changed the title from "DP/EP Support for GPT-OSS with deepep-ht comm kernel on SM100" to "DP/EP Support for gpt-oss with deepep-ht comm kernel on SM100" on Aug 26, 2025
@varun-sundar-rabindranath (Contributor)

Thanks @zyongye! LGTM!

    # Note: init_prepare_finalize should only be called by
    # prepare_communication_buffer_for_model.
-   def init_prepare_finalize(self):
+   def init_prepare_finalize(self, layer: Any):
Collaborator

The layer should have torch.nn.Module type.
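
A minimal sketch of the suggested typing (only the annotation changes; the rest of the signature stays as in the diff above):

import torch

# Note: init_prepare_finalize should only be called by
# prepare_communication_buffer_for_model.
def init_prepare_finalize(self, layer: torch.nn.Module):
    ...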

        self,
        prepare_finalize: FusedMoEPrepareAndFinalize,
        moe: FusedMoEConfig,
        layer: Any,
Collaborator

ditto

        prepare_finalize: FusedMoEPrepareAndFinalize,
        # TODO(bnell): Remove. Every layer should have an moe config object.
        moe: FusedMoEConfig,
        layer: Any,
Collaborator

ditto


class TrtLlmGenExperts(mk.FusedMoEPermuteExpertsUnpermute):

    def __init__(self, moe: FusedMoEConfig, layer: Any):
Collaborator

Can you pass the individual components here instead of the entire layer? Using the layer makes it harder to test.

Member Author

done

    ) -> mk.FusedMoEPermuteExpertsUnpermute:
        if (prepare_finalize.activation_format ==
                mk.FusedMoEActivationFormat.BatchedExperts):
            raise NotImplementedError(
Collaborator

If we want a graceful fallback instead of an error, you could overload maybe_make_prepare_finalize and make it return None for the unhandled cases.

Member Author

What can we fall back to? I don't know of any other kernel that has batched mxfp4.

Collaborator

The fallback would basically be disabling the all2all communication for this layer and using the non-batched kernels, but maybe erroring out would be better.
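
A toy, self-contained illustration of the fallback shape being discussed; everything here except the method name maybe_make_prepare_finalize is a placeholder, not vLLM code:

from enum import Enum, auto
from typing import Optional


class ActivationFormat(Enum):
    Standard = auto()
    BatchedExperts = auto()


def maybe_make_prepare_finalize(fmt: ActivationFormat) -> Optional[str]:
    # No batched mxfp4 experts kernel exists, so decline instead of raising;
    # the caller can then keep the plain (non-all2all) MoE path for this layer.
    if fmt is ActivationFormat.BatchedExperts:
        return None
    return "deepep_ht_prepare_finalize"  # placeholder for the real object


assert maybe_make_prepare_finalize(ActivationFormat.BatchedExperts) is None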

@zyongye (Member Author) commented Aug 26, 2025

@bnellnm I updated the change. Could you please take a look?

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
        self,
        prepare_finalize: mk.FusedMoEPrepareAndFinalize,
        moe: FusedMoEConfig,
        layer: Any,
Collaborator

nit: all these layer args should be torch.nn.Module also.

@bnellnm (Collaborator) left a comment

LGTM. All the layer arguments could be torch.nn.Module instead of Any.

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye (Member Author) commented Aug 26, 2025

Done with the nit fix. I will run some performance benchmarks tonight after the B200 is freed.

@mergify bot commented Aug 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Aug 27, 2025
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@mergify bot removed the needs-rebase label Aug 27, 2025
@mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 27, 2025
Comment on lines +19 to +24
    gemm1_alpha,
    gemm1_beta,
    gemm1_clamp_limit,
    w13_bias,
    w2_bias,
    max_capture_size,
Member

Missing types

@mgoin merged commit 082cc07 into vllm-project:main Aug 27, 2025
59 checks passed
@zyongye deleted the gptoss-deepep-ht branch August 27, 2025 23:11
@weireweire (Contributor) commented Sep 8, 2025

In my test, VLLM_ALL2ALL_BACKEND="deepep_high_throughput" invalidates CUDA graphs. Is this a known issue?

Update:

I found a note in the code saying that the DeepEP high-throughput kernels are not CUDA graph compatible, but the low-latency kernels are. However, the low-latency kernels need BatchedExperts, which the mxfp4 MoE path does not support.

So for now the naive comm kernel is the best option for me, and #23964 may be better once it is merged.
