Conversation

@varun-sundar-rabindranath (Contributor) commented Aug 14, 2025

Purpose

Integrate GPTOSS with DeepEPHTPrepareFinalize

Commands:
server command: VLLM_ALL2ALL_BACKEND="deepep_high_throughput" vllm serve openai/gpt-oss-20b --port 9010 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching
lm_eval command: lm_eval --model local-completions --tasks gsm8k --model_args model=openai/gpt-oss-20b,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100

Issue: The server sometimes hangs / reports an IMA (illegal memory access). When the server runs through, the lm_eval outputs are good. They look like:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.31|±  |0.0465|
|     |       |strict-match    |     5|exact_match|↑  | 0.24|±  |0.0429|

and match the results from main with TP.

Debugging:
This PR uses trtllm_fp4_block_scale_routed_moe from flashinfer. I narrowed the issue down to the flashinfer kernel.

  1. One issue is that link needs to be 256; otherwise it'll blow up at link because mPtrExpertCounts is not big enough.
     I am still debugging this.

Test Plan

Test Result

(Optional) Documentation Update


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Varun Sundar Rabindranath added 8 commits August 8, 2025 23:12
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@varun-sundar-rabindranath varun-sundar-rabindranath marked this pull request as draft August 14, 2025 14:16
@mergify mergify bot added the gpt-oss Related to GPT-OSS models label Aug 14, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request integrates GPT-OSS with DeepEPHT, introducing a new path for Mixture-of-Experts (MoE) layers using flashinfer kernels, specifically for data and expert parallelism. It adds support for mxfp8 quantization and a new trtllm_moe layer. The changes are quite extensive. My review has identified some leftover debugging code and comments in vllm/model_executor/layers/quantization/mxfp4.py which should be removed before this pull request is merged. Given that the pull request description indicates debugging is still in progress, these findings serve as a reminder for cleanup.

Comment on lines +586 to +594
if False:
    # TODO(varun) : remove before landing
    return self._route_and_experts_example(
        layer, x, router_logits, top_k, renormalize, use_grouped_topk,
        topk_group, num_expert_group, global_num_experts, expert_map,
        custom_routing_function, scoring_func, e_score_correction_bias,
        apply_router_weight_on_input, activation, enable_eplb,
        expert_load_view, logical_to_physical_map,
        logical_replica_count)

critical

This block of code is disabled with if False: and contains a TODO to remove it before landing. This debugging code should be removed from the final version of the pull request.

Comment on lines +395 to +407
        else:
            #pass

            if (envs.VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8
                    or envs.VLLM_USE_FLASHINFER_MOE_MXFP4_BF16):
                # B200 code ??
                # Quant config shouldn't be None !!
                return TrtLlmGenExperts(moe)
            else:
                # H100 code ??
                # you use matmul_ogs kernel here!
                raise NotImplementedError(
                    "Mxfp4 does not support non-batched experts format for EP")

high

This else block contains leftover debugging comments and a #pass statement. These should be removed for production code to improve clarity and maintainability.

        else:
            if (envs.VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8
                    or envs.VLLM_USE_FLASHINFER_MOE_MXFP4_BF16):
                return TrtLlmGenExperts(moe)
            else:
                raise NotImplementedError(
                    "Mxfp4 does not support non-batched experts format for EP")

@zyongye (Member) commented Aug 14, 2025

Also, a side note: please use the evaluation strategy in the recipe instead of lm_eval for this model.

@mergify

mergify bot commented Aug 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 14, 2025
logical_replica_count: Optional[torch.Tensor] = None
) -> torch.Tensor:

topk_weights, topk_ids = FusedMoE.select_experts(
Contributor

topk_ids and topk_weights need to be local, and non-local experts' ids should be -1.

Or use global topk_ids and topk_weights, and provide local_expert_offset and local_num_experts.
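
To make the contract concrete, here is a minimal sketch (illustrative only, not code from this PR) of the first option, assuming each rank owns a contiguous block of local_num_experts experts starting at local_expert_offset:

import torch

def global_to_local_topk_ids(topk_ids: torch.Tensor,
                             local_expert_offset: int,
                             local_num_experts: int) -> torch.Tensor:
    # Shift global expert ids into this rank's local id space and mark every
    # expert that lives on another rank with -1 so the kernel skips it.
    local_ids = topk_ids - local_expert_offset
    on_this_rank = (local_ids >= 0) & (local_ids < local_num_experts)
    return torch.where(on_this_rank, local_ids,
                       torch.full_like(local_ids, -1))

# Example: 8 experts over 2 ranks (4 each). Rank 1 owns global experts 4..7,
# so global id 2 maps to -1 and global id 5 maps to local id 1.
topk_ids = torch.tensor([[2, 5], [4, 7]])
print(global_to_local_topk_ids(topk_ids, local_expert_offset=4,
                               local_num_experts=4))
# tensor([[-1,  1],
#         [ 0,  3]])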

Contributor Author

The topk_ids and topk_weights get processed into the second form ("use global topk_ids and topk_weights, and provide local_expert_offset and local_num_experts") in the all2alls. I verified that this is correct.

None,
"tile_tokens_dim":
self._get_tile_tokens_dim(x_quant, topk, local_num_experts),
"routing_method_type":
Contributor

routing_method_type is hardcoded to renormalize. Maybe add an assertion above to make sure it's not being called with a different routing method; see the sketch below.
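
A minimal sketch of such a guard (the enum name and call site are assumptions here, not this PR's code):

from enum import Enum

class RoutingMethodType(Enum):
    # Stand-in for the real flashinfer routing-method enum.
    Renormalize = 1
    DeepSeekV3 = 2

def assert_renormalize_routing(routing_method_type: RoutingMethodType) -> None:
    # Fail loudly if a caller configures a routing method this kernel path
    # does not implement, instead of silently hardcoding "renormalize".
    if routing_method_type is not RoutingMethodType.Renormalize:
        raise NotImplementedError(
            f"Routing method {routing_method_type} is not supported by the "
            "trtllm_fp4_block_scale_routed_moe path; only renormalize is.")

assert_renormalize_routing(RoutingMethodType.Renormalize)  # passes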

@varun-sundar-rabindranath (Contributor Author)

Update on debugging:

  • I got VLLM_USE_FLASHINFER_MOE_MXFP4_BF16 working reliably by explicitly adding the --enforce-eager option (full command sketched below).
  • VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 still locks up. I have narrowed it down to mPermuteGemm1.run, but as far as I can tell all the setup is correct and the permute indices inside are also computed correctly. Looking into this further.
  • Hunch: I think the interaction with torch.compile might have something to do with this.
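
For reference, the working BF16 configuration is the server command from the description with the flashinfer BF16 path enabled and eager mode forced (the env-var value of 1 is an assumption here):
VLLM_ALL2ALL_BACKEND="deepep_high_throughput" VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 vllm serve openai/gpt-oss-20b --port 9010 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --enforce-eager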

"MX-FP8 quantization. Please install it with" \
"`pip install flashinfer`") from err

return mxfp8_quantize(x, is_sf_swizzled_layout=False)
Contributor Author

@IwakuraRein is this the right way to quantize bf16 activations to fp8? Thanks.

Contributor

Yes, when you are using the mxfp4 x mxfp8 path.
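
For context, a self-contained sketch of the helper being discussed (the flashinfer import location is assumed; the diff excerpt above only shows the tail of it):

import torch

def quantize_activations_to_mxfp8(x: torch.Tensor):
    """Quantize bf16 activations to MX-FP8 for the mxfp4 x mxfp8 MoE path."""
    try:
        from flashinfer import mxfp8_quantize  # import location assumed
    except ImportError as err:
        raise ImportError(
            "flashinfer is required for MX-FP8 quantization. "
            "Please install it with `pip install flashinfer`.") from err
    # is_sf_swizzled_layout=False keeps the scale factors in a linear
    # (non-swizzled) layout, matching the call in the diff above.
    return mxfp8_quantize(x, is_sf_swizzled_layout=False)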

@mgoin changed the title from "[Kernel] DP + EP : GPTOSS + Deepep-HightThroughput" to "[Kernel] DP + EP : GPTOSS + DeepEP-HighThroughput" on Aug 18, 2025
@IwakuraRein (Contributor)

  • Hunch: I think the interaction with torch.compile might have something to do with this.

If torch.compile is using garbage values to initialize routing_logits and/or topk_ids, then there might be an illegal memory access.


Labels

gpt-oss (Related to GPT-OSS models), needs-rebase
