
Conversation

@zyongye (Member) commented Aug 26, 2025

Rebased and cleaned-up version of #22907, since @varun-sundar-rabindranath is out.

Model: 120B

Reasoning Effort    GPQA        AIME25
Low                 0.6540404   0.5458
Mid                 0.71906     0.7791
High                0.79356

Benchmark on the random dataset for DEP 4 (DP=4 with expert parallelism)

python benchmark_serving.py --model "openai/gpt-oss-120b" --dataset-name random --ignore-eos --num-prompts 2048 --random-input-len 1000 --random-output-len 1000 --port 8000 --backend vllm
============ Serving Benchmark Result ============
Successful requests:                     2048      
Benchmark duration (s):                  78.73     
Total input tokens:                      2046237   
Total generated tokens:                  2048000   
Request throughput (req/s):              26.01     
Output token throughput (tok/s):         26012.98  
Total Token throughput (tok/s):          52003.57  
---------------Time to First Token----------------
Mean TTFT (ms):                          12332.05  
Median TTFT (ms):                        12241.46  
P99 TTFT (ms):                           21703.15  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.66     
Median TPOT (ms):                        65.87     
P99 TPOT (ms):                           74.82     
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.79     
Median ITL (ms):                         54.67     
P99 ITL (ms):                            303.78    
==================================================

How to run

VLLM_ALL2ALL_BACKEND="deepep_high_throughput" vllm serve /data/xmo/yongye/models/gpt-oss-120b-hf \
--data-parallel-size 4 \
--enable-expert-parallel
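
Once the server is up, a quick sanity check can be run against the OpenAI-compatible endpoint. This is a minimal sketch, assuming the default port 8000 and that the model name below matches whatever name/path was passed to vllm serve:

# Minimal smoke test against the server started above (assumptions: port 8000,
# served model name "openai/gpt-oss-120b"; adjust to the name/path you used).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)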

Requires this PR from flashinfer, as well as the kernel fix, to work correctly. (cc @IwakuraRein)

Edit:
This PR works with top-of-tree (ToT) FlashInfer.

Varun Sundar Rabindranath and others added 5 commits August 25, 2025 18:51
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>

Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@mergify bot added the gpt-oss (Related to GPT-OSS models) label Aug 26, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds Data Parallelism and Expert Parallelism support for GPT-OSS models, particularly with the deepep-ht communication kernel and mxfp4 quantization. The changes introduce a new kernel wrapper, TrtLlmGenExperts, to leverage flashinfer's trtllm_fp4_block_scale_routed_moe kernel. While the overall structure for integrating this new path seems correct, I've identified two critical issues in the new TrtLlmGenExperts implementation that will prevent expert parallelism from functioning correctly. These issues need to be addressed to ensure the correctness and functionality of the feature.

        return True

    def supports_expert_map(self) -> bool:
        return False
Contributor

critical

The supports_expert_map method should return True. Currently, it returns False, which will cause a ValueError in FusedMoEModularKernel when expert parallelism is enabled (ep_size > 1), as an expert_map will be provided. This prevents the expert parallelism feature from working.

Suggested change:
-        return False
+        return True

Contributor

This should be True.

Comment on lines 86 to 185
    def apply(
        self,
        output: torch.Tensor,
        hidden_states: torch.Tensor,
        w1: torch.Tensor,
        w2: torch.Tensor,
        topk_weights: torch.Tensor,
        topk_ids: torch.Tensor,
        activation: str,
        global_num_experts: int,
        expert_map: Optional[torch.Tensor],
        w1_scale: Optional[torch.Tensor],
        w2_scale: Optional[torch.Tensor],
        w1_zp: Optional[torch.Tensor],
        w2_zp: Optional[torch.Tensor],
        a1q_scale: Optional[torch.Tensor],
        a2_scale: Optional[torch.Tensor],
        workspace13: torch.Tensor,
        workspace2: torch.Tensor,
        expert_tokens_meta: Optional[mk.ExpertTokensMetadata],
        apply_router_weight_on_input: bool,
    ):
        topk = topk_ids.size(-1)
        local_num_experts = w1.size(0)
        intermediate_size = w2.size(1)
        # First global expert ID owned by this EP rank.
        local_expert_offset = self.moe.ep_rank * local_num_experts

        x_quant = hidden_states
        x_scale = a1q_scale
        if x_scale is not None:
            x_scale = x_scale.view(torch.float8_e4m3fn).reshape(
                *x_quant.shape[:-1], -1)

        # Pack routing info for the kernel: expert ID in the high 16 bits,
        # bf16 routing weight (reinterpreted as int16) in the low 16 bits.
        packed_tensor = (topk_ids.to(torch.int32) << 16) | topk_weights.to(
            torch.bfloat16).view(torch.int16)

        assert w1_scale is not None
        assert w2_scale is not None
        kwargs = {
            "topk_ids": packed_tensor,
            "routing_bias": None,
            "hidden_states": x_quant,
            "hidden_states_scale": x_scale,
            "gemm1_weights": w1,
            "gemm1_weights_scale": w1_scale,
            "gemm1_bias": self.w13_bias,
            "gemm1_alpha": self.gemm1_alpha,
            "gemm1_beta": self.gemm1_beta,
            "gemm1_clamp_limit": self.gemm1_clamp_limit,
            "gemm2_weights": w2,
            "gemm2_weights_scale": w2_scale,
            "gemm2_bias": self.w2_bias,
            "output1_scale_scalar": None,
            "output1_scale_gate_scalar": None,
            "output2_scale_scalar": None,
            "num_experts": global_num_experts,
            "top_k": topk,
            "n_group": None,
            "topk_group": None,
            "intermediate_size": intermediate_size,
            "local_expert_offset": local_expert_offset,
            "local_num_experts": local_num_experts,
            "routed_scaling_factor": None,
            "tile_tokens_dim": self._get_tile_tokens_dim(
                x_quant, topk, local_num_experts),
            "routing_method_type": 1,
            "do_finalize": True,
            "output": output,
        }

        from flashinfer import trtllm_fp4_block_scale_routed_moe
        trtllm_fp4_block_scale_routed_moe(**kwargs)
        return output
Contributor

critical

There appears to be a mismatch in the expert ID format. The topk_ids received by this method are local expert IDs, as they are processed by a prepare_finalize implementation (e.g., DeepEPHTPrepareAndFinalize) which maps global IDs to local ones. However, the trtllm_fp4_block_scale_routed_moe kernel seems to expect global expert IDs, especially given the presence of the local_expert_offset parameter, which is typically used to identify the range of global expert IDs managed by the current rank. Passing local IDs to a kernel expecting global IDs will result in incorrect expert selection and computation. Please ensure the kernel receives expert IDs in the expected format.
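
If the bot's reading is correct, the adjustment would look roughly like the sketch below: shift the local expert IDs back into the global ID space before packing them for the kernel. This is a hedged illustration, not the merged fix; the helper name and the assumption that invalid slots are marked with -1 are mine.

import torch

def pack_routing(topk_ids: torch.Tensor, topk_weights: torch.Tensor,
                 local_expert_offset: int) -> torch.Tensor:
    # Map local expert IDs back to global ones; leave invalid slots
    # (assumed to be -1) untouched.
    ids = topk_ids.to(torch.int32)
    ids = torch.where(ids >= 0, ids + local_expert_offset, ids)
    # Same packing as in the PR: expert ID in the high 16 bits, bf16 routing
    # weight reinterpreted as int16 in the low 16 bits.
    return (ids << 16) | topk_weights.to(torch.bfloat16).view(torch.int16)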

@zyongye changed the title from "DP/EP Support for GPT-OSS with deepep-ht comm kernel" to "DP/EP Support for GPT-OSS with deepep-ht comm kernel on SM100" on Aug 26, 2025
@zyongye changed the title from "DP/EP Support for GPT-OSS with deepep-ht comm kernel on SM100" to "DP/EP Support for gpt-oss with deepep-ht comm kernel on SM100" on Aug 26, 2025
@varun-sundar-rabindranath (Contributor)

Thanks @zyongye! LGTM!

    # Note: init_prepare_finalize should only be called by
    # prepare_communication_buffer_for_model.
-   def init_prepare_finalize(self):
+   def init_prepare_finalize(self, layer: Any):
Collaborator

The layer should have torch.nn.Module type.
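
A minimal sketch of the suggested typing (only the annotation changes; the rest of the signature stays as in the diff above):

import torch

# Note: init_prepare_finalize should only be called by
# prepare_communication_buffer_for_model.
def init_prepare_finalize(self, layer: torch.nn.Module):
    ...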

        self,
        prepare_finalize: FusedMoEPrepareAndFinalize,
        moe: FusedMoEConfig,
        layer: Any,
Collaborator

ditto

        prepare_finalize: FusedMoEPrepareAndFinalize,
        # TODO(bnell): Remove. Every layer should have an moe config object.
        moe: FusedMoEConfig,
        layer: Any,
Collaborator

ditto


class TrtLlmGenExperts(mk.FusedMoEPermuteExpertsUnpermute):

    def __init__(self, moe: FusedMoEConfig, layer: Any):
Collaborator

Can you pass the individual components here instead of the entire layer? Using the layer makes it harder to test.

Member Author

done

    ) -> mk.FusedMoEPermuteExpertsUnpermute:
        if (prepare_finalize.activation_format ==
                mk.FusedMoEActivationFormat.BatchedExperts):
            raise NotImplementedError(
Collaborator

If we want a graceful fallback instead of an error, you could overload maybe_make_prepare_finalize and make it return None for the unhandled cases.

Member Author

What can we fall back to? I don't know of any other kernel that has batched mxfp4.

Collaborator

The fallback would basically be disabling the all2all communication for this layer and using the non-batched kernels, but maybe erroring out would be better.
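
A toy, self-contained illustration of the fallback shape being discussed; everything here except the method name maybe_make_prepare_finalize is a placeholder, not vLLM code:

from enum import Enum, auto
from typing import Optional


class ActivationFormat(Enum):
    Standard = auto()
    BatchedExperts = auto()


def maybe_make_prepare_finalize(fmt: ActivationFormat) -> Optional[str]:
    # No batched mxfp4 experts kernel exists, so decline instead of raising;
    # the caller can then keep the plain (non-all2all) MoE path for this layer.
    if fmt is ActivationFormat.BatchedExperts:
        return None
    return "deepep_ht_prepare_finalize"  # placeholder for the real object


assert maybe_make_prepare_finalize(ActivationFormat.BatchedExperts) is None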

@zyongye (Member Author) commented Aug 26, 2025

@bnellnm I updated the change. Could you please take a look?

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
        self,
        prepare_finalize: mk.FusedMoEPrepareAndFinalize,
        moe: FusedMoEConfig,
        layer: Any,
Collaborator

nit: all these layer args should be torch.nn.Module also.

@bnellnm (Collaborator) left a comment

LGTM. All the layer arguments could be torch.nn.Module instead of Any.

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye (Member Author) commented Aug 26, 2025

Done with the nit fix. I will run some performance benchmarks tonight after the B200 is freed.

@mergify bot commented Aug 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Aug 27, 2025
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@mergify bot removed the needs-rebase label Aug 27, 2025
@mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 27, 2025
Comment on lines +19 to +24
    gemm1_alpha,
    gemm1_beta,
    gemm1_clamp_limit,
    w13_bias,
    w2_bias,
    max_capture_size,
Member

Missing types

@mgoin merged commit 082cc07 into vllm-project:main Aug 27, 2025
59 checks passed
@zyongye deleted the gptoss-deepep-ht branch August 27, 2025 23:11
@weireweire (Contributor) commented Sep 8, 2025

In my test, VLLM_ALL2ALL_BACKEND="deepep_high_throughput" invalidates CUDA graphs. Is this a known issue?

Update:

I found a note in the code saying that the DeepEP high-throughput kernels are not CUDA graph compatible, but the low-latency kernels are. However, the low-latency kernels need BatchedExperts, which the mxfp4 MoE path does not support.

So for now the naive comm kernel is the best option for me, and #23964 may be better once it is merged.
