[Kernel] DP + EP : GPTOSS + DeepEP-HighThroughput #22907
Conversation
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Code Review
This pull request integrates GPT-OSS with DeepEPHT, introducing a new path for Mixture-of-Experts (MoE) layers using flashinfer kernels, specifically for data and expert parallelism. It adds support for mxfp8 quantization and a new trtllm_moe layer. The changes are quite extensive. My review has identified some leftover debugging code and comments in vllm/model_executor/layers/quantization/mxfp4.py which should be removed before this pull request is merged. Given that the pull request description indicates debugging is still in progress, these findings serve as a reminder for cleanup.
```python
if False:
    # TODO(varun) : remove before landing
    return self._route_and_experts_example(
        layer, x, router_logits, top_k, renormalize, use_grouped_topk,
        topk_group, num_expert_group, global_num_experts, expert_map,
        custom_routing_function, scoring_func, e_score_correction_bias,
        apply_router_weight_on_input, activation, enable_eplb,
        expert_load_view, logical_to_physical_map,
        logical_replica_count)
```
```python
else:
    #pass

    if (envs.VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8
            or envs.VLLM_USE_FLASHINFER_MOE_MXFP4_BF16):
        # B200 code ??
        # Quant config shouldn't be None !!
        return TrtLlmGenExperts(moe)
    else:
        # H100 code ??
        # you use matmul_ogs kernel here!
        raise NotImplementedError(
            "Mxfp4 does not support non-batched experts format for EP")
```
This else block contains leftover debugging comments and a #pass statement. These should be removed for production code to improve clarity and maintainability.
```python
else:
    if (envs.VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8
            or envs.VLLM_USE_FLASHINFER_MOE_MXFP4_BF16):
        return TrtLlmGenExperts(moe)
    else:
        raise NotImplementedError(
            "Mxfp4 does not support non-batched experts format for EP")
```
Also, a side note: please use the evaluation strategy in the recipe instead of …
This pull request has merge conflicts that must be resolved before it can be merged.
```python
    logical_replica_count: Optional[torch.Tensor] = None
) -> torch.Tensor:

    topk_weights, topk_ids = FusedMoE.select_experts(
```
`topk_ids` and `topk_weights` need to be local, and non-local experts' ids should be -1.
Alternatively, use global `topk_ids` and `topk_weights`, and provide `local_expert_offset` and `local_num_experts`.
The `topk_ids` and `topk_weights` get processed in the all2alls into the second form described above (global `topk_ids` and `topk_weights`, plus `local_expert_offset` and `local_num_experts`). I verified that this is correct.
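For illustration, here is a minimal sketch of the first option described above, i.e. remapping global expert ids to local ids and marking non-local experts as -1. The helper name and shapes are hypothetical, not code from this PR:

```python
import torch


def mask_to_local_expert_ids(topk_ids: torch.Tensor,
                             local_expert_offset: int,
                             local_num_experts: int) -> torch.Tensor:
    """Map global top-k expert ids to local ids; non-local experts become -1.

    topk_ids: (num_tokens, top_k) tensor of global expert indices.
    """
    local_ids = topk_ids - local_expert_offset
    in_range = (local_ids >= 0) & (local_ids < local_num_experts)
    return torch.where(in_range, local_ids, torch.full_like(local_ids, -1))


# Example: 4 global experts, this rank owns experts [2, 3]:
# mask_to_local_expert_ids(torch.tensor([[0, 3], [2, 1]]),
#                          local_expert_offset=2, local_num_experts=2)
# -> tensor([[-1, 1], [0, -1]])
```

The second option, which the reply above says this PR relies on, skips this remapping and instead hands the kernel the global ids together with `local_expert_offset` and `local_num_experts`.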
```python
        None,
        "tile_tokens_dim":
        self._get_tile_tokens_dim(x_quant, topk, local_num_experts),
        "routing_method_type":
```
`routing_method_type` is hardcoded to renormalize. Maybe add an assertion above to make sure it's not using a different routing method.
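A sketch of the kind of guard being suggested, using the routing arguments visible in the quoted snippets; the function name, placement, and messages are assumptions:

```python
from typing import Callable, Optional


def assert_renormalize_routing(
        renormalize: bool,
        use_grouped_topk: bool,
        custom_routing_function: Optional[Callable]) -> None:
    """Hypothetical guard: the kernel call hardcodes the renormalize
    routing method, so reject layers configured with anything else."""
    assert renormalize and not use_grouped_topk, (
        "routing_method_type is hardcoded to renormalize on this path")
    assert custom_routing_function is None, (
        "custom routing functions are not supported on this path")
```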
Update on debugging:
| "MX-FP8 quantization. Please install it with" \ | ||
| "`pip install flashinfer`") from err | ||
|
|
||
| return mxfp8_quantize(x, is_sf_swizzled_layout=False) |
@IwakuraRein is this the right way to quantize bf16 activations to fp8? Thanks.
Yes, when you are using the mxfp4 x mxfp8 path.
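For context, a self-contained sketch of the pattern in the quoted snippet; the error message and the `is_sf_swizzled_layout=False` argument come from the snippet, while the wrapper name and the top-level `flashinfer` import path are assumptions:

```python
import torch


def quantize_activations_mxfp8(x: torch.Tensor):
    # Sketch only: quantize bf16 activations to MX-FP8 for the
    # mxfp4 weights x mxfp8 activations path discussed above.
    try:
        from flashinfer import mxfp8_quantize  # import path assumed
    except ImportError as err:
        raise ImportError(
            "flashinfer is required for "
            "MX-FP8 quantization. Please install it with "
            "`pip install flashinfer`") from err
    # is_sf_swizzled_layout=False is taken verbatim from the quoted snippet.
    return mxfp8_quantize(x, is_sf_swizzled_layout=False)
```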
If `torch.compile` is using garbage values to initialize …
Purpose
Integrate GPTOSS with DeepEPHTPrepareFinalize
Commands:

server command:

```
VLLM_ALL2ALL_BACKEND="deepep_high_throughput" vllm serve openai/gpt-oss-20b --port 9010 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching
```

lm_eval command:

```
lm_eval --model local-completions --tasks gsm8k --model_args model=openai/gpt-oss-20b,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100
```

Issue: The server sometimes hangs / reports an IMA. When the server runs through, the lm_eval outputs are good and match main TP.
Debugging:

This PR uses `trtllm_fp4_block_scale_routed_moe` from flashinfer. I narrowed the issue down to the flashinfer kernel: `mPtrExpertCounts` is not big enough. I am still debugging this.
Test Plan
Test Result
(Optional) Documentation Update