[MoE] Optimize fused MoE dispatch and workspace allocation with TP leader restriction #24831
Conversation
…g log for total tokens and update workspace comment Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
…o avoid _resize_cache under-allocation when TP>1; log once when expanding max_num_tokens Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
…e(1) (backend T) and max with config/observed; fixes under-allocation during LL/PPLX and profile runs Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces several key optimizations for Fused MoE layers, primarily by restricting all-to-all dispatch to Tensor Parallel leaders and improving workspace memory allocation. The logic for these optimizations appears correct and well-implemented. The added debug logging is also a welcome addition for better observability.
However, I've identified a significant maintainability concern with code duplication for the TP leader dispatch logic. This logic is copied in both the high-throughput and low-latency DeepEP finalization files. I've left a comment with a suggestion to refactor this into a shared utility to prevent potential future bugs. Addressing this will make the codebase more robust.
```python
# Only DP leader ranks (tp_rank == 0) should dispatch when TP > 1.
tp_world_size = get_tensor_model_parallel_world_size()
tp_rank_in_group = (get_tp_group().rank_in_group
                    if tp_world_size > 1 else 0)
if tp_world_size > 1 and tp_rank_in_group != 0:
    # Non-leader TP ranks send zero tokens to avoid duplicate dispatch.
    a1 = a1[:0]
    topk_ids = topk_ids[:0]
    topk_weights = topk_weights[:0]
```
This block of code for restricting dispatch to the Tensor Parallel leader is duplicated in vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py (lines 152-160).
Duplicating this logic creates a maintainability issue. If this logic needs to be updated in the future, it's easy to forget to update it in both places, which could lead to subtle and hard-to-debug inconsistencies between the high-throughput and low-latency MoE paths.
To improve maintainability and reduce the risk of future bugs, I recommend refactoring this logic into a shared utility function. For example:
```python
# In a shared utility module (imports assumed to come from vllm.distributed).
import torch

from vllm.distributed import (get_tensor_model_parallel_world_size,
                              get_tp_group)


def restrict_dispatch_to_tp_leader(
        *tensors: torch.Tensor) -> tuple[torch.Tensor, ...]:
    """If the current rank is not a TP leader, returns empty tensors."""
    tp_world_size = get_tensor_model_parallel_world_size()
    if tp_world_size <= 1:
        return tensors
    tp_rank_in_group = get_tp_group().rank_in_group
    if tp_rank_in_group != 0:
        return tuple(t[:0] for t in tensors)
    return tensors


# In prepare_async methods
a1, topk_ids, topk_weights = restrict_dispatch_to_tp_leader(
    a1, topk_ids, topk_weights
)
```

This would centralize the logic, making the code cleaner and safer to maintain.
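For clarity, here is a tiny standalone sketch (plain PyTorch with made-up shapes, not vLLM code) of what the `[:0]` slicing in the diff above does on a non-leader rank: the tensors become zero-token views, so the subsequent all-to-all dispatch receives nothing from that rank.

```python
import torch

# Hypothetical shapes: 16 tokens, hidden size 32, top-2 routing.
a1 = torch.randn(16, 32)
topk_ids = torch.randint(0, 8, (16, 2))
topk_weights = torch.rand(16, 2)

tp_rank_in_group = 1  # pretend this is a non-leader TP rank

if tp_rank_in_group != 0:
    # Zero-length slices along the token dimension: this rank now
    # contributes no tokens, so the dispatch is not duplicated across
    # TP ranks that share the same input.
    a1, topk_ids, topk_weights = a1[:0], topk_ids[:0], topk_weights[:0]

print(a1.shape, topk_ids.shape, topk_weights.shape)
# torch.Size([0, 32]) torch.Size([0, 2]) torch.Size([0, 2])
```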
Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
cc @tlrmchlsmth @bnellnm - I believe we are already using SP in this case, but correct me if I'm wrong.
We use SP for DeepSeekV3 (introduced in #24134), but we need to do it for Qwen and other MoE models. I'll put up a PR today.
I think @varun-sundar-rabindranath should check out this PR as well.
```python
# Tokens-per-expert capacity actually used by the backend for this
# call. For batched formats (DeepEP-LL / PPLX), aq has shape
# (E, T_backend, K).
# Prefer using aq.size(1) to avoid under-allocation during dummy/profile
# runs or when multiple dispatchers/ranks contribute tokens.
T_backend = aq.size(1) if aq.dim() == 3 else 0

# Fallback capacity from configuration/observation.
num_dispatchers = self.num_dispatchers
observed_M = a.size(0)
if self.max_num_tokens is None:
    T_cfg = observed_M * num_dispatchers
else:
    # Guard with observed_M to avoid under-estimation when TP > 1 or
    # during profiling runs.
    max_num_tokens = max(self.max_num_tokens, observed_M)
    if observed_M > self.max_num_tokens:
        with contextlib.suppress(Exception):
            logger.debug_once(
                "[MoE Debug] Increasing workspace max_num_tokens "
                "from configured=%d to observed=%d to avoid OOM. "
                "(num_dispatchers=%d, E=%d, N=%d, K=%d)",
                self.max_num_tokens, observed_M, num_dispatchers,
                num_experts, N, K)
    T_cfg = max_num_tokens * num_dispatchers

# Final capacity: honor backend's requested T if larger.
T_eff = max(T_backend, T_cfg)

workspace13 = (num_experts, T_eff, max(K, N))
workspace2 = (num_experts, T_eff, (N // 2))
output = (num_experts, T_eff, K)
```
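For intuition, here is a small arithmetic sketch of the capacity calculation above; all sizes (E, N, K and the token counts) are made-up example values, not numbers from this PR.

```python
num_experts, N, K = 8, 1024, 512   # hypothetical expert count and dims
num_dispatchers = 2
observed_M = 96        # tokens seen by this rank in the current call
configured_max = 64    # stand-in for self.max_num_tokens
T_backend = 128        # stand-in for aq.size(1) from the batched backend

# Guard the configured limit with the observed token count, then scale
# by the number of dispatchers.
T_cfg = max(configured_max, observed_M) * num_dispatchers   # 192
# Honor the backend's requested capacity if it is larger.
T_eff = max(T_backend, T_cfg)                               # 192

workspace13 = (num_experts, T_eff, max(K, N))   # (8, 192, 1024)
workspace2 = (num_experts, T_eff, N // 2)       # (8, 192, 512)
output = (num_experts, T_eff, K)                # (8, 192, 512)
print(workspace13, workspace2, output)
```

Without the observed_M guard, T_cfg would be only 64 * 2 = 128, and the workspace could be too small for the tokens actually delivered in this hypothetical call.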
I think these changes might be redundant with another PR that makes the workspaces effectively global. See #23693
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
@varun-sundar-rabindranath PTAL. Thanks.
This pull request has merge conflicts that must be resolved before it can be merged.
@skyloevil this should be covered by #24982, thank you!
Purpose
Optimize MoE (Mixture of Experts) performance by improving dispatch efficiency and workspace memory allocation. The changes address under-allocation issues during tensor parallel operations.
Changes
- Restrict the all-to-all dispatch to the TP leader rank (tp_rank == 0) when TP > 1, so non-leader ranks contribute zero tokens
- Base capacity calculation on aq.size(1) (backend T dimension)
- Guard against under-allocation when TP > 1 by using observed M values
- Prevent _resize_cache under-allocation during distributed operations (see the sketch after this list)
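To illustrate the under-allocation failure mode these changes guard against, here is a minimal sketch in plain PyTorch with hypothetical sizes; it is not vLLM's `_resize_cache`, it only mimics the general reshape-into-a-preallocated-buffer pattern.

```python
import torch

E, K = 8, 64
T_configured = 16   # capacity the workspace was originally sized for
T_actual = 24       # tokens actually delivered (e.g. TP > 1 or profiling)

# Workspace sized from the configured capacity only.
workspace = torch.empty(E * T_configured * K)

try:
    # Reinterpreting the flat buffer with the larger token count fails:
    # the buffer does not hold E * T_actual * K elements.
    workspace[: E * T_actual * K].view(E, T_actual, K)
except RuntimeError as err:
    print("under-allocation:", err)

# Sizing with max(configured, observed) avoids the problem.
workspace = torch.empty(E * max(T_configured, T_actual) * K)
view = workspace[: E * T_actual * K].view(E, T_actual, K)
print(view.shape)  # torch.Size([8, 24, 64])
```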
Test Plan
Command:
```bash
VLLM_MOE_DP_CHUNK_SIZE=32 VLLM_ALL2ALL_BACKEND="deepep_low_latency" VLLM_USE_DEEP_GEMM=1 \
  vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code \
  --data-parallel-size 2 --tensor-parallel-size 2 --enable-expert-parallel \
  --port 9010 --no-enable-prefix-caching
```

Results:
- lm_eval result
- 4 GPU H100 80GB HBM3
- moe log
- server log
Files Modified
Related Work
This PR builds upon and is related to: