
Conversation

@skyloevil (Contributor) commented Sep 14, 2025

Purpose

Optimize MoE (Mixture of Experts) performance by improving dispatch efficiency and workspace memory allocation. The changes address under-allocation issues during tensor parallel operations.

Changes

  1. Restrict all2all dispatch to TP leaders: Limit expensive all-to-all communication operations to tensor parallel leaders only, reducing redundant communication overhead.
  2. Improved workspace sizing (see the sketch after this list):
    - Base capacity calculation on aq.size(1) (backend T dimension)
    - Guard against under-allocation when TP > 1 by using observed M values
    - Prevent _resize_cache under-allocation during distributed operations
  3. Debug logging: Added one-time debug logging for total tokens and updated the workspace comments to aid performance analysis and troubleshooting.
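
For reference, the sizing rule in item 2 reduces to taking the larger of the backend-requested capacity and the configured/observed capacity. Below is a minimal, hypothetical sketch of that arithmetic; the function name and standalone signature are illustrative only, and the real change lives in the workspace-sizing code of batched_deep_gemm_moe.py quoted further down in this thread.

def effective_token_capacity(aq, a, max_num_tokens, num_dispatchers):
    # Backend-requested tokens-per-expert: for batched formats aq has
    # shape (E, T_backend, K), so dim 1 carries the per-expert capacity.
    t_backend = aq.size(1) if aq.dim() == 3 else 0

    # Configured/observed fallback; observed_m guards against
    # under-allocation when TP > 1 or during profiling runs.
    observed_m = a.size(0)
    if max_num_tokens is None:
        t_cfg = observed_m * num_dispatchers
    else:
        t_cfg = max(max_num_tokens, observed_m) * num_dispatchers

    # Honor whichever is larger so _resize_cache is never under-sized.
    return max(t_backend, t_cfg)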

Test Plan

Command:

VLLM_MOE_DP_CHUNK_SIZE=32 VLLM_ALL2ALL_BACKEND="deepep_low_latency" VLLM_USE_DEEP_GEMM=1  vllm serve Qwen/Qwen3-30B-A3B-FP8  --trust-remote-code  --data-parallel-size 2 --tensor-parallel-size 2 --enable-expert-parallel --port 9010  --no-enable-prefix-caching
lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3,tokenized_requests=False --limit 100

Results:

  • lm_eval result (screenshot)
  • 4× GPU H100 80GB HBM3 utilization (screenshot)
  • MoE log (screenshot)
  • Server log (screenshot)

Files Modified

  • vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py
  • vllm/model_executor/layers/fused_moe/deepep_ht_prepare_finalize.py
  • vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py
  • vllm/model_executor/layers/fused_moe/layer.py

Related Work

This PR builds upon and is related to:

…g log for total tokens and update workspace comment

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
…o avoid _resize_cache under-allocation when TP>1; log once when expanding max_num_tokens

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
…e(1) (backend T) and max with config/observed; fixes under-allocation during LL/PPLX and profile runs

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>

mergify bot commented Sep 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @skyloevil.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Sep 14, 2025

@gemini-code-assist bot left a comment

Code Review

This pull request introduces several key optimizations for Fused MoE layers, primarily by restricting all-to-all dispatch to Tensor Parallel leaders and improving workspace memory allocation. The logic for these optimizations appears correct and well-implemented. The added debug logging is also a welcome addition for better observability.

However, I've identified a significant maintainability concern with code duplication for the TP leader dispatch logic. This logic is copied in both the high-throughput and low-latency DeepEP finalization files. I've left a comment with a suggestion to refactor this into a shared utility to prevent potential future bugs. Addressing this will make the codebase more robust.

Comment on lines 198 to 206
# Only DP leader ranks (tp_rank == 0) should dispatch when TP > 1.
tp_world_size = get_tensor_model_parallel_world_size()
tp_rank_in_group = (get_tp_group().rank_in_group
                    if tp_world_size > 1 else 0)
if tp_world_size > 1 and tp_rank_in_group != 0:
    # Non-leader TP ranks send zero tokens to avoid duplicate dispatch.
    a1 = a1[:0]
    topk_ids = topk_ids[:0]
    topk_weights = topk_weights[:0]

Severity: high

This block of code for restricting dispatch to the Tensor Parallel leader is duplicated in vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py (lines 152-160).

Duplicating this logic creates a maintainability issue. If this logic needs to be updated in the future, it's easy to forget to update it in both places, which could lead to subtle and hard-to-debug inconsistencies between the high-throughput and low-latency MoE paths.

To improve maintainability and reduce the risk of future bugs, I recommend refactoring this logic into a shared utility function. For example:

# In a shared utility module (import paths assumed here; reuse whatever the
# call sites above already import)
import torch

from vllm.distributed import (get_tensor_model_parallel_world_size,
                              get_tp_group)


def restrict_dispatch_to_tp_leader(*tensors: torch.Tensor) -> tuple[torch.Tensor, ...]:
    """If the current rank is not a TP leader, returns empty tensors."""
    tp_world_size = get_tensor_model_parallel_world_size()
    if tp_world_size <= 1:
        return tensors

    tp_rank_in_group = get_tp_group().rank_in_group
    if tp_rank_in_group != 0:
        return tuple(t[:0] for t in tensors)
    
    return tensors

# In prepare_async methods
a1, topk_ids, topk_weights = restrict_dispatch_to_tp_leader(
    a1, topk_ids, topk_weights
)

This would centralize the logic, making the code cleaner and safer to maintain.

Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
mergify bot removed the needs-rebase label Sep 14, 2025
@robertgshaw2-redhat (Collaborator)

cc @tlrmchlsmth @bnellnm - I believe we are already using SP in this case, but correct me if I'm wrong.

@tlrmchlsmth (Member)

> cc @tlrmchlsmth @bnellnm - I believe we are already using SP in this case, but correct me if I'm wrong.

We use SP for DeepSeekV3 (introduced in #24134), but we need to do it for Qwen and other MoE models. I'll put up a PR today.

@bnellnm (Collaborator) commented Sep 15, 2025

I think @varun-sundar-rabindranath should check out this PR as well.

Comment on lines +272 to +303
# Tokens-per-expert capacity actually used by the backend for this
# call. For batched formats (DeepEP-LL / PPLX), aq has shape
# (E, T_backend, K).
# Prefer using aq.size(1) to avoid under-allocation during dummy/profile
# runs or when multiple dispatchers/ranks contribute tokens.
T_backend = aq.size(1) if aq.dim() == 3 else 0

# Fallback capacity from configuration/observation.
num_dispatchers = self.num_dispatchers
observed_M = a.size(0)
if self.max_num_tokens is None:
    T_cfg = observed_M * num_dispatchers
else:
    # Guard with observed_M to avoid under-estimation when TP > 1 or
    # during profiling runs.
    max_num_tokens = max(self.max_num_tokens, observed_M)
    if observed_M > self.max_num_tokens:
        with contextlib.suppress(Exception):
            logger.debug_once(
                "[MoE Debug] Increasing workspace max_num_tokens "
                "from configured=%d to observed=%d to avoid OOM. "
                "(num_dispatchers=%d, E=%d, N=%d, K=%d)",
                self.max_num_tokens, observed_M, num_dispatchers,
                num_experts, N, K)
    T_cfg = max_num_tokens * num_dispatchers

# Final capacity: honor backend's requested T if larger.
T_eff = max(T_backend, T_cfg)

workspace13 = (num_experts, T_eff, max(K, N))
workspace2 = (num_experts, T_eff, (N // 2))
output = (num_experts, T_eff, K)

I think these changes might be redundant with another PR that makes the workspaces effectively global. See #23693

Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>

mergify bot commented Sep 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @skyloevil.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Sep 16, 2025
Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
skyloevil requested a review from mgoin as a code owner on September 17, 2025 16:35
mergify bot removed the needs-rebase label Sep 17, 2025
Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
@skyloevil (Contributor, Author)

@varun-sundar-rabindranath PTAL, thanks.

mergify bot commented Sep 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @skyloevil.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Sep 23, 2025
@tlrmchlsmth (Member)

@skyloevil this should be covered by #24982, thank you!
