[MoE] Optimize fused MoE dispatch and workspace allocation with TP leader restriction #24831
Conversation
…g log for total tokens and update workspace comment Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
…o avoid _resize_cache under-allocation when TP>1; log once when expanding max_num_tokens Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
…e(1) (backend T) and max with config/observed; fixes under-allocation during LL/PPLX and profile runs Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces several key optimizations for Fused MoE layers, primarily by restricting all-to-all dispatch to Tensor Parallel leaders and improving workspace memory allocation. The logic for these optimizations appears correct and well-implemented. The added debug logging is also a welcome addition for better observability.
However, I've identified a significant maintainability concern with code duplication for the TP leader dispatch logic. This logic is copied in both the high-throughput and low-latency DeepEP finalization files. I've left a comment with a suggestion to refactor this into a shared utility to prevent potential future bugs. Addressing this will make the codebase more robust.
```python
# Only DP leader ranks (tp_rank == 0) should dispatch when TP > 1.
tp_world_size = get_tensor_model_parallel_world_size()
tp_rank_in_group = (get_tp_group().rank_in_group
                    if tp_world_size > 1 else 0)
if tp_world_size > 1 and tp_rank_in_group != 0:
    # Non-leader TP ranks send zero tokens to avoid duplicate dispatch.
    a1 = a1[:0]
    topk_ids = topk_ids[:0]
    topk_weights = topk_weights[:0]
```
This block of code for restricting dispatch to the Tensor Parallel leader is duplicated in vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py (lines 152-160).
Duplicating this logic creates a maintainability issue. If this logic needs to be updated in the future, it's easy to forget to update it in both places, which could lead to subtle and hard-to-debug inconsistencies between the high-throughput and low-latency MoE paths.
To improve maintainability and reduce the risk of future bugs, I recommend refactoring this logic into a shared utility function. For example:
```python
# In a shared utility module (imports assumed to come from vllm.distributed).
import torch

from vllm.distributed import (get_tensor_model_parallel_world_size,
                              get_tp_group)


def restrict_dispatch_to_tp_leader(
        *tensors: torch.Tensor) -> tuple[torch.Tensor, ...]:
    """If the current rank is not a TP leader, returns empty tensors."""
    tp_world_size = get_tensor_model_parallel_world_size()
    if tp_world_size <= 1:
        return tensors
    tp_rank_in_group = get_tp_group().rank_in_group
    if tp_rank_in_group != 0:
        return tuple(t[:0] for t in tensors)
    return tensors


# In prepare_async methods
a1, topk_ids, topk_weights = restrict_dispatch_to_tp_leader(
    a1, topk_ids, topk_weights
)
```

This would centralize the logic, making the code cleaner and safer to maintain.
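For clarity, here is a tiny standalone sketch (plain PyTorch with made-up shapes, not vLLM code) of what the `[:0]` slicing in the diff above does on a non-leader rank: the tensors become zero-token views, so the subsequent all-to-all dispatch receives nothing from that rank.

```python
import torch

# Hypothetical shapes: 16 tokens, hidden size 32, top-2 routing.
a1 = torch.randn(16, 32)
topk_ids = torch.randint(0, 8, (16, 2))
topk_weights = torch.rand(16, 2)

tp_rank_in_group = 1  # pretend this is a non-leader TP rank

if tp_rank_in_group != 0:
    # Zero-length slices along the token dimension: this rank now
    # contributes no tokens, so the dispatch is not duplicated across
    # TP ranks that share the same input.
    a1, topk_ids, topk_weights = a1[:0], topk_ids[:0], topk_weights[:0]

print(a1.shape, topk_ids.shape, topk_weights.shape)
# torch.Size([0, 32]) torch.Size([0, 2]) torch.Size([0, 2])
```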
Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
cc @tlrmchlsmth @bnellnm - I believe we are already using SP in this case, but correct me if I'm wrong.
We use SP for DeepSeekV3 (introduced in #24134), but we need to do it for Qwen and other MoE models. I'll put up a PR today.
I think @varun-sundar-rabindranath should check out this PR as well.
```python
# Tokens-per-expert capacity actually used by the backend for this
# call. For batched formats (DeepEP-LL / PPLX), aq has shape
# (E, T_backend, K).
# Prefer using aq.size(1) to avoid under-allocation during dummy/profile
# runs or when multiple dispatchers/ranks contribute tokens.
T_backend = aq.size(1) if aq.dim() == 3 else 0

# Fallback capacity from configuration/observation.
num_dispatchers = self.num_dispatchers
observed_M = a.size(0)
if self.max_num_tokens is None:
    T_cfg = observed_M * num_dispatchers
else:
    # Guard with observed_M to avoid under-estimation when TP > 1 or
    # during profiling runs.
    max_num_tokens = max(self.max_num_tokens, observed_M)
    if observed_M > self.max_num_tokens:
        with contextlib.suppress(Exception):
            logger.debug_once(
                "[MoE Debug] Increasing workspace max_num_tokens "
                "from configured=%d to observed=%d to avoid OOM. "
                "(num_dispatchers=%d, E=%d, N=%d, K=%d)",
                self.max_num_tokens, observed_M, num_dispatchers,
                num_experts, N, K)
    T_cfg = max_num_tokens * num_dispatchers

# Final capacity: honor backend's requested T if larger.
T_eff = max(T_backend, T_cfg)

workspace13 = (num_experts, T_eff, max(K, N))
workspace2 = (num_experts, T_eff, (N // 2))
output = (num_experts, T_eff, K)
```
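For intuition, here is a small arithmetic sketch of the capacity calculation above; all sizes (E, N, K and the token counts) are made-up example values, not numbers from this PR.

```python
num_experts, N, K = 8, 1024, 512   # hypothetical expert count and dims
num_dispatchers = 2
observed_M = 96        # tokens seen by this rank in the current call
configured_max = 64    # stand-in for self.max_num_tokens
T_backend = 128        # stand-in for aq.size(1) from the batched backend

# Guard the configured limit with the observed token count, then scale
# by the number of dispatchers.
T_cfg = max(configured_max, observed_M) * num_dispatchers   # 192
# Honor the backend's requested capacity if it is larger.
T_eff = max(T_backend, T_cfg)                               # 192

workspace13 = (num_experts, T_eff, max(K, N))   # (8, 192, 1024)
workspace2 = (num_experts, T_eff, N // 2)       # (8, 192, 512)
output = (num_experts, T_eff, K)                # (8, 192, 512)
print(workspace13, workspace2, output)
```

Without the observed_M guard, T_cfg would be only 64 * 2 = 128, and the workspace could be too small for the tokens actually delivered in this hypothetical call.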
I think these changes might be redundant with another PR that makes the workspaces effectively global. See #23693
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
Signed-off-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
@varun-sundar-rabindranath PTAL. Thanks.
This pull request has merge conflicts that must be resolved before it can be merged.
@skyloevil this should be covered by #24982, thank you!
Purpose
Optimize MoE (Mixture of Experts) performance by improving dispatch efficiency and workspace memory allocation. The changes address under-allocation issues during tensor parallel operations.
Changes
- Restrict the all-to-all dispatch to the TP leader rank (tp_rank == 0) when TP > 1, so non-leader ranks contribute zero tokens
- Base capacity calculation on aq.size(1) (backend T dimension)
- Guard against under-allocation when TP > 1 by using observed M values
- Prevent _resize_cache under-allocation during distributed operations (see the sketch after this list)
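To illustrate the under-allocation failure mode these changes guard against, here is a minimal sketch in plain PyTorch with hypothetical sizes; it is not vLLM's `_resize_cache`, it only mimics the general reshape-into-a-preallocated-buffer pattern.

```python
import torch

E, K = 8, 64
T_configured = 16   # capacity the workspace was originally sized for
T_actual = 24       # tokens actually delivered (e.g. TP > 1 or profiling)

# Workspace sized from the configured capacity only.
workspace = torch.empty(E * T_configured * K)

try:
    # Reinterpreting the flat buffer with the larger token count fails:
    # the buffer does not hold E * T_actual * K elements.
    workspace[: E * T_actual * K].view(E, T_actual, K)
except RuntimeError as err:
    print("under-allocation:", err)

# Sizing with max(configured, observed) avoids the problem.
workspace = torch.empty(E * max(T_configured, T_actual) * K)
view = workspace[: E * T_actual * K].view(E, T_actual, K)
print(view.shape)  # torch.Size([8, 24, 64])
```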
Test Plan
Command:
```bash
VLLM_MOE_DP_CHUNK_SIZE=32 VLLM_ALL2ALL_BACKEND="deepep_low_latency" VLLM_USE_DEEP_GEMM=1 \
  vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code \
  --data-parallel-size 2 --tensor-parallel-size 2 --enable-expert-parallel \
  --port 9010 --no-enable-prefix-caching
```

Results:
- lm_eval result
- 4 GPU H100 80GB HBM3
- moe log
- server log
Files Modified
Related Work
This PR builds upon and is related to: