[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and EP MoE #24134

tlrmchlsmth · 2025-09-03T00:34:39Z

Purpose

Currently, when running attention with TP and using --enable-expert-parallel, the MoE layers will do duplicate work when using DeepEP. In this case, the output of attention will be replicated across TP ranks and each token copy will be dispatched to the EP ranks it gets routed to, multiplying the amount of work by tp_size.

This PR avoids this duplicate work by ensuring the input to the MoE layer is sequence parallel instead of replicated.

Notes:

The performance bug applies to any MoE model but this PR only fixes it for DeepSeekV3
We can do better by replacing the all_reduce at the end of attention with a reduce_scatter. This reduces the amount of computation but is a little more invasive to the model definition since we need to handle the sharding of the residuals. This has an extra effect of de-duplicating the work done during layer norms (minor improvement).

Test Plan

lm_eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-R1-0528,base_url=http://infra-wide-ep-inference-gateway-istio.llm-d-wide-ep.svc.cluster.local/v1/completions,num_concurrent=1000,max_retries=3,tokenized_requests=False

Test Result

This PR

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9583|±  |0.0055|
|     |       |strict-match    |     5|exact_match|↑  |0.9560|±  |0.0056|

5b31cb1 (last good commit before #24119 landed):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9591|±  |0.0055|
|     |       |strict-match    |     5|exact_match|↑  |0.9568|±  |0.0056|

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

vllm/model_executor/models/deepseek_v2.py

gemini-code-assist

Code Review

This pull request correctly introduces sequence parallelism to the MoE layer in the DeepseekV2 model to prevent redundant computations when using both Tensor Parallelism and Expert Parallelism. The approach of chunking the input before the MoE layer and gathering the output afterward is sound. I've found one critical issue that could lead to a runtime error, which I've detailed in a specific comment.

vllm/model_executor/models/deepseek_v2.py

tlrmchlsmth · 2025-09-03T00:45:17Z

vllm/model_executor/models/deepseek_v2.py

+        # If using expert parallel, ensure the input to the experts is
+        # SP to avoid duplicate work.
+        # Not needed for pplx-kernels as it can handle duplicate input tokens.


@abcdabcd987 and @nandor could you double-check me here: Can pplx handle replicated input tokens in the TP attn + EP MoE case?

vllm/model_executor/models/deepseek_v2.py

robertgshaw2-redhat · 2025-09-03T00:46:36Z

vllm/model_executor/models/deepseek_v2.py

+        # If using expert parallel, ensure the input to the experts is
+        # SP to avoid duplicate work.
+        # Not needed for pplx-kernels as it can handle duplicate input tokens.
+        self.is_sequence_parallel = (envs.VLLM_ALL2ALL_BACKEND


I think we should call this use_sequence_parallel_mlp since we use seq parallelism for just the mlp layer here

Not sure I like this because the MLP being sequence parallel is kind of a side effect. And we need to pass it into the fused_moe layer for the chunking.

I'm not a fan of the sequence_parallel name though so definitely open to suggestions

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

vllm/model_executor/models/deepseek_v2.py

robertgshaw2-redhat · 2025-09-03T00:54:54Z

This looks good to me. Just left some comments on explaining the parallelism setup.

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

robertgshaw2-redhat · 2025-09-03T02:15:11Z

there are genuine failures in the CI related to DeepSeek MTP

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

mergify · 2025-09-03T16:40:02Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

tlrmchlsmth · 2025-09-03T19:30:16Z

Seeing some issues with CUDA graphs on:

ms-wide-ep-llm-d-modelservice-decode-0-1 vllm-worker-decode (EngineCore_7 pid=290) (VllmWorker TP0 pid=314) ERROR 09-03 19:20:29 [multiproc_executor.py:611]   File "/workspace/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1663, in process_chunk
ms-wide-ep-llm-d-modelservice-decode-0-1 vllm-worker-decode (EngineCore_7 pid=290) (VllmWorker TP0 pid=314) ERROR 09-03 19:20:29 [multiproc_executor.py:611]     staged_hidden_states.copy_(hidden_states, non_blocking=True)
ms-wide-ep-llm-d-modelservice-decode-0-1 vllm-worker-decode (EngineCore_7 pid=290) (VllmWorker TP0 pid=314) ERROR 09-03 19:20:29 [multiproc_executor.py:611] RuntimeError: output with shape [1, 7168] doesn't match the broadcast shape [0, 7168]

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

tlrmchlsmth · 2025-09-04T21:27:01Z

Seeing some issues with CUDA graphs on:

ms-wide-ep-llm-d-modelservice-decode-0-1 vllm-worker-decode (EngineCore_7 pid=290) (VllmWorker TP0 pid=314) ERROR 09-03 19:20:29 [multiproc_executor.py:611]   File "/workspace/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1663, in process_chunk
ms-wide-ep-llm-d-modelservice-decode-0-1 vllm-worker-decode (EngineCore_7 pid=290) (VllmWorker TP0 pid=314) ERROR 09-03 19:20:29 [multiproc_executor.py:611]     staged_hidden_states.copy_(hidden_states, non_blocking=True)
ms-wide-ep-llm-d-modelservice-decode-0-1 vllm-worker-decode (EngineCore_7 pid=290) (VllmWorker TP0 pid=314) ERROR 09-03 19:20:29 [multiproc_executor.py:611] RuntimeError: output with shape [1, 7168] doesn't match the broadcast shape [0, 7168]

It was a torch.compile issue. I ended up having to work around it by wrapping the sp chunking code in a custom op

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

nvpohanh · 2025-09-05T01:50:28Z

cc @weireweire

tlrmchlsmth · 2025-09-05T16:33:17Z

vllm/model_executor/models/deepseek_v2.py

+# Chunk x along the num_tokens axis for sequence parallelism
+# NOTE: This is wrapped in a torch custom op to work around the following issue:
+# The output tensor can have a sequence length 0 at small input sequence lengths
+# even though we explicitly pad to avoid this.
+def sequence_parallel_chunk(x: torch.Tensor) -> torch.Tensor:
+    tp_size = get_tensor_model_parallel_world_size()
+    tp_rank = get_tensor_model_parallel_rank()
+
+    # all_gather needs the sequence length to be divisible by tp_size
+    seq_len = x.size(0)
+    remainder = seq_len % tp_size
+    if remainder != 0:
+        pad_len = tp_size - remainder
+        x = nn.functional.pad(x, (0, 0, 0, pad_len))
+
+    chunk = x.shape[0] // tp_size
+    start = tp_rank * chunk
+    return torch.narrow(x, 0, start, chunk)
+
+
+def sequence_parallel_chunk_fake(x: torch.Tensor) -> torch.Tensor:
+    tp_size = get_tensor_model_parallel_world_size()
+    seq_len = cdiv(x.size(0), tp_size)
+    shape = list(x.shape)
+    shape[0] = seq_len
+    out = torch.empty(shape, dtype=x.dtype, device=x.device)
+    return out
+
+
+direct_register_custom_op(
+    op_name="sequence_parallel_chunk",
+    op_func=sequence_parallel_chunk,
+    mutates_args=[],
+    fake_impl=sequence_parallel_chunk_fake,
+    dispatch_key=current_platform.dispatch_key,
+    tags=(torch.Tag.needs_fixed_stride_order, ),
+)


cc @zou3519 @ProExpertProg in case you have better ideas than this wrap-it-in-a-custom-op hack

@tlrmchlsmth do you have the original error message and/or a stack trace?

Also, this custom operator is technically incorrect. The output is a view of the input, which means bad things can happen in the presence of mutation. I don't know if vLLM specifically will hit any of those issues, it depends on how it's being used.

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

…EP MoE (vllm-project#24134) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

…EP MoE (vllm-project#24134) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

tlrmchlsmth added 7 commits September 2, 2025 22:36

Simplified version of TP Attn + EP MoE perf bug fix

9cf5b21

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

turn off chunking

46e4ad7

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

simplify tp padding

90a2a32

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

replicated linear shared expert

85a1ff0

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

fixup moe chunking

08bace6

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

cleanup and plumbing

8174f70

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

cleanup and comments

02fbb25

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

mergify bot added the deepseek Related to DeepSeek models label Sep 3, 2025

robertgshaw2-redhat reviewed Sep 3, 2025

View reviewed changes

vllm/model_executor/models/deepseek_v2.py Outdated Show resolved Hide resolved

gemini-code-assist bot reviewed Sep 3, 2025

View reviewed changes

vllm/model_executor/models/deepseek_v2.py Outdated Show resolved Hide resolved

tlrmchlsmth commented Sep 3, 2025

View reviewed changes

robertgshaw2-redhat reviewed Sep 3, 2025

View reviewed changes

vllm/model_executor/models/deepseek_v2.py Show resolved Hide resolved

robertgshaw2-redhat reviewed Sep 3, 2025

View reviewed changes

review comment

76372f2

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

robertgshaw2-redhat reviewed Sep 3, 2025

View reviewed changes

vllm/model_executor/models/deepseek_v2.py Outdated Show resolved Hide resolved

robertgshaw2-redhat approved these changes Sep 3, 2025

View reviewed changes

tlrmchlsmth added 2 commits September 3, 2025 01:08

Merge branch 'main' into tp_attn_fix_simple

3b6b3c7

comments

249eb5f

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 3, 2025

fixes for DS decoder layer constructor changes

3887a51

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

mergify bot added the speculative-decoding label Sep 3, 2025

tlrmchlsmth added 2 commits September 3, 2025 15:31

fixup

31e0c81

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Merge branch 'main' into tp_attn_fix_simple

c8fb93c

mergify bot added the needs-rebase label Sep 3, 2025

Merge branch 'main' into tp_attn_fix_simple

c574cab

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

mergify bot removed the needs-rebase label Sep 3, 2025

tlrmchlsmth added 4 commits September 4, 2025 21:11

rm contiguous

898a1c4

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

comment + rename for moe chunking changes

e6a5908

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Merge branch 'main' into tp_attn_fix_simple

f6455a9

fixup comments

20a3451

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

tlrmchlsmth added 2 commits September 5, 2025 01:36

fixup

ba075dd

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Merge branch 'main' into tp_attn_fix_simple

cebbc89

Merge branch 'main' into tp_attn_fix_simple

5d8aaa3

tlrmchlsmth commented Sep 5, 2025

View reviewed changes

tlrmchlsmth and others added 4 commits September 7, 2025 17:14

Merge branch 'main' into tp_attn_fix_simple

a7999c7

handle change from vllm-project#23024

d92eae0

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Merge branch 'main' into tp_attn_fix_simple

247ce5b

fixup

cdd4a39

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

tlrmchlsmth requested a review from luccafong as a code owner September 8, 2025 17:25

Merge branch 'main' into tp_attn_fix_simple

f6e5905

simon-mo merged commit 955c624 into vllm-project:main Sep 9, 2025
40 of 42 checks passed

eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025

[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and …

f79c6a1

…EP MoE (vllm-project#24134) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025

[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and …

3269ff8

…EP MoE (vllm-project#24134) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

tlrmchlsmth mentioned this pull request Sep 15, 2025

[MoE] Optimize fused MoE dispatch and workspace allocation with TP leader restriction #24831

Closed

tlrmchlsmth mentioned this pull request Sep 23, 2025

[Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models #24982

Merged

5 tasks

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025

[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and …

bbec505

…EP MoE (vllm-project#24134) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

tlrmchlsmth mentioned this pull request Sep 30, 2025

[Bugfix][DeepSeek Fix config used for DeepseekV2 Eagle #25953

Closed

tlrmchlsmth mentioned this pull request Oct 30, 2025

[Model] Introduce Kimi Linear to vLLM #27809

Merged

5 tasks

Uh oh!

[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and EP MoE #24134

[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and EP MoE #24134

Conversation

tlrmchlsmth commented Sep 3, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Notes:

Test Plan

Test Result

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

tlrmchlsmth Sep 3, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

robertgshaw2-redhat Sep 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tlrmchlsmth Sep 3, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

robertgshaw2-redhat commented Sep 3, 2025

Uh oh!

robertgshaw2-redhat commented Sep 3, 2025

Uh oh!

mergify bot commented Sep 3, 2025

Uh oh!

tlrmchlsmth commented Sep 3, 2025

Uh oh!

tlrmchlsmth commented Sep 4, 2025

Uh oh!

nvpohanh commented Sep 5, 2025

Uh oh!

tlrmchlsmth Sep 5, 2025

Choose a reason for hiding this comment

Uh oh!

zou3519 Sep 11, 2025

Choose a reason for hiding this comment

Uh oh!

zou3519 Sep 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

tlrmchlsmth commented Sep 3, 2025 •

edited by github-actions bot

Loading

robertgshaw2-redhat Sep 3, 2025 •

edited

Loading

zou3519 Sep 11, 2025 •

edited

Loading