[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and EP MoE #24134
Conversation
Code Review
This pull request correctly introduces sequence parallelism to the MoE layer in the DeepseekV2 model to prevent redundant computations when using both Tensor Parallelism and Expert Parallelism. The approach of chunking the input before the MoE layer and gathering the output afterward is sound. I've found one critical issue that could lead to a runtime error, which I've detailed in a specific comment.
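To make the pattern concrete, here is a minimal sketch of the chunk-before-MoE / all-gather-after flow. It assumes vLLM's tensor-parallel helpers (get_tensor_model_parallel_rank/world_size, tensor_model_parallel_all_gather); moe stands in for the fused MoE layer and padding is omitted for brevity, so this is an illustration rather than the PR's exact code:

import torch

from vllm.distributed import (get_tensor_model_parallel_rank,
                              get_tensor_model_parallel_world_size,
                              tensor_model_parallel_all_gather)


def moe_forward_sequence_parallel(hidden_states: torch.Tensor,
                                  moe: torch.nn.Module) -> torch.Tensor:
    # hidden_states is replicated across TP ranks after attention's all_reduce.
    tp_size = get_tensor_model_parallel_world_size()
    tp_rank = get_tensor_model_parallel_rank()

    # Keep only this rank's 1/tp_size slice of the tokens (the real code pads
    # so the sequence length divides evenly).
    chunk = hidden_states.shape[0] // tp_size
    local = hidden_states.narrow(0, tp_rank * chunk, chunk)

    # Each rank dispatches only its own tokens to the experts.
    local_out = moe(local)

    # Re-assemble the full sequence so the rest of the layer is unchanged.
    return tensor_model_parallel_all_gather(local_out, dim=0)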
# If using expert parallel, ensure the input to the experts is
# SP to avoid duplicate work.
# Not needed for pplx-kernels as it can handle duplicate input tokens.
@abcdabcd987 and @nandor could you double-check me here: Can pplx handle replicated input tokens in the TP attn + EP MoE case?
# If using expert parallel, ensure the input to the experts is
# SP to avoid duplicate work.
# Not needed for pplx-kernels as it can handle duplicate input tokens.
self.is_sequence_parallel = (envs.VLLM_ALL2ALL_BACKEND
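For readers of this thread: the quoted condition is truncated. A rough sketch of the intent, under the assumption (not a verbatim copy of the diff) that the flag is enabled for the DeepEP all2all backends when expert parallelism is on, would be:

# Sketch only -- the backend names and the parallel_config attribute are
# assumptions, not the exact diff contents.
self.is_sequence_parallel = (envs.VLLM_ALL2ALL_BACKEND
                             in ("deepep_high_throughput",
                                 "deepep_low_latency")
                             and parallel_config.enable_expert_parallel)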
I think we should call this use_sequence_parallel_mlp, since we use sequence parallelism for just the MLP layer here.
Not sure I like this because the MLP being sequence parallel is kind of a side effect. And we need to pass it into the fused_moe layer for the chunking.
I'm not a fan of the sequence_parallel name though so definitely open to suggestions
This looks good to me. Just left some comments on explaining the parallelism setup.
There are genuine failures in the CI related to DeepSeek MTP.
This pull request has merge conflicts that must be resolved before it can be merged.
Seeing some issues with CUDA graphs on:
It was a torch.compile issue. I ended up having to work around it by wrapping the SP chunking code in a custom op.
cc @weireweire |
# Chunk x along the num_tokens axis for sequence parallelism
# NOTE: This is wrapped in a torch custom op to work around the following issue:
# The output tensor can have a sequence length 0 at small input sequence lengths
# even though we explicitly pad to avoid this.
def sequence_parallel_chunk(x: torch.Tensor) -> torch.Tensor:
    tp_size = get_tensor_model_parallel_world_size()
    tp_rank = get_tensor_model_parallel_rank()

    # all_gather needs the sequence length to be divisible by tp_size
    seq_len = x.size(0)
    remainder = seq_len % tp_size
    if remainder != 0:
        pad_len = tp_size - remainder
        x = nn.functional.pad(x, (0, 0, 0, pad_len))

    chunk = x.shape[0] // tp_size
    start = tp_rank * chunk
    return torch.narrow(x, 0, start, chunk)


def sequence_parallel_chunk_fake(x: torch.Tensor) -> torch.Tensor:
    tp_size = get_tensor_model_parallel_world_size()
    seq_len = cdiv(x.size(0), tp_size)
    shape = list(x.shape)
    shape[0] = seq_len
    out = torch.empty(shape, dtype=x.dtype, device=x.device)
    return out


direct_register_custom_op(
    op_name="sequence_parallel_chunk",
    op_func=sequence_parallel_chunk,
    mutates_args=[],
    fake_impl=sequence_parallel_chunk_fake,
    dispatch_key=current_platform.dispatch_key,
    tags=(torch.Tag.needs_fixed_stride_order, ),
)
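For reference, a hedged usage sketch: assuming direct_register_custom_op places the op under the torch.ops.vllm namespace (as for other vLLM custom ops), the model code calls the registered op rather than the plain Python function, so torch.compile traces it through the fake implementation above.

# Usage sketch -- the tensor shapes and tp_size are illustrative only.
import torch

hidden_states = torch.randn(1000, 4096, device="cuda", dtype=torch.bfloat16)
local_chunk = torch.ops.vllm.sequence_parallel_chunk(hidden_states)
# With tp_size = 8, the 1000 tokens are padded to 1008 and each rank keeps 126.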
cc @zou3519 @ProExpertProg in case you have better ideas than this wrap-it-in-a-custom-op hack
@tlrmchlsmth do you have the original error message and/or a stack trace?
Also, this custom operator is technically incorrect. The output is a view of the input, which means bad things can happen in the presence of mutation. I don't know if vLLM specifically will hit any of those issues; it depends on how it's being used.
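One possible way to address the aliasing, sketched here as a suggestion rather than what the PR does, is to return a copy of the local slice so the op's output never aliases its input, at the cost of an extra copy of the chunk:

# Suggestion sketch (not the PR's code): clone the narrowed slice so the
# registered op returns a fresh tensor instead of a view of its input.
def sequence_parallel_chunk(x: torch.Tensor) -> torch.Tensor:
    tp_size = get_tensor_model_parallel_world_size()
    tp_rank = get_tensor_model_parallel_rank()

    remainder = x.size(0) % tp_size
    if remainder != 0:
        x = nn.functional.pad(x, (0, 0, 0, tp_size - remainder))

    chunk = x.shape[0] // tp_size
    # .clone() forces a copy, so mutations of the input can no longer leak
    # into the op's output (and vice versa).
    return torch.narrow(x, 0, tp_rank * chunk, chunk).clone()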
Purpose
Currently, when running attention with TP and using --enable-expert-parallel, the MoE layers do duplicate work when using DeepEP. In this case, the output of attention is replicated across TP ranks, and each copy of a token is dispatched to the EP ranks it gets routed to, multiplying the amount of work by tp_size.

This PR avoids this duplicate work by ensuring the input to the MoE layer is sequence parallel rather than replicated.
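As a rough illustration of the scaling (the numbers below are hypothetical, not measured):

# Hypothetical token counts to illustrate the tp_size multiplication.
tp_size = 8          # attention TP ranks
num_tokens = 1024    # tokens in the batch
top_k = 8            # experts each token is routed to

# Before: attention output is replicated, so every TP rank dispatches every
# token and the EP ranks see tp_size copies of each one.
dispatched_before = tp_size * num_tokens * top_k   # 65536 token-expert pairs

# After: each rank dispatches only its 1/tp_size slice of the sequence.
dispatched_after = tp_size * (num_tokens // tp_size) * top_k   # 8192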
Notes:

- An alternative approach is to replace the all_reduce at the end of attention with a reduce_scatter (see the sketch after this list). This reduces the amount of computation but is a little more invasive to the model definition, since we need to handle the sharding of the residuals. It has the extra effect of de-duplicating the work done during the layer norms (a minor improvement).
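A minimal sketch of that alternative, using plain torch.distributed rather than vLLM's wrappers (group handling and residual bookkeeping omitted; this is not what the PR implements):

import torch
import torch.distributed as dist


def attn_output_reduce_scatter(partial_out: torch.Tensor,
                               tp_group: dist.ProcessGroup) -> torch.Tensor:
    # partial_out holds this rank's partial sums, shape [num_tokens, hidden];
    # num_tokens is assumed divisible by the TP world size.
    tp_size = dist.get_world_size(tp_group)
    shard = torch.empty(partial_out.shape[0] // tp_size,
                        partial_out.shape[1],
                        dtype=partial_out.dtype,
                        device=partial_out.device)
    # Sum across ranks and scatter token shards in a single collective,
    # replacing the all_reduce at the end of attention.
    dist.reduce_scatter_tensor(shard, partial_out, group=tp_group)
    # Residual add and layer norm would now run on 1/tp_size of the tokens.
    return shard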
Test Plan

Test Result
This PR
5b31cb1 (last good commit before #24119 landed):