Motivation.
The primary motivation for this proposal is to significantly improve the prefill performance of the DeepSeek Large Scale Expert Parallelism (EP) model. Our analysis identified several bottlenecks in the existing implementation, particularly in communication overhead and inefficient operational ordering within the attention and Mixture of Experts (MoE) layers. By optimizing parallelism strategies, reducing communication volumes, and reordering key operations, we can achieve substantial performance gains during the prefill stage.
Proposed Change.
We propose the following four changes to optimize prefill performance (a minimal sketch of the communication patterns in points 2-4 follows the list):
- Modified Parallelism Strategy for Shared Experts: The parallelism strategy for shared experts will be changed from following the attention's Tensor Parallelism (TP) to full Data Parallelism (DP). This involves replicating the shared expert weights on each card, eliminating cross-device communication for this component during the forward pass.
- Optimized Communication for Attention Output: In the attention mechanism, the AllReduce operation currently used for the output projection will be replaced with ReduceScatter. This change reduces the overall volume of data transferred between devices, leading to a direct improvement in communication efficiency.
- Delayed AllGather in MoE Layers: The AllGather operation that follows the Mixture of Experts (MoE) layer will be moved to after the QKV down-projection. Because the activations remain partitioned along the token dimension, this reduces the computational workload of the down-projection and also decreases the volume of data that needs to be communicated afterward.
- Optimized W8A8 Quantization Order for MoE: For W8A8 quantized MoE layers, the order of operations will be reversed: activations will be quantized before the All2All communication rather than after. This simple change reduces the communication payload by nearly 50%, yielding a significant performance gain.
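To make the data flow concrete, below is a minimal sketch of points 2-4 using plain torch.distributed collectives. It is illustrative only and not the code from the PRs: the tensor names and shapes, the tp_size argument, and the per-token int8 quantization are assumptions, and the actual implementation uses vLLM-Ascend's communication ops and W8A8 quantization kernels on the default or TP/EP process groups rather than these generic calls.

```python
# Illustrative sketch of the proposed communication patterns (not the PR code).
# Assumes an initialized torch.distributed process group and that the token
# dimension is divisible by tp_size / world size.
import torch
import torch.distributed as dist


def attention_output_projection_reduce_scatter(x_partial: torch.Tensor,
                                               tp_size: int) -> torch.Tensor:
    """Point 2: replace AllReduce with ReduceScatter on the o_proj output.

    `x_partial` is this rank's partial sum of shape [num_tokens, hidden_size].
    AllReduce would leave every rank with the full tensor; ReduceScatter keeps
    only a [num_tokens // tp_size, hidden_size] token shard per rank, so less
    data is moved and downstream ops stay token-partitioned.
    """
    out = torch.empty(x_partial.shape[0] // tp_size, x_partial.shape[1],
                      dtype=x_partial.dtype, device=x_partial.device)
    dist.reduce_scatter_tensor(out, x_partial, op=dist.ReduceOp.SUM)
    return out


def delayed_all_gather_after_down_proj(x_shard: torch.Tensor,
                                       w_down: torch.Tensor,
                                       tp_size: int) -> torch.Tensor:
    """Point 3: run the QKV down-projection on the token-sharded activation
    first, then AllGather. The GEMM sees 1/tp_size of the tokens, and the
    AllGather moves the smaller down-projected output instead of the full
    hidden states."""
    down_shard = x_shard @ w_down  # [num_tokens / tp_size, down_dim]
    down_full = torch.empty(down_shard.shape[0] * tp_size, down_shard.shape[1],
                            dtype=down_shard.dtype, device=down_shard.device)
    dist.all_gather_into_tensor(down_full, down_shard)
    return down_full


def quantize_then_all_to_all(x_bf16: torch.Tensor):
    """Point 4: quantize activations to int8 *before* the All2All dispatch.

    Sending int8 payloads (1 byte/element) plus small per-token scales instead
    of bf16 (2 bytes/element) roughly halves the All2All traffic.
    """
    # Per-token dynamic quantization (stand-in for the real W8A8 kernel).
    scale = x_bf16.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    x_int8 = torch.clamp(torch.round(x_bf16 / scale), -128, 127).to(torch.int8)

    x_int8_out = torch.empty_like(x_int8)
    scale_out = torch.empty_like(scale)
    dist.all_to_all_single(x_int8_out, x_int8)  # halved payload vs. bf16
    dist.all_to_all_single(scale_out, scale)    # small per-token scales
    return x_int8_out, scale_out
```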
Implementation Details:
- The changes described in points 1, 2, and 3 are implemented in PR [main][prefill optimization] Optimize parallel strategies to reduce communication overhead #2198.
- The change described in point 4 is implemented in PR [main][Prefill Perf] Optimize Quantized MoE Performance by Reducing All2All Communication #2195.
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response