
[RFC]: Prefill Performance Optimization for DeepSeek Large Scale EP #3012

@SlightwindSec

Description

Motivation.

The primary motivation for this proposal is to significantly improve prefill performance for DeepSeek served with large-scale Expert Parallelism (EP). Our analysis identified several bottlenecks in the existing implementation, particularly communication overhead and inefficient operation ordering within the attention and Mixture of Experts (MoE) layers. By optimizing parallelism strategies, reducing communication volume, and reordering key operations, we can achieve substantial performance gains during the prefill stage.

Proposed Change.

We propose the following four changes to optimize prefill performance:

  1. Modified Parallelism Strategy for Shared Experts: The parallelism strategy for the shared experts will change from following the attention's Tensor Parallelism (TP) to full Data Parallelism (DP). The shared expert weights are replicated on each card, which eliminates cross-device communication for this component during the forward pass (first sketch under Implementation Details below).
  2. Optimized Communication for Attention Output: The AllReduce currently applied to the attention output projection will be replaced with ReduceScatter. Each rank then keeps only its 1/TP token shard instead of the full tensor, which reduces the data transferred between devices and leaves the activations partitioned along the token dimension (second sketch below).
  3. Delayed AllGather in MoE Layers: The AllGather that currently follows the Mixture of Experts (MoE) layer will be moved to after the QKV down-projection. Because the activations are still partitioned along the token dimension, the down-projection processes only a shard of the tokens on each rank, and the subsequent AllGather moves the smaller down-projected activations instead of the full hidden states, reducing both compute and communication (third sketch below).
  4. Optimized W8A8 Quantization Order for MoE: For W8A8-quantized MoE layers, the order of operations will be reversed: activations are quantized before the All2All communication rather than after. Since INT8 activations are half the size of their 16-bit counterparts, this reduces the communication payload by nearly 50% (fourth sketch below).
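Implementation Details:

The sketches below are minimal PyTorch approximations of the four changes, assuming `torch.distributed` collectives over the relevant TP/EP process groups; all module and function names are illustrative, not the actual code.

A sketch of change 1: shared experts with fully replicated (DP) weights, assuming a standard SwiGLU block, so the forward pass completes locally with no TP collective:

```python
import torch
import torch.nn as nn


class SharedExpertsDP(nn.Module):
    """Shared experts with fully replicated (DP) weights.

    Every rank holds the complete gate/up/down weights, so the forward
    pass is purely local; no AllReduce over the TP group is needed to
    reassemble the output, unlike the TP-sharded variant.
    """

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # Full, unsharded weights on every card.
        self.gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up_proj(x).chunk(2, dim=-1)
        # SwiGLU; the result is already complete on this rank, so no
        # cross-device communication follows.
        return self.down_proj(nn.functional.silu(gate) * up)
```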

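A sketch of change 2: replacing the AllReduce after the attention output projection with a ReduceScatter, which moves less data per rank and leaves the activations partitioned along the token dimension (which change 3 builds on):

```python
import torch
import torch.distributed as dist


def o_proj_allreduce(partial_out: torch.Tensor) -> torch.Tensor:
    # Before: every rank ends up with the full [num_tokens, hidden] tensor.
    dist.all_reduce(partial_out)  # SUM across the TP group
    return partial_out


def o_proj_reduce_scatter(partial_out: torch.Tensor, tp_size: int) -> torch.Tensor:
    # After: each rank keeps only a 1/tp_size slice of the tokens, so the
    # per-rank payload shrinks accordingly. num_tokens must be divisible
    # by tp_size (padding is assumed to be handled elsewhere).
    num_tokens = partial_out.shape[0]
    shard = torch.empty_like(partial_out[: num_tokens // tp_size])
    dist.reduce_scatter_tensor(shard, partial_out)
    return shard  # [num_tokens / tp_size, hidden], partitioned along tokens
```

Note that a ReduceScatter followed by a later AllGather is mathematically equivalent to the original AllReduce, which is what makes change 3 a pure reordering.
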
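A sketch of change 3, assuming DeepSeek-style MLA down-projections (`q_a_proj` and `kv_a_proj` are illustrative names): the down-projection runs on the local token shard, and the AllGather is issued afterwards on the much narrower latents:

```python
import torch
import torch.distributed as dist


def qkv_down_proj_then_allgather(
    hidden_shard: torch.Tensor,   # [tokens / tp_size, hidden], token-sharded
    q_a_proj: torch.nn.Linear,    # hidden -> q_lora_rank, a narrow latent
    kv_a_proj: torch.nn.Linear,   # hidden -> kv_lora_rank + RoPE dims
    tp_size: int,
):
    # Down-project the local token shard only: 1/tp_size of the FLOPs
    # compared with projecting fully gathered hidden states.
    q_latent = q_a_proj(hidden_shard)
    kv_latent = kv_a_proj(hidden_shard)

    def gather(shard: torch.Tensor) -> torch.Tensor:
        # The AllGather now moves the narrow latent instead of the wide
        # hidden states, shrinking the communicated bytes roughly by the
        # ratio hidden_size / latent_size.
        out = torch.empty(
            (shard.shape[0] * tp_size, shard.shape[1]),
            dtype=shard.dtype, device=shard.device,
        )
        dist.all_gather_into_tensor(out, shard)
        return out

    return gather(q_latent), gather(kv_latent)
```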

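A sketch of change 4: symmetric per-token INT8 quantization applied before the All2All dispatch, assuming BF16 inputs; the actual quantization kernel and dispatch path may differ:

```python
import torch
import torch.distributed as dist


def quantize_then_all2all(x: torch.Tensor):
    """x: [num_tokens, hidden] BF16 activations, already permuted for dispatch;
    num_tokens is assumed divisible by the EP world size."""
    # Symmetric per-token INT8 quantization (scales kept in FP32).
    scale = x.float().abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    x_q = torch.clamp(torch.round(x.float() / scale), -128, 127).to(torch.int8)

    # Dispatch the INT8 payload (1 byte/element) instead of BF16
    # (2 bytes/element); the per-token FP32 scales are a small extra,
    # hence the "nearly 50%" savings.
    x_q_out = torch.empty_like(x_q)
    scale_out = torch.empty_like(scale)
    dist.all_to_all_single(x_q_out, x_q)
    dist.all_to_all_single(scale_out, scale)
    return x_q_out, scale_out
```

On the receiving side, the INT8 activations and their scales feed the W8A8 expert GEMMs directly, so no re-quantization is needed after the dispatch.
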
Feedback Period.

No response

CC List.

No response

Any Other Things.

No response
