[RFC]: ATTN-FFN Disaggregation for MoE Models #22799

@chopper0126

Motivation.

In large-scale MoE inference, the sparsity of expert activation during decoding drives ever wider expert parallelism (EP). Prior designs (e.g., DeepEP[1]) improve throughput at scale, but their scalability is limited because EP shards are placed across data-parallel (DP) ranks. Recently, Attention–FFN disaggregation (AFD) has been proposed by ByteDance[2], StepFun[3], and Huawei[4]. The motivation is straightforward: the attention phase is memory-bound, whereas the FFN/expert phase is compute-bound, so a single homogeneous deployment cannot optimize both at once. Module-wise heterogeneous placement therefore becomes more beneficial as EP continues to scale.
Based on these production insights, we propose introducing AFD in vLLM: decouple Attention and FFN/experts at the resource level so that they can scale independently, and overlap communication with computation in a “perfectly balanced” pipeline to improve throughput. The initial revision targets an eager-mode MVP without custom kernels: it runs multiple DP replicas on the Attention side and EP replicas on the MoE side, and enables cross-node M2N routing (M Attention ranks to N FFN ranks) with correctness guarantees. Subsequently, we will provide stable AFD interfaces for communication, load balancing, and elasticity.
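
To make the "perfectly balanced" pipeline idea concrete, here is a minimal timing sketch. It is a toy model, not the proposed implementation: it assumes one attention worker and one FFN worker, constant per-step times, and transfers that are fully asynchronous and never contend. The function name `simulate_afd_pipeline` and all parameters are hypothetical.

```python
from typing import List


def simulate_afd_pipeline(
    num_layers: int,
    num_microbatches: int,
    t_attn: float,
    t_ffn: float,
    t_comm: float,
) -> float:
    """Return the finish time of a toy attention/FFN pipeline.

    Assumptions (not from the RFC): one attention worker and one FFN worker,
    each handling one micro-batch at a time; transfers are asynchronous and
    contention-free; every step takes a constant amount of time.
    """
    if num_microbatches < 1:
        raise ValueError("need at least one micro-batch")

    attn_free = 0.0  # time at which the attention worker becomes idle
    ffn_free = 0.0   # time at which the FFN worker becomes idle
    # Time at which each micro-batch is back on the attention side.
    ready: List[float] = [0.0] * num_microbatches

    for _ in range(num_layers):
        for mb in range(num_microbatches):
            # Attention starts once the worker is idle and the data is back.
            attn_end = max(attn_free, ready[mb]) + t_attn
            attn_free = attn_end
            # FFN starts once the worker is idle and the transfer has arrived.
            ffn_end = max(ffn_free, attn_end + t_comm) + t_ffn
            ffn_free = ffn_end
            # The result travels back to the attention side.
            ready[mb] = ffn_end + t_comm
    return max(ready)


serial = simulate_afd_pipeline(2, 1, t_attn=1.0, t_ffn=1.0, t_comm=0.5)
overlapped = simulate_afd_pipeline(2, 3, t_attn=1.0, t_ffn=1.0, t_comm=0.5)
print(serial / 1, overlapped / 3)  # time per micro-batch in each setting
```

With these example numbers, a single micro-batch leaves both workers idle during every transfer and finishes two layers in 6.0 time units. Three micro-batches finish in 8.0 time units, because the attention worker stays busy while the other micro-batches are in flight or in the FFN stage. In practice, per-micro-batch times depend on the micro-batch size, which this sketch ignores.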

[1] https://github.com/deepseek-ai/DeepEP
[2] Zhu R, Jiang Z, Jin C, et al. MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism[J]. arXiv preprint arXiv:2504.02263, 2025.
[3] Wang B, Wang B, Wan C, et al. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding[J]. arXiv preprint arXiv:2507.19427, 2025.
[4] Xiao A, He B, Zhang B, et al. xDeepServe: Model-as-a-Service on Huawei CloudMatrix384[J]. arXiv preprint arXiv:2508.02520, 2025.

Proposed Change.

We propose to start with two core changes:

  1. AFD on vLLM: provide initial AFD capabilities for DeepSeek-V3 and other MoE models in eager mode, using torch communication APIs for M2N (design doc).

  2. AFConnector interface for XPUs: support custom AF communication backends, e.g., stepmesh (design doc, #issues).
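
As an illustration of what M2N routing has to compute, the sketch below groups the tokens of one Attention rank by destination FFN rank. It is a plain-Python sketch under stated assumptions, not the design in the linked doc: experts are assumed to be placed in contiguous, equally sized blocks across FFN ranks, and the names `expert_to_ffn_rank` and `build_dispatch_plan` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def expert_to_ffn_rank(expert_id: int, num_experts: int, num_ffn_ranks: int) -> int:
    """Map an expert to its FFN rank under contiguous block placement (assumption)."""
    if num_experts % num_ffn_ranks != 0:
        raise ValueError("this sketch requires experts to divide evenly across ranks")
    experts_per_rank = num_experts // num_ffn_ranks
    return expert_id // experts_per_rank


def build_dispatch_plan(
    topk_ids: List[List[int]], num_experts: int, num_ffn_ranks: int
) -> Dict[int, List[Tuple[int, int]]]:
    """Group (token_index, expert_id) pairs by destination FFN rank.

    `topk_ids[i]` holds the expert ids selected by the router for token i.
    """
    plan: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for token_idx, experts in enumerate(topk_ids):
        for expert_id in experts:
            rank = expert_to_ffn_rank(expert_id, num_experts, num_ffn_ranks)
            plan[rank].append((token_idx, expert_id))
    return dict(plan)


# Three tokens, top-2 routing, 8 experts spread over 4 FFN ranks.
plan = build_dispatch_plan([[0, 5], [2, 3], [7, 1]], num_experts=8, num_ffn_ranks=4)
for rank in sorted(plan):
    print(rank, plan[rank])
```

Each of the M Attention ranks would build such a plan for its own tokens and send every group to the matching one of the N FFN ranks, which is where the M2N pattern comes from. The actual transport (P2P, collectives, or a custom backend) is left open here.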

Roadmap

  • AFD on vLLM (MVP) @chopper0126 @xiaoshudian555 @wabluy

    • Disaggregated MoE deployment:
      • Model side: redefine model structure, weight loading, and module graph;
      • Service side: define a global topology to launch a standalone MoE service with torchrun, and enable communication with Attention nodes (temporarily via P2P);
    • Accuracy evaluation: validate the correctness of outputs;
    • Pipeline support: configurable micro-batch size and asynchronous M2N communications;
    • Performance optimizations: integrate features such as MTP and support custom load-balancing strategies;
    • Extensibility: support custom MoE models and refine the codebase for AFConnector.
  • AFConnector interface @jiangkuaixue123

    • Collect the IPs and ranks of attention and FFN instances and initialize the necessary communication resources;
    • Tensor-level async send/recv and all-gather/reduce-scatter APIs to transfer tensors between attention and FFN instances;
    • Hardware-agnostic interface to support various hardware backends, communication libraries, and NICs;
    • Plugin mechanism based on Python setuptools entry points.
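
The AFConnector bullets above could map onto an abstract base class plus entry-point discovery roughly as follows. This is a hedged sketch, not the interface from the design doc: the class names `AFConnectorBase` and `LoopbackConnector`, the method names, and the entry-point group `vllm.afd_connectors` are all hypothetical, and the collective APIs are omitted for brevity. The loopback backend passes plain Python objects in-process so the example runs without any hardware.

```python
import queue
from abc import ABC, abstractmethod
from concurrent.futures import Future
from importlib import metadata
from typing import Any, Dict, List, Tuple


class AFConnectorBase(ABC):
    """Hypothetical hardware-agnostic interface between attention and FFN instances."""

    @abstractmethod
    def init_comm(self, rank: int, peers: List[Tuple[str, int]]) -> None:
        """Record this instance's rank and the (ip, rank) pairs of its peers."""

    @abstractmethod
    def send_async(self, dst_rank: int, payload: Any) -> Future:
        """Start sending `payload` to `dst_rank`; the future completes when done."""

    @abstractmethod
    def recv_async(self, src_rank: int) -> Future:
        """Start receiving from `src_rank`; the future resolves to the payload."""


class LoopbackConnector(AFConnectorBase):
    """In-process test backend: one queue per (src, dst) pair, shared by all instances."""

    _mailboxes: Dict[Tuple[int, int], queue.Queue] = {}

    @classmethod
    def _mailbox(cls, src: int, dst: int) -> queue.Queue:
        return cls._mailboxes.setdefault((src, dst), queue.Queue())

    def init_comm(self, rank: int, peers: List[Tuple[str, int]]) -> None:
        self.rank = rank
        self.peers = peers

    def send_async(self, dst_rank: int, payload: Any) -> Future:
        fut: Future = Future()
        self._mailbox(self.rank, dst_rank).put(payload)
        fut.set_result(None)  # the loopback send completes immediately
        return fut

    def recv_async(self, src_rank: int) -> Future:
        fut: Future = Future()
        # A real backend would not block here; the loopback simply waits briefly.
        fut.set_result(self._mailbox(src_rank, self.rank).get(timeout=1.0))
        return fut


def discover_connectors(group: str = "vllm.afd_connectors") -> Dict[str, Any]:
    """Return installed entry points in `group` (hypothetical group name), keyed by name."""
    eps = metadata.entry_points()
    # Newer Python versions expose .select(); older ones return a dict of lists.
    selected = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
    return {ep.name: ep for ep in selected}


attn = LoopbackConnector()
attn.init_comm(0, [("127.0.0.1", 1)])
ffn = LoopbackConnector()
ffn.init_comm(1, [("127.0.0.1", 0)])
attn.send_async(1, [1.0, 2.0]).result()
print(ffn.recv_async(0).result())
```

A third-party backend would subclass the base class and register itself under the chosen entry-point group in its package metadata; calling `.load()` on a discovered entry point then returns the connector class.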

After completing the two core changes, we also propose the following features:

  • Graph mode support: to accelerate AFD on XPUs;

  • Full disaggregation (AF + PD): Introduce PD disaggregation on top of AFD; @hsliuustc0106

  • Unified benchmarking framework: Standardize throughput/latency evaluation and dataset baselines;

  • Elastic and fault-tolerant AFD: Support online scale-out/in of A/F workers while preserving KV consistency and correct routing, improving fault tolerance; @leonkang1 @fangyuchu @wabluy

  • Service management: Provide a unified interface to integrate with solutions like AIBrix and enable reliable scaling management. @chopper0126

We welcome community contributors to discuss and collaborate on the topics above, and to brainstorm further AFD optimizations once the first two changes are complete :)

Feedback Period.

No response

CC List.

@chopper0126 @xiaoshudian555 @wabluy @leonkang1 @fangyuchu @jianzs @hsliuustc0106 @jiangkuaixue123

Any Other Things.

No response
