Description
Motivation.
In large-scale MoE inference, the sparsity benefits of the decoding phase drive continuous expansion of expert parallelism (EP). Prior designs (e.g., DeepEP [1]) improve throughput at scale but have limited scalability because they place EP shards across data-parallel (DP) ranks. Recently, Attention–FFN disaggregation (AFD) has been proposed by ByteDance [2], StepFun [3], and Huawei [4]. The motivation is straightforward: the attention phase is memory-bound, whereas the FFN/expert phase is compute-bound, so a single homogeneous deployment cannot optimize both simultaneously. As EP continues to scale, module-wise heterogeneous placement becomes increasingly beneficial.
Based on these production insights, we propose introducing AFD in vLLM: decouple Attention and FFN/experts at the resource level so they can scale independently, and overlap communication in a "perfectly balanced" pipeline to improve throughput. The initial revision targets an eager-mode MVP without custom kernels: it runs multiple DP replicas on the Attention side and EP replicas on the MoE side, enabling cross-node M2N routing with correctness guarantees. Subsequently, we will provide stable AFD interfaces for communication, load balancing, and elasticity.
[1] https://github.com/deepseek-ai/DeepEP
[2] Zhu R, Jiang Z, Jin C, et al. MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism[J]. arXiv preprint arXiv:2504.02263, 2025.
[3] Wang B, Wang B, Wan C, et al. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding[J]. arXiv preprint arXiv:2507.19427, 2025.
[4] Xiao A, He B, Zhang B, et al. xDeepServe: Model-as-a-Service on Huawei CloudMatrix384[J]. arXiv preprint arXiv:2508.02520, 2025.
Proposed Change.
We propose to start with two core changes:
- AFD on vLLM: provide initial AFD capabilities for DeepSeek-v3 and other MoE models under eager mode, using torch communication APIs for M2N; design doc
- AFConnector interface for XPUs: support custom AF communication backends, e.g., StepMesh; design doc #issues
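To make the M2N routing in the first change concrete, here is a minimal, framework-free sketch of how a single attention rank could group its routed tokens by destination FFN rank before issuing point-to-point sends. The function name and the contiguous expert-sharding assumption are purely illustrative, not the proposed vLLM API.

```python
from collections import defaultdict

def build_m2n_plan(topk_expert_ids, num_experts, num_ffn_ranks):
    """Group token indices by the FFN rank that owns each selected expert.

    topk_expert_ids: per-token list of selected expert ids (top-k routing).
    Experts are assumed to be sharded contiguously across FFN ranks.
    Returns {ffn_rank: [(token_idx, expert_id), ...]}, i.e. the send plan
    for one attention rank; the receive plan on the FFN side is symmetric.
    """
    experts_per_rank = num_experts // num_ffn_ranks
    plan = defaultdict(list)
    for token_idx, expert_ids in enumerate(topk_expert_ids):
        for expert_id in expert_ids:
            ffn_rank = expert_id // experts_per_rank
            plan[ffn_rank].append((token_idx, expert_id))
    return dict(plan)
```

With 8 experts sharded over 2 FFN ranks, a token routed to experts 0 and 5 contributes one entry to each rank's send list; the real implementation would then issue one asynchronous send per destination rank.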
Roadmaps
- AFD on vLLM (MVP) @chopper0126 @xiaoshudian555 @wabluy
  - Disaggregated MoE deployment:
    - Model side: redefine the model structure, weight loading, and module graph;
    - Service side: define a global topology to launch a standalone MoE service with torchrun, and enable communication with Attention nodes (temporarily via P2P);
  - Accuracy evaluation: validate the correctness of outputs;
  - Pipeline support: configurable micro-batch sizes and asynchronous M2N communication;
  - Performance optimizations: integrate features such as MTP and support custom load-balancing strategies;
  - Extensibility: support custom MoE models and refine the codebase for AFConnector.
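The pipeline-support item above can be sketched without any vLLM internals: while the FFN phase for micro-batch i runs in the background (standing in for the async M2N round trip plus expert compute), attention for micro-batch i+1 proceeds, which is exactly the overlap AFD targets. All names here are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipelined(microbatches, attention_fn, ffn_fn):
    """Overlap attention for micro-batch i+1 with FFN work for
    micro-batch i by pushing the FFN phase to a background worker.
    attention_fn and ffn_fn are stand-ins for the real model phases
    (and for the async M2N send/recv wrapped around the FFN phase).
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as ffn_pool:
        pending = None  # in-flight FFN work for the previous micro-batch
        for mb in microbatches:
            hidden = attention_fn(mb)  # overlaps with any pending FFN work
            if pending is not None:
                results.append(pending.result())
            pending = ffn_pool.submit(ffn_fn, hidden)
        if pending is not None:
            results.append(pending.result())
    return results
```

A real scheduler would juggle more than two stages per layer, but the invariant is the same: the attention side never idles waiting for expert results as long as another micro-batch is ready.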
- AFConnector interface @jiangkuaixue123
  - Collect the IPs and ranks of attention and FFN instances and initialize the necessary communication resources;
  - Provide tensor-level async send/recv and all-gather/reduce-scatter APIs to transfer tensors between attention and FFN instances;
  - Keep the interface hardware-agnostic to support various hardware backends, communication libraries, and NICs;
  - Support a plugin mechanism via Python setuptools entry points.
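One possible shape for such an interface, together with a setuptools entry-point lookup for backend plugins. Every name here (the class, its methods, the "vllm.af_connectors" group) is a hypothetical sketch for discussion, not a settled API.

```python
from abc import ABC, abstractmethod
from importlib.metadata import entry_points

class AFConnector(ABC):
    """Hardware-agnostic bridge between attention and FFN instances."""

    @abstractmethod
    def init_comm(self, attn_endpoints, ffn_endpoints, rank):
        """Exchange IPs/ranks of both instance groups and set up channels."""

    @abstractmethod
    def async_send(self, tensor, dst_rank):
        """Start a non-blocking send; return a waitable handle."""

    @abstractmethod
    def async_recv(self, tensor, src_rank):
        """Start a non-blocking recv into a preallocated buffer."""

def load_connector(name):
    """Resolve a backend class registered under a (hypothetical)
    'vllm.af_connectors' entry-point group, which a backend package
    would declare in its own packaging metadata.
    """
    eps = entry_points()
    group = (eps.select(group="vllm.af_connectors")
             if hasattr(eps, "select")                 # Python >= 3.10
             else eps.get("vllm.af_connectors", []))   # older fallback
    for ep in group:
        if ep.name == name:
            return ep.load()
    raise ValueError(f"no AFConnector backend named {name!r}")
```

A vendor backend (e.g., a StepMesh-based one) would then ship as a separate package that subclasses `AFConnector` and registers itself under the entry-point group, keeping vLLM itself free of hardware-specific code.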
After completing the two core changes, we also propose the following features:
- Graph mode support: accelerate AFD on XPUs;
- Full disaggregation (AF + PD): introduce PD disaggregation on top of AFD; @hsliuustc0106
- Unified benchmarking framework: standardize throughput/latency evaluation and dataset baselines;
- Elastic and fault-tolerant AFD: support online scale-out/in of A/F workers while preserving KV consistency and correct routing, improving fault tolerance; @leonkang1 @fangyuchu @wabluy
- Service management: provide a unified interface to integrate with solutions like AIBrix and enable reliable scaling management. @chopper0126
We welcome community contributors to discuss and collaborate on the topics above, and to brainstorm further AFD optimizations once the first two changes land :)
Feedback Period.
No response
CC List.
@chopper0126 @xiaoshudian555 @wabluy @leonkang1 @fangyuchu @jianzs @hsliuustc0106 @jiangkuaixue123
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.