Description
Motivation.
In large-scale MoE inference, the sparsity benefits of the decoding phase drive continuous expansion of expert parallelism (EP). Prior designs (e.g., DeepEP [1]) improve throughput at scale but have limited scalability because they place EP shards across data-parallel (DP) ranks. Recently, Attention–FFN disaggregation (AFD) has been proposed by ByteDance [2], StepFun [3], and Huawei [4]. The motivation is straightforward: the attention phase is memory-bound, whereas the FFN/expert phase is compute-bound, so a single homogeneous deployment cannot optimize both simultaneously. As EP continues to scale, module-wise heterogeneous placement becomes increasingly beneficial.
Based on these production insights, we propose introducing AFD in vLLM: decouple Attention and FFN/experts at the resource level so they can scale independently, and overlap communication in a "perfectly balanced" pipeline to improve throughput. The initial revision targets an eager-mode MVP without custom kernels: it runs multiple DP replicas on the Attention side and EP replicas on the MoE side, enabling cross-node M2N routing with correctness guarantees. Subsequently, we will provide stable AFD interfaces for communication, load balancing, and elasticity.
[1] https://github.com/deepseek-ai/DeepEP
[2] Zhu R, Jiang Z, Jin C, et al. MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism[J]. arXiv preprint arXiv:2504.02263, 2025.
[3] Wang B, Wang B, Wan C, et al. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding[J]. arXiv preprint arXiv:2507.19427, 2025.
[4] Xiao A, He B, Zhang B, et al. xDeepServe: Model-as-a-Service on Huawei CloudMatrix384[J]. arXiv preprint arXiv:2508.02520, 2025.
Proposed Change.
We propose to start with two core changes:
- AFD on vLLM: provide initial AFD capabilities for DeepSeek-v3 and other MoE models under eager mode, using torch communication APIs for M2N; design doc
- AFConnector interface for XPUs: support custom AF communication backends, e.g., StepMesh; design doc #issues
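To make the M2N routing in the first change concrete, here is a minimal, framework-free sketch of how a single attention rank could group its routed tokens by destination FFN rank before issuing point-to-point sends. The function name and the contiguous expert-sharding assumption are purely illustrative, not the proposed vLLM API.

```python
from collections import defaultdict

def build_m2n_plan(topk_expert_ids, num_experts, num_ffn_ranks):
    """Group token indices by the FFN rank that owns each selected expert.

    topk_expert_ids: per-token list of selected expert ids (top-k routing).
    Experts are assumed to be sharded contiguously across FFN ranks.
    Returns {ffn_rank: [(token_idx, expert_id), ...]}, i.e. the send plan
    for one attention rank; the receive plan on the FFN side is symmetric.
    """
    experts_per_rank = num_experts // num_ffn_ranks
    plan = defaultdict(list)
    for token_idx, expert_ids in enumerate(topk_expert_ids):
        for expert_id in expert_ids:
            ffn_rank = expert_id // experts_per_rank
            plan[ffn_rank].append((token_idx, expert_id))
    return dict(plan)
```

With 8 experts sharded over 2 FFN ranks, a token routed to experts 0 and 5 contributes one entry to each rank's send list; the real implementation would then issue one asynchronous send per destination rank.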
Roadmaps
- AFD on vLLM (MVP) @chopper0126 @xiaoshudian555 @wabluy
  - Disaggregated MoE deployment:
    - Model side: redefine the model structure, weight loading, and module graph;
    - Service side: define a global topology to launch a standalone MoE service with torchrun, and enable communication with Attention nodes (temporarily via P2P);
  - Accuracy evaluation: validate the correctness of outputs;
  - Pipeline support: configurable micro-batch sizes and asynchronous M2N communication;
  - Performance optimizations: integrate features such as MTP and support custom load-balancing strategies;
  - Extensibility: support custom MoE models and refine the codebase for AFConnector.
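The pipeline-support item above can be sketched without any vLLM internals: while the FFN phase for micro-batch i runs in the background (standing in for the async M2N round trip plus expert compute), attention for micro-batch i+1 proceeds, which is exactly the overlap AFD targets. All names here are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipelined(microbatches, attention_fn, ffn_fn):
    """Overlap attention for micro-batch i+1 with FFN work for
    micro-batch i by pushing the FFN phase to a background worker.
    attention_fn and ffn_fn are stand-ins for the real model phases
    (and for the async M2N send/recv wrapped around the FFN phase).
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as ffn_pool:
        pending = None  # in-flight FFN work for the previous micro-batch
        for mb in microbatches:
            hidden = attention_fn(mb)  # overlaps with any pending FFN work
            if pending is not None:
                results.append(pending.result())
            pending = ffn_pool.submit(ffn_fn, hidden)
        if pending is not None:
            results.append(pending.result())
    return results
```

A real scheduler would juggle more than two stages per layer, but the invariant is the same: the attention side never idles waiting for expert results as long as another micro-batch is ready.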
- AFConnector interface @jiangkuaixue123
  - Collect the IPs and ranks of attention and FFN instances and initialize the necessary communication resources;
  - Provide tensor-level async send/recv and all-gather/reduce-scatter APIs to transfer tensors between attention and FFN instances;
  - Keep the interface hardware-agnostic to support various hardware backends, communication libraries, and NICs;
  - Support a plugin mechanism via Python setuptools entry points.
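One possible shape for such an interface, together with a setuptools entry-point lookup for backend plugins. Every name here (the class, its methods, the "vllm.af_connectors" group) is a hypothetical sketch for discussion, not a settled API.

```python
from abc import ABC, abstractmethod
from importlib.metadata import entry_points

class AFConnector(ABC):
    """Hardware-agnostic bridge between attention and FFN instances."""

    @abstractmethod
    def init_comm(self, attn_endpoints, ffn_endpoints, rank):
        """Exchange IPs/ranks of both instance groups and set up channels."""

    @abstractmethod
    def async_send(self, tensor, dst_rank):
        """Start a non-blocking send; return a waitable handle."""

    @abstractmethod
    def async_recv(self, tensor, src_rank):
        """Start a non-blocking recv into a preallocated buffer."""

def load_connector(name):
    """Resolve a backend class registered under a (hypothetical)
    'vllm.af_connectors' entry-point group, which a backend package
    would declare in its own packaging metadata.
    """
    eps = entry_points()
    group = (eps.select(group="vllm.af_connectors")
             if hasattr(eps, "select")                 # Python >= 3.10
             else eps.get("vllm.af_connectors", []))   # older fallback
    for ep in group:
        if ep.name == name:
            return ep.load()
    raise ValueError(f"no AFConnector backend named {name!r}")
```

A vendor backend (e.g., a StepMesh-based one) would then ship as a separate package that subclasses `AFConnector` and registers itself under the entry-point group, keeping vLLM itself free of hardware-specific code.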
After completing the two core changes, we also propose the following features:
- Graph mode support: accelerate AFD on XPUs;
- Full disaggregation (AF + PD): introduce PD disaggregation on top of AFD; @hsliuustc0106
- Unified benchmarking framework: standardize throughput/latency evaluation and dataset baselines;
- Elastic and fault-tolerant AFD: support online scale-out/in of A/F workers while preserving KV consistency and correct routing, improving fault tolerance; @leonkang1 @fangyuchu @wabluy
- Service management: provide a unified interface to integrate with solutions like AIBrix and enable reliable scaling management. @chopper0126
We welcome community contributors to discuss and collaborate on the topics above, and to brainstorm further AFD optimizations once the first two changes land :)
Feedback Period.
No response
CC List.
@chopper0126 @xiaoshudian555 @wabluy @leonkang1 @fangyuchu @jianzs @hsliuustc0106 @jiangkuaixue123
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.