
Conversation

@cakeng
Contributor

@cakeng cakeng commented Jan 30, 2025

The current vLLM execution only supports TP when running MoE models.

This PR adds support for Expert Parallelism (EP) in the FusedMoE kernel and the DeepSeek V2 model, which should be extendable to V3 and other MoE models as well.

When VLLM_TEST_ENABLE_EP=1, EP is automatically applied to the FusedMoE layers, using the user-specified tensor_parallel_size as the expert parallel size.

CUDA graph works with EP.
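
A minimal offline-inference sketch of the enablement path described above, assuming the environment variable from this PR; the model id, TP size, and prompt are illustrative placeholders, not part of the PR:

```python
# Sketch only: EP is toggled via the env var described in this PR. The model id,
# tensor_parallel_size, and prompt are illustrative assumptions.
import os

os.environ["VLLM_TEST_ENABLE_EP"] = "1"  # EP size is taken from tensor_parallel_size

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # assumed model id for illustration
    tensor_parallel_size=2,                # reused as the expert-parallel size
    trust_remote_code=True,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```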

Doc

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

Collaborator

@comaniac comaniac left a comment


Left some comments but overall LGTM

@youkaichao
Member

Need to check the user interface, but I feel right now we can just reuse the TP size for EP?

In the future, when we have DP, the EP size will automatically be DP x TP.
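
A tiny illustration (not vLLM code) of the sizing relationship described in this comment; the helper name is hypothetical:

```python
# Hypothetical helper illustrating the sizing discussed above, not vLLM code:
# today EP reuses the TP size; once DP exists, EP would span DP x TP ranks.
def expert_parallel_size(tp_size: int, dp_size: int = 1) -> int:
    return dp_size * tp_size

assert expert_parallel_size(tp_size=8) == 8               # current: EP size == TP size
assert expert_parallel_size(tp_size=8, dp_size=2) == 16   # future: EP size == DP x TP
```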

@LucasWilkinson
Collaborator

> but the problem with CUDA graph seems to be on the current attention layer (MLA?) implementation.

Can you please elaborate on this a bit? MLA + CUDA graphs + TP is working fine on main as far as I am aware.

@mergify mergify bot added the v1 label Feb 4, 2025
@cakeng
Contributor Author

cakeng commented Feb 4, 2025

@youkaichao The current design supports TP within EP, but we can easily change that to have EP only on the MoE layers. I think we will need more discussion with others on that design decision; the current implementation of EP+TP is based on a discussion with @WoosukKwon and @simon-mo.

@LucasWilkinson I just merged the main branch and CUDA graph is now working with EP+TP.

@simon-mo
Collaborator

@cakeng a test failure in CI

https://buildkite.com/vllm/ci/builds/14038#019534e8-1eef-4be1-8e89-3f718502ddb4/2582-4097

[2025-02-23T23:32:34Z] FAILED models/test_initialization.py::test_can_initialize[QuantMixtralForCausalLM] - AttributeError: 'FusedMoE' object has no attribute 'num_experts'

…o benchmark_moe.py

@simon-mo simon-mo merged commit 781096e into vllm-project:main Feb 24, 2025
54 of 55 checks passed
@simon-mo simon-mo changed the title Expert Parallelism (EP) Support for DeepSeek V2 Expert Parallelism (EP) Support for DeepSeek Models Feb 25, 2025
@cakeng cakeng deleted the moe branch February 26, 2025 08:33
@liweiqing1997

Hello, I would like to ask about the communication overhead of the current implementation.

It seems that regardless of whether EP (Expert Parallelism) is enabled for the MoE layers, each rank still performs an all-reduce on the hidden states, so the amount of data transmitted is the same whether EP is enabled or not. The only difference is the communication latency, because all ranks need to synchronize and wait for the MoE computation to finish when EP is enabled. I'm not sure if my analysis is correct.

I noticed that after enabling EP on a 16-card setup, the time taken by the all-reduce kernel doubled and throughput dropped by 30%.
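
A back-of-envelope sketch of the data-volume point in this comment; the token count, hidden size, and dtype are illustrative assumptions, not measurements from the report above:

```python
# Rough arithmetic only: per this analysis, the hidden-state all-reduce moves the
# same number of bytes per MoE layer whether EP is on or off.
num_tokens, hidden_size, bytes_per_elem = 4096, 5120, 2   # assumed batch, dim, bf16

allreduce_payload = num_tokens * hidden_size * bytes_per_elem
print(f"hidden-state all-reduce payload per MoE layer: {allreduce_payload / 2**20:.1f} MiB")
```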

@lewisword

lewisword commented Mar 14, 2025

May I ask if this feature can be used for online serving? I see from the example in examples/offline_inference/data_parallel.py that it uses an offline, multi-process invocation approach. @cakeng

@cakeng
Contributor Author

cakeng commented Mar 18, 2025

@lewisword Yes, it should work, but the API to enable EP has changed to the --enable-expert-parallel engine argument (#14305). Also, EP performance is not necessarily better than TP in the current implementation.
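
A minimal sketch of the newer enablement path mentioned above, assuming the --enable-expert-parallel engine argument is also accepted as an LLM constructor keyword; the model id and TP size are illustrative:

```python
# Sketch under the assumption that the engine argument is exposed as a kwarg;
# model id and parallel sizes are illustrative.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    tensor_parallel_size=2,
    enable_expert_parallel=True,   # replaces the earlier VLLM_TEST_ENABLE_EP=1 env var
    trust_remote_code=True,
)
```

For online serving, the same engine argument should be usable on the server command line, which would address the question above about a service-oriented setup.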

@xiuxin121

It looks like vLLM EP has not yet implemented overlap between communication (dispatch + combine) and computation? This seems very important for MoE models like DeepSeek.
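
A generic illustration (not vLLM code) of the communication/computation overlap this comment asks about, assuming torch.distributed has already been initialized and the tensors are on the GPU; all names are hypothetical:

```python
# Generic overlap sketch: launch the dispatch all-to-all asynchronously on a side
# stream while independent compute runs on the default stream, then synchronize.
# Assumes dist.init_process_group(...) was called and all tensors are CUDA tensors.
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()

def dispatch_with_overlap(tokens_to_send, recv_buffer, independent_input, weight):
    with torch.cuda.stream(comm_stream):
        work = dist.all_to_all_single(recv_buffer, tokens_to_send, async_op=True)
    # Compute that does not depend on the dispatched tokens proceeds in parallel.
    independent_out = independent_input @ weight
    work.wait()                                        # dispatch must finish here
    torch.cuda.current_stream().wait_stream(comm_stream)
    return recv_buffer, independent_out
```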

@Neo9061

Neo9061 commented Apr 20, 2025

Can I double check whether this feature only works with --enforce-eager? When I turn this flag off, there is a Dynamo-related error.

Asking because the last statement of the PR description says it works with CUDA graph.

shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
