Expert Parallelism (EP) Support for DeepSeek Models #12583
Conversation
Left some comments, but overall LGTM.
Need to check the user interface, but I feel right now we can just reuse the TP size for EP? In the future, when we have DP, the EP size will automatically be DP x TP.
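A minimal sketch of the sizing rule discussed here, with a hypothetical helper name (this is not vLLM's API): today the EP size simply reuses tensor_parallel_size, and once data parallelism exists it would grow to DP x TP.

```python
# Hypothetical helper illustrating the sizing rule above; name and signature are illustrative.
def expert_parallel_size(tensor_parallel_size: int, data_parallel_size: int = 1) -> int:
    # Today: EP size == TP size (data_parallel_size defaults to 1).
    # With DP support: EP size == DP x TP.
    return data_parallel_size * tensor_parallel_size
```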
Can you please elaborate on this a bit? MLA + CUDA graphs + TP is working fine on main as far as I am aware.
@youkaichao The current design supports TP within an EP group, but we can easily change that to have EP only on the MoE layers. I think we will need more discussion with others on that design decision; the current implementation of EP+TP is based on a discussion with @WoosukKwon and @simon-mo. @LucasWilkinson I just merged the main branch, and CUDA graph is now working with EP+TP.
@cakeng a test failure in CI: https://buildkite.com/vllm/ci/builds/14038#019534e8-1eef-4be1-8e89-3f718502ddb4/2582-4097
Hello, I would like to ask about the communication overhead of the current implementation. It seems that regardless of whether the MoE's EP (Expert Parallelism) is enabled, each rank still performs an all-reduce on the hidden states, so the amount of data transmitted is the same whether EP is enabled or not. I noticed that after enabling EP on a 16-GPU setup, the time taken by the all-reduce kernel doubled and throughput dropped by 30%.
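A rough back-of-the-envelope sketch of the point being made: the all-reduce runs over the dense hidden states, whose size depends only on the token count and hidden size, not on where the experts live, so the per-layer volume is the same with or without EP. The function below uses the standard ring all-reduce cost model and is purely illustrative.

```python
def ring_allreduce_bytes_per_rank(num_tokens: int, hidden_size: int,
                                  world_size: int, dtype_bytes: int = 2) -> float:
    """Approximate bytes each rank sends for one all-reduce of the hidden states."""
    tensor_bytes = num_tokens * hidden_size * dtype_bytes
    # Ring all-reduce: each rank sends roughly 2 * (N - 1) / N of the tensor.
    return tensor_bytes * 2 * (world_size - 1) / world_size

# Identical with or without EP, since the reduced tensor is the same dense hidden states.
print(ring_allreduce_bytes_per_rank(num_tokens=4096, hidden_size=5120, world_size=16))
```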
May I ask if this feature can be used for online serving? I see from the example in
@lewisword Yes, it should work, but the API to enable EP has been changed to the
It looks like vLLM's EP has not yet implemented overlap between communication (dispatch + combine) and computation? This seems very important for an MoE model like DeepSeek.
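For reference, a minimal sketch of the kind of overlap being asked about, assuming an already-initialized torch.distributed NCCL process group and a hypothetical `experts` callable; this is not what the PR implements.

```python
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()

def overlapped_moe_step(dispatch_buf: torch.Tensor, compute_buf: torch.Tensor, experts):
    """Overlap the all-to-all dispatch of one token chunk with expert compute on another."""
    recv_buf = torch.empty_like(dispatch_buf)
    with torch.cuda.stream(comm_stream):
        # Dispatch: send this chunk's tokens to the ranks owning their experts.
        dist.all_to_all_single(recv_buf, dispatch_buf)
    # Meanwhile, run expert computation for the other chunk on the default stream.
    local_out = experts(compute_buf)
    # Make sure the dispatch has finished before the received tokens are consumed.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return recv_buf, local_out
```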
Can I double-check whether this feature only works with --enforce-eager? When I turn this flag off, there is a Dynamo-related error. I am asking because the last statement of this PR description says it works with CUDA graphs.
The current vLLM execution only supports TP when running MoE models.
This PR adds support for Expert Parallelism (EP) for the FusedMoE Kernel and DeepSeek V2 model, which should be extendable to V3 and other MoE models as well.
When VLLM_TEST_ENABLE_EP=1, EP is automatically applied to the FusedMoE layers, using the user-specified tensor_parallel_size as the expert parallel size.
CUDA graph works with EP.
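A minimal usage sketch based on the description above; the model name is only an example, and VLLM_TEST_ENABLE_EP is assumed to take effect when set before vLLM is imported.

```python
import os
os.environ["VLLM_TEST_ENABLE_EP"] = "1"  # enable EP on the FusedMoE layers

from vllm import LLM, SamplingParams

# tensor_parallel_size doubles as the expert parallel size in this PR.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # example MoE model
    tensor_parallel_size=8,
    trust_remote_code=True,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```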