[Feature] Expert Parallelism Load Balancer (EPLB) #18343
Conversation
WIP, design choices not finalized. Signed-off-by: Bowen Wang <abmfy@icloud.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Moved into `FusedMoE` layers Signed-off-by: Bowen Wang <abmfy@icloud.com>
Since `grouped_topk` will assume top-2 for DeepSeek-V3 Signed-off-by: Bowen Wang <abmfy@icloud.com>
🎉 So happy to see this PR finally merged after going through so many challenges — big round of applause for the researcher's persistence and dedication! @abmfy 👏👏👏 Also, just wondering — how can we measure the benefits brought by EPLB? 🤔
Hi @ztxdcyy, thanks for your attention! There's now a default-off option for logging the balancedness of expert load. As for other metrics, I believe they're not specific to the EPLB settings, so we can simply rely on standard metrics by running benchmarks and monitoring those results as usual. Let me know what you think!
@abmfy Hello, I'm encountering the following error when using multi-GPU parallel processing. This issue doesn't occur when starting with a single GPU and only appears during multi-GPU parallel processing. Have you encountered this before, or do you have any solutions?
Hi @Lichunyan3, sorry for the late reply; I was traveling. It looks like you may have missed enabling expert parallelism (EP). We've added some checks in #21102, so if EPLB is enabled without EP, it will now raise an error.
Did you test how balancedness improves in `benchmark_serving.py`? It's a random dataset; will there be a significant improvement?


This PR introduces support for dynamic load balancing in expert parallelism (EP) for the deployment of Mixture-of-Experts (MoE) models.
Dynamic load balancing is essential for auxiliary-loss-free MoE models, such as the DeepSeek-V3/R1 series. This feature enables dynamic rearrangement of experts across different ranks/nodes to achieve better load balance during inference.
Additionally, this PR introduces support for redundant experts, allowing each routed expert to maintain multiple parameter copies distributed across different ranks. This further improves expert load balancing.
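To make the redundant-experts idea concrete, here is a toy illustration (not vLLM's actual EPLB placement algorithm; the `toy_replicate` helper is hypothetical): extra physical slots are handed to the currently hottest logical experts, and the resulting physical experts are then dealt out across EP ranks.

```python
# Toy illustration of redundant experts (not vLLM's EPLB algorithm):
# extra physical slots go to the most heavily loaded logical experts,
# and the physical experts are then distributed round-robin across ranks.
import torch


def toy_replicate(expert_load: torch.Tensor, num_physical: int,
                  num_ranks: int) -> list[list[int]]:
    num_logical = expert_load.numel()
    # Start with one physical copy per logical expert...
    physical_to_logical = list(range(num_logical))
    # ...and hand each spare slot to the currently hottest expert.
    load_per_copy = expert_load.clone().float()
    for _ in range(num_physical - num_logical):
        hot = int(torch.argmax(load_per_copy))
        physical_to_logical.append(hot)
        load_per_copy[hot] /= 2  # its load is now shared by one more copy
    # Deal physical experts onto ranks.
    per_rank = num_physical // num_ranks
    return [physical_to_logical[r * per_rank:(r + 1) * per_rank]
            for r in range(num_ranks)]


# 4 logical experts, expert 0 is hot; 6 physical slots over 2 ranks.
# Prints [[0, 1, 2], [3, 0, 0]]: the hot expert gets two extra replicas.
print(toy_replicate(torch.tensor([100, 10, 10, 10]), 6, 2))
```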
Running
To try out EPLB, enable it with the following options:
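For example, an offline-inference launch might look roughly like the sketch below. The EPLB-related argument names (`enable_eplb`, `num_redundant_experts`) are my reading of this PR and may differ in your vLLM version, so check `EngineArgs` for the exact spelling.

```python
# Sketch of enabling EPLB for offline inference; the EPLB-related arguments
# (enable_eplb, num_redundant_experts) are assumed from this PR and may
# differ in your vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,
    enable_expert_parallel=True,   # EPLB requires expert parallelism (EP)
    enable_eplb=True,              # turn on the load balancer
    num_redundant_experts=32,      # extra physical expert copies to spread load
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```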
You should see a log message indicating that EPLB is enabled, as well as periodic logs showing the rearrangement of experts.
Compatibility
Currently, we support DeepSeek-V2, V3, and R1 models with FP8 quantization. However, this PR has been designed with generality in mind, so extending support to other MoE models or quantization methods should be straightforward.
Adding model support:
To add support for a new model, implement the `MixtureOfExperts` protocol. In essence, you'll need to expose the model's MoE configuration and the experts of each `FusedMoE` layer.

Note: Pay close attention to the weight-loading logic. With redundant experts, you'll need to handle additional complexity to ensure weights are loaded correctly. The `expert_params_mapping` returned by `FusedMoE` reflects the presence of redundant experts, but you may need to implement some nontrivial adjustments in the model class to prevent breaking the weight-loading process. You can refer to the implementation changes in deepseek_v2.py.
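As a rough sketch of what this involves, the class below shows the general shape of an EPLB-aware model. The attribute and method names (`expert_weights`, `set_eplb_state`, the various counts) are assumptions modeled on this PR rather than a verbatim copy of the `MixtureOfExperts` interface, so consult the vLLM source and deepseek_v2.py for the real definitions.

```python
# Hypothetical sketch of an EPLB-aware model class; names are illustrative
# assumptions, not the verified vLLM MixtureOfExperts protocol.
from typing import Iterable

import torch
from torch import nn


class MyMoEModelForCausalLM(nn.Module):
    def __init__(self, num_moe_layers: int, num_logical_experts: int,
                 num_redundant_experts: int) -> None:
        super().__init__()
        # Static MoE bookkeeping the load balancer needs up front.
        self.num_moe_layers = num_moe_layers
        self.num_logical_experts = num_logical_experts
        self.num_physical_experts = num_logical_experts + num_redundant_experts
        self.num_redundant_experts = num_redundant_experts
        # Stand-ins for the model's real FusedMoE layers.
        self.moe_layers = nn.ModuleList(nn.Module() for _ in range(num_moe_layers))

    @property
    def expert_weights(self) -> list[Iterable[torch.Tensor]]:
        # Expose each layer's expert weight tensors so the balancer can
        # physically move experts between ranks when it rebalances.
        return [getattr(layer, "expert_weights", ()) for layer in self.moe_layers]

    def set_eplb_state(
        self,
        expert_load_view: torch.Tensor,
        logical_to_physical_map: torch.Tensor,
        logical_replica_count: torch.Tensor,
    ) -> None:
        # Hand the shared EPLB state to every MoE layer so expert selection
        # can record load and resolve logical experts to physical replicas.
        for layer in self.moe_layers:
            layer.eplb_state = (expert_load_view, logical_to_physical_map,
                                logical_replica_count)
```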
Adding quantization support:
Adding quantization support should be straightforward, as it mainly involves forwarding the necessary arguments.
See the changes in fp8.py for reference.
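A minimal sketch of that forwarding pattern, with hypothetical names (`quantized_moe_apply` and the EPLB-related keyword arguments) standing in for the real fp8.py signatures:

```python
# Hypothetical sketch of the "forward the arguments" pattern a quantized MoE
# method follows; names and signatures are illustrative assumptions, not an
# exact copy of vLLM's fp8.py.
from typing import Callable, Optional

import torch


def quantized_moe_apply(
    x: torch.Tensor,
    router_logits: torch.Tensor,
    top_k: int,
    select_experts: Callable[..., tuple[torch.Tensor, torch.Tensor]],
    fused_experts: Callable[..., torch.Tensor],
    *,
    enable_eplb: bool = False,
    expert_load_view: Optional[torch.Tensor] = None,
    logical_to_physical_map: Optional[torch.Tensor] = None,
    logical_replica_count: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    # The quantization method adds nothing EPLB-specific: it just threads the
    # EPLB state through to expert selection, which records per-expert load
    # and maps logical expert ids to physical replicas.
    topk_weights, topk_ids = select_experts(
        router_logits,
        top_k,
        enable_eplb=enable_eplb,
        expert_load_view=expert_load_view,
        logical_to_physical_map=logical_to_physical_map,
        logical_replica_count=logical_replica_count,
    )
    # The quantized kernel then runs on physical expert ids as usual.
    return fused_experts(x, topk_weights, topk_ids)
```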
We welcome contributions to help add support for additional models and quantization methods!
To-Dos
To-Do List for this PR:
Long-term To-Do List (should be done in other PRs):
When using `FusedMoEModularKernel`, we can directly use the load metrics returned by `FusedMoEPrepareAndFinalize` instead of calculating them inside expert selection. We're not doing this yet, since not all code paths use `FusedMoEModularKernel` now.