abmfy (Member) commented May 19, 2025

This PR introduces support for dynamic load balancing in expert parallelism (EP) for the deployment of Mixture-of-Experts (MoE) models.

Dynamic load balancing is essential for auxiliary-loss-free MoE models, such as the DeepSeek-V3/R1 series. This feature enables dynamic rearrangement of experts across different ranks/nodes to achieve better load balance during inference.

Additionally, this PR introduces support for redundant experts, allowing each routed expert to maintain multiple parameter copies distributed across different ranks. This further improves expert load balancing.
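To make the redundant-experts idea concrete, here is a minimal sketch (a hypothetical helper, not the PR's actual code) of mapping logical experts to physical slots by greedily giving the extra replicas to the most-loaded experts:

```python
def build_physical_expert_map(loads, num_redundant):
    """Assign each logical expert one physical slot, then hand the
    redundant slots to whichever expert has the highest per-replica load."""
    num_logical = len(loads)
    replicas = [1] * num_logical
    for _ in range(num_redundant):
        # Pick the expert whose load per replica is currently highest.
        hottest = max(range(num_logical), key=lambda e: loads[e] / replicas[e])
        replicas[hottest] += 1
    # Flatten into a physical-slot -> logical-expert mapping.
    physical_to_logical = []
    for expert, count in enumerate(replicas):
        physical_to_logical.extend([expert] * count)
    return physical_to_logical

# Expert 1 carries most of the load, so it receives both redundant slots.
print(build_physical_expert_map([10, 80, 10, 20], num_redundant=2))
# → [0, 1, 1, 1, 2, 3]
```

The actual EPLB rearrangement also has to respect rank placement and group constraints; this sketch only illustrates the replication idea.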

Running

To try out EPLB, enable it with the following options:

```
--enable-eplb
--num-redundant-experts 32
--eplb-window-size 1000
--eplb-step-interval 3000
```
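Putting the flags together, a launch command might look like the sketch below (the model name and tensor-parallel size are illustrative placeholders; note that EPLB also requires expert parallelism to be enabled):

```shell
vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --enable-eplb \
    --num-redundant-experts 32 \
    --eplb-window-size 1000 \
    --eplb-step-interval 3000
```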

You should see a log message indicating that EPLB is enabled, as well as periodic logs showing the rearrangement of experts.

Compatibility

Currently, we support DeepSeek-V2, V3, and R1 models with FP8 quantization. However, this PR has been designed with generality in mind, so extending support to other MoE models or quantization methods should be straightforward.

Adding model support:

To add support for a new model, implement the MixtureOfExperts protocol. In essence, you’ll need to:

  • Expose relevant MoE configurations.
  • Provide access to the expert weights that need to be shuffled.
  • Forward EPLB-related information into the FusedMoE layer.
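The shape of such a protocol can be sketched structurally; the member names below are assumptions for illustration, not vLLM's actual definitions (see the real protocol in vLLM's model interfaces module):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class MixtureOfExperts(Protocol):
    """Illustrative sketch only; member names are assumptions."""

    # MoE configuration the EPLB rebalancer needs to see.
    num_moe_layers: int
    num_logical_experts: int
    num_physical_experts: int

    @property
    def expert_weights(self):
        """Per-layer expert weight tensors that EPLB may shuffle."""
        ...


class ToyMoEModel:
    """A dummy model that satisfies the structural check above."""

    num_moe_layers = 2
    num_logical_experts = 8
    num_physical_experts = 10  # 8 logical experts + 2 redundant copies

    @property
    def expert_weights(self):
        # Placeholder for the real per-layer weight tensors.
        return [[None] for _ in range(self.num_moe_layers)]


print(isinstance(ToyMoEModel(), MixtureOfExperts))  # → True
```

Because the protocol is structural, a model class opts in simply by exposing the required members; no inheritance is needed.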

Note: Pay close attention to the weight-loading logic. With redundant experts, there is extra complexity in ensuring weights are loaded correctly. The expert_params_mapping returned by FusedMoE already accounts for redundant experts, but you may still need nontrivial adjustments in the model class to avoid breaking the weight-loading process.

You can refer to the implementation changes in deepseek_v2.py.

Adding quantization support:

Adding quantization support should be straightforward, as it mainly involves forwarding the necessary arguments.

See the changes in fp8.py for reference.

We welcome contributions to help add support for additional models and quantization methods!

To-Dos

To-Do List for this PR:

  • Implement replicated experts in fused MoE operations
  • Monitor expert balancedness in metrics
  • Remove magic numbers
  • Allow turning off monitoring since it brings some overhead

Long-term To-Do List (should be done in other PRs):

  • Model Execution
    • When using FusedMoEModularKernel, we can directly use the load metrics returned by FusedMoEPrepareAndFinalize instead of calculating them inside expert selection. We are not doing this yet because not all code paths use FusedMoEModularKernel.
  • EPLB Algorithm
    • Add other rebalancing strategies, e.g. rebalancing when balancedness falls below a threshold
    • Consider treating prefill and decode nodes differently in the rearrangement algorithm
  • EPLB Execution
    • Parallelize the rearrangement algorithm (calculating new expert mapping, not the communication)
    • Shuffle one layer at a time over multiple steps, to reduce the impact on inter-token latency
    • Investigate whether to pre-allocate the expert weight buffer used for transfers
    • Take locality into consideration in expert weight transmission, e.g. prioritize transferring to GPUs on the same node
  • Compatibility
    • Add support for DeepSeek Multi-Token Prediction (MTP) layers
    • Add support for two-batch overlap ([WIP] Two batch overlap #18415)
    • Add support for other MoE models, e.g. Llama 4, Qwen3
    • Add support for other quantization methods
  • API
    • Group EPLB configs together

abmfy added 7 commits May 14, 2025 16:29
WIP, design choices not finalized.
@mergify mergify bot added the v1 label May 19, 2025

mergify bot commented May 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @abmfy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

abmfy added 2 commits May 20, 2025 18:02
abmfy added 11 commits May 23, 2025 20:19
Moved into `FusedMoE` layers

Since `grouped_topk` will assume top-2 for DeepSeek-V3

WoosukKwon merged commit e9fd658 into vllm-project:main on Jun 26, 2025
96 of 101 checks passed

ztxdcyy commented Jun 27, 2025

🎉 So happy to see this PR finally merged after going through so many challenges — big round of applause for the researcher's persistence and dedication! @abmfy 👏👏👏

Also, just wondering — how can we measure the benefits brought by EPLB? 🤔
Things like expert balancing, GPU utilization, TTFT, TPOT... Any suggestions or best practices? 💡

abmfy (Member, Author) commented Jun 27, 2025


Hi @ztxdcyy, thanks for your attention!

There’s now a default-off option --eplb-log-balancedness that logs the load balance factor across different GPUs at each step.
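As a point of reference, balancedness metrics of this kind are typically defined as mean load over max load across GPUs. A hypothetical sketch (not necessarily the exact formula behind the flag):

```python
def balancedness(tokens_per_gpu):
    """Mean-over-max load ratio: 1.0 means perfectly balanced, while
    values near 1/num_gpus mean one GPU is doing almost all the work."""
    peak = max(tokens_per_gpu)
    if peak == 0:
        return 1.0  # no load at all counts as balanced
    return sum(tokens_per_gpu) / len(tokens_per_gpu) / peak

print(balancedness([100, 100, 100, 100]))  # → 1.0
print(balancedness([400, 0, 0, 0]))        # → 0.25
```

Tracking this value before and after enabling EPLB gives a direct view of how much the rearrangement helps.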

As for other metrics, I believe they’re not specific to the EPLB settings, so we can simply rely on standard metrics by running benchmarks and monitoring those results as usual.

Let me know what you think!

@Lichunyan3

@abmfy Hello, I'm encountering the following error when using multi-GPU parallel processing
Here's my startup command:
```
python -m vllm.entrypoints.openai.api_server \
    --model="/public/models/hf_models/DeepSeek-V2-Lite-Chat-FP8-A16" \
    --trust-remote-code -tp 2 -dp 2 --port 8200 \
    --enforce-eager --enable-eplb --eplb-log-balancedness
```

This issue doesn't occur when starting with a single GPU and only appears during multi-GPU parallel processing. Have you encountered this before, or do you have any solutions?

(screenshot of the error omitted)

abmfy (Member, Author) commented Jul 31, 2025


Hi @Lichunyan3, sorry for the late reply—I was traveling.

It looks like you may have missed adding --enable-expert-parallel; EPLB requires running under EP.

We’ve added some checks in #21102, so if EPLB is enabled without EP, it will now raise an error.

@Bounty-hunter

Did you test how balancedness improves with benchmark_serving.py? Since it uses a random dataset, will there be a significant improvement?

Labels

ci/build · performance (Performance-related issues) · qwen (Related to Qwen models) · ready (ONLY add when PR is ready to merge/full CI is needed) · v1
