[CI] MoE alltoall communication optimization #1067
Conversation
for the unquantized scenario. Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
for the unquantized scenario. Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
…-ascend into moe_alltoall_v6 # Conflicts: # vllm_ascend/ops/fused_moe.py
If this change may cause an accuracy regression, please add a flag to make this behaviour controllable. We should give the choice back to the user.
…-ascend into moe_alltoall_v6 # Conflicts: # vllm_ascend/ops/fused_moe.py
This pull request has conflicts, please resolve those before we can evaluate the pull request.
[CI] MoE alltoall communication optimization
The DeepSeek V3/R1 model has 256 routed experts. During parallel inference, if one EP rank carries a high load, it slows down the overall communication and computation time, so uneven load distribution becomes a weakness of parallel inference. In the prefill phase the data volume is large, and both the inter-card communication time and the computation time are closely tied to that data volume. A small, non-linear precision loss can therefore be traded for a near-linear performance improvement.
During parallel inference, communication involves global synchronization: the cards with low load finish their computation first and then wait for the most heavily loaded card to finish. If the load is unbalanced, the most loaded card therefore dominates the overall latency. Significant performance gains can be achieved by discarding a small number of tokens. This is unacceptable in precision-sensitive scenarios, but, similar to quantization, it is a technique that trades an acceptable precision loss for performance where that trade-off fits. The balance between performance and precision can be tuned by configuring the proportion of tokens that may be discarded, as sketched below.
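To make the idea concrete, here is a minimal, hypothetical sketch of capacity-based token dropping before the MoE all-to-all. It is not the code in vllm_ascend/ops/fused_moe.py; the function name `drop_overflow_topk`, its arguments, and the capacity formula are assumptions for illustration only. Top-k slots routed to an EP rank beyond its capacity are marked -1 and skipped by the dispatch, so no rank receives much more than the balanced share of tokens.

```python
import torch


def drop_overflow_topk(topk_ids: torch.Tensor,
                       num_experts: int,
                       ep_size: int,
                       capacity_factor: float = 1.1) -> torch.Tensor:
    """Mark top-k slots that exceed an EP rank's capacity as dropped (-1).

    topk_ids: [num_tokens, top_k] expert indices produced by the router.
    capacity_factor: headroom over the perfectly balanced per-rank load;
        lowering it drops more tokens and flattens the all-to-all further.
    """
    num_tokens, top_k = topk_ids.shape
    experts_per_rank = num_experts // ep_size
    rank_ids = (topk_ids // experts_per_rank).flatten()  # owning EP rank per slot

    # Per-rank capacity: balanced share of all routed slots, plus some headroom.
    capacity = int(num_tokens * top_k / ep_size * capacity_factor)

    # Position of each slot inside its rank's queue; slots past `capacity` are dropped.
    order = torch.empty_like(rank_ids)
    for rank in range(ep_size):
        mask = rank_ids == rank
        order[mask] = torch.arange(int(mask.sum()), device=topk_ids.device)

    dropped = (order >= capacity).reshape(num_tokens, top_k)
    return topk_ids.masked_fill(dropped, -1)


# Rough usage matching the test setup below (B=8, S=3.5K, 256 experts, EP=16).
topk_ids = torch.randint(0, 256, (8 * 3584, 8))
balanced_ids = drop_overflow_topk(topk_ids, num_experts=256, ep_size=16)
```

In practice the drop proportion (here, via `capacity_factor`) would be exposed as a user-facing knob, in line with the review comment above, so precision-sensitive deployments can disable dropping entirely.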
The test was run on A3 with batch size 8 (B), prompt length 3.5K tokens (S), and the following parallel configuration: AttnDP=2, AttnTP=8, MoeTP=1, MoeEP=16. In this scenario we measured a 10%-15% performance gain.
In addition, the next version will add an alltoallv-based MoE.