[Perf][MoE] Improve MoE multistream parallel performance. #1891
Conversation
17b02d0 to c58d29b
Codecov Report
❌ Patch coverage is …
Additional details and impacted files

@@           Coverage Diff            @@
##             main    #1891      +/- ##
=========================================
- Coverage   73.85%   73.72%   -0.13%
=========================================
  Files         103      103
  Lines       11425    11450      +25
=========================================
+ Hits         8438     8442       +4
- Misses       2987     3008      +21
dispose_tensor, get_fused_moe_state)
...
def apply_mlp_decode(hidden_states_wrapper: List[torch.Tensor],
For the apply_mlp_decode function, the first parameter no longer needs to be an array. You can refer to existing apply_mlp implementations.
Thanks, I will change it.
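For readers following the thread, here is a minimal sketch of the two signatures under discussion. The bodies and function names (apply_mlp_decode_wrapped, apply_mlp_decode_plain) are hypothetical stand-ins, not the vllm-ascend implementation; the list wrapper is presumably there so the callee can pop the tensor and let the caller's activation be freed early, and the plain-tensor form drops that indirection.

```python
from typing import List

import torch


def apply_mlp_decode_wrapped(hidden_states_wrapper: List[torch.Tensor]) -> torch.Tensor:
    # Wrapper-list form: popping the tensor empties the caller's container,
    # so the input activation can be released as soon as it is consumed here.
    hidden_states = hidden_states_wrapper.pop()
    return hidden_states + 1  # stand-in for the real quantized MLP


def apply_mlp_decode_plain(hidden_states: torch.Tensor) -> torch.Tensor:
    # Plain-tensor form, matching the existing apply_mlp signature the
    # reviewer points to.
    return hidden_states + 1  # stand-in for the real quantized MLP
```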
Why was a new apply_mlp_decode function added? It seems essentially identical to the existing apply_mlp function, except for the split_item and output_dtype fields.
This function is added in order to introduce the fused op npu_dequant_swiglu_quant, which is only used in the decode phase.
We previously used npu_dequant_swiglu_quant in apply_mlp, triggered under certain conditions. @realliujiaxu knows the specific details; maybe discuss it with him?
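For context on what such a fused kernel buys, here is an unfused reference of the dequantize -> SwiGLU -> requantize pipeline in plain PyTorch. This is only an illustration with assumed shapes and per-token dynamic int8 quantization; it does not reflect the actual signature or numerics of torch_npu's npu_dequant_swiglu_quant.

```python
import torch
import torch.nn.functional as F


def dequant_swiglu_quant_reference(gemm_out_int32: torch.Tensor,
                                   weight_scale: torch.Tensor,
                                   per_token_scale: torch.Tensor):
    # 1) Dequantize the int32 GEMM accumulator using an assumed per-channel
    #    weight scale and per-token activation scale.
    x = gemm_out_int32.to(torch.float32) * weight_scale * per_token_scale

    # 2) SwiGLU: the last dimension is assumed to hold [gate, up] halves.
    gate, up = x.chunk(2, dim=-1)
    activated = F.silu(gate) * up

    # 3) Per-token dynamic re-quantization to int8 for the down projection.
    scale = activated.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    quantized = torch.clamp(torch.round(activated / scale), -128, 127).to(torch.int8)
    return quantized, scale
```

Fusing these steps into one kernel avoids materializing the intermediate float tensor between the up and down projections, which is presumably where the decode-phase speedup comes from.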
log2phy: torch.Tensor = None,
global_redundant_expert_num: int = 0,
shared_experts: Optional[Any] = None,
quantized_x_for_share: Optional[Any] = None,
This comment was marked as resolved.
global_redundant_expert_num=global_redundant_expert_num,
shared_experts=shared_experts)
shared_experts=shared_experts,
quantized_x_for_share=shared_gate_up,
This comment was marked as resolved.
8441109 to c47b811
Better to rebase after #1653.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: whx-sjtu <2952154980@qq.com>
…t#1891) This PR designs the shared expert multi-stream parallelism of w8a8-dynamic-quantized MoE stage in more detail to achieve better performance. - vLLM version: v0.10.0 - vLLM main: vllm-project/vllm@2cc5711 Signed-off-by: whx-sjtu <2952154980@qq.com>
…t#1891) This PR designs the shared expert multi-stream parallelism of w8a8-dynamic-quantized MoE stage in more detail to achieve better performance. - vLLM version: v0.10.0 - vLLM main: vllm-project/vllm@2cc5711 Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
This PR refines the design of the shared expert multi-stream parallelism in the w8a8-dynamic-quantized MoE stage to achieve better performance.



The current multi-stream parallelism for shared experts is shown in the following picture (figure omitted).
Performance change:
Before: (figure omitted)
After: (figure omitted)
This PR ports PR #1561 of v0.9.1-dev to main.
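To make the overlap pattern described above concrete, here is a minimal sketch of running the shared expert on a secondary stream while the routed experts run on the main stream. routed_experts_fn and shared_expert_fn are hypothetical placeholders, and CUDA streams are used as a stand-in for the NPU streams actually used in vllm-ascend.

```python
import torch


def fused_moe_with_shared_expert(hidden_states: torch.Tensor,
                                 routed_experts_fn,
                                 shared_expert_fn) -> torch.Tensor:
    side_stream = torch.cuda.Stream()
    # The side stream must first wait for the producer of hidden_states.
    side_stream.wait_stream(torch.cuda.current_stream())

    # Shared expert runs on the side stream while the main stream handles
    # token dispatch, the routed expert GEMMs, and the combine step.
    with torch.cuda.stream(side_stream):
        shared_out = shared_expert_fn(hidden_states)

    routed_out = routed_experts_fn(hidden_states)

    # Join the streams before mixing the two partial results.
    torch.cuda.current_stream().wait_stream(side_stream)
    return routed_out + shared_out
```

The intent, per the description, is to hide the shared expert's computation behind the routed experts' dispatch, GEMM, and combine phases instead of leaving it on the critical path.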