
Conversation

@whx-sjtu (Collaborator) commented Jul 19, 2025

This PR refines the design of the shared-expert multi-stream parallelism in the w8a8-dynamic-quantized MoE stage to achieve better performance.
The current multi-stream parallelism for shared experts is shown in the following diagram:
[image: multi-stream parallelism of shared experts]
Performance change:
Before:
[image: performance profile before this PR]
After:
[image: performance profile after this PR]

This PR ports PR #1561 from v0.9.1-dev to main.
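
The sketch below illustrates the general overlap pattern described above: the shared experts are launched on a secondary NPU stream while the routed experts run on the default stream, and the streams are joined before the outputs are combined. This is a minimal illustration, not the PR's code; it assumes torch_npu exposes a stream API mirroring torch.cuda (torch.npu.Stream, torch.npu.stream, wait_stream), and the two MLP helpers are placeholders.

```python
import torch
import torch_npu  # Ascend backend; assumed to provide torch.npu.* stream APIs

# Placeholders standing in for the routed-expert and shared-expert MLPs.
def routed_experts_forward(x: torch.Tensor, topk_ids: torch.Tensor) -> torch.Tensor:
    # dispatch -> grouped GEMMs -> combine (omitted)
    return x

def shared_experts_forward(x: torch.Tensor) -> torch.Tensor:
    # gate_up GEMM -> SwiGLU -> down GEMM (omitted)
    return torch.nn.functional.silu(x) * x

shared_stream = torch.npu.Stream()

def moe_forward(hidden_states: torch.Tensor, topk_ids: torch.Tensor) -> torch.Tensor:
    # Launch the shared experts on a secondary stream so their kernels
    # overlap with the routed-expert work issued on the default stream.
    shared_stream.wait_stream(torch.npu.current_stream())
    with torch.npu.stream(shared_stream):
        shared_out = shared_experts_forward(hidden_states)

    routed_out = routed_experts_forward(hidden_states, topk_ids)

    # Join the streams before combining the partial results.
    torch.npu.current_stream().wait_stream(shared_stream)
    return routed_out + shared_out
```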

codecov bot commented Jul 20, 2025

Codecov Report

❌ Patch coverage is 17.24138% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.72%. Comparing base (935e9d4) to head (86d7980).
⚠️ Report is 621 commits behind head on main.

Files with missing lines                     Patch %   Lines
vllm_ascend/quantization/w8a8_dynamic.py     5.00%     19 Missing ⚠️
vllm_ascend/ops/fused_moe.py                 37.50%    5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1891      +/-   ##
==========================================
- Coverage   73.85%   73.72%   -0.13%     
==========================================
  Files         103      103              
  Lines       11425    11450      +25     
==========================================
+ Hits         8438     8442       +4     
- Misses       2987     3008      +21     
Flag        Coverage Δ
unittests   73.72% <17.24%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown.


@wangxiyuan (Collaborator) commented:

  1. This is a performance-improvement change; no new e2e test is needed.
  2. The 3 changed files don't have UT files yet; unit tests for them are being worked on by other developers. We can ignore that in this PR.

@Yikun (Collaborator) left a comment:

also cc @jianzs @ApsarasX

@Yikun added the ready (read for review) label on Jul 24, 2025
dispose_tensor, get_fused_moe_state)


def apply_mlp_decode(hidden_states_wrapper: List[torch.Tensor],
Collaborator comment:

For the apply_mlp_decode function, the first parameter no longer needs to be an array. You can refer to existing apply_mlp implementations.

whx-sjtu (Author) replied:

Thanks, I will change it.
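
Roughly, the agreed change would turn the wrapper-based signature into a plain-tensor one, along these lines (a sketch only, not the final code; `**kwargs` stands in for the remaining weight/scale/group parameters, which are not shown in this thread):

```python
from typing import List

import torch

# Before: the hidden states arrive wrapped in a single-element list.
def apply_mlp_decode_old(hidden_states_wrapper: List[torch.Tensor], **kwargs):
    hidden_states = hidden_states_wrapper[0]  # unwrap before use
    ...

# After: take the tensor directly, mirroring the existing apply_mlp signature.
def apply_mlp_decode(hidden_states: torch.Tensor, **kwargs):
    ...
```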

dispose_tensor, get_fused_moe_state)


def apply_mlp_decode(hidden_states_wrapper: List[torch.Tensor],
Collaborator comment:

Why was a new apply_mlp_decode function added? It seems essentially identical to the existing apply_mlp function, except for the split_item and output_dtype fields.

whx-sjtu (Author) replied:

This function is added to introduce the fused op npu_dequant_swiglu_quant, which is only used in the decode phase.
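
For context, a minimal sketch of what the fused op does on the decode path; the keyword names follow my understanding of the torch_npu API and should be treated as assumptions rather than the exact call made by this PR:

```python
import torch
import torch_npu

def swiglu_int8_decode(gate_up_int32: torch.Tensor,
                       w1_scale: torch.Tensor,
                       pertoken_scale: torch.Tensor,
                       group_list: torch.Tensor):
    # One fused kernel dequantizes the int32 grouped-GEMM output, applies
    # SwiGLU, and re-quantizes per token for the int8 down projection,
    # replacing the separate dequant / swiglu / dynamic-quant kernels used
    # on the non-decode path.
    out_int8, out_scale = torch_npu.npu_dequant_swiglu_quant(
        x=gate_up_int32,
        weight_scale=w1_scale,
        activation_scale=pertoken_scale,
        group_index=group_list,
        activate_left=True,
        quant_mode=1,  # assumed: per-token dynamic quantization
    )
    return out_int8, out_scale
```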

Collaborator comment (quoting the reply above):

This function is added in order to introduce fused op npu_dequant_swiglu_quant which is only used in decode phase.

We previously used npu_dequant_swiglu_quant in apply_mlp, triggered under certain conditions.

@realliujiaxu knows the specific details. Discuss with him?

log2phy: torch.Tensor = None,
global_redundant_expert_num: int = 0,
shared_experts: Optional[Any] = None,
quantized_x_for_share: Optional[Any] = None,

This comment was marked as resolved.

     global_redundant_expert_num=global_redundant_expert_num,
-    shared_experts=shared_experts)
+    shared_experts=shared_experts,
+    quantized_x_for_share=shared_gate_up,

This comment was marked as resolved.

@whx-sjtu force-pushed the moe_ms_main branch 2 times, most recently from 8441109 to c47b811, on July 24, 2025 at 12:40
@Yikun (Collaborator) commented Jul 26, 2025

Better to rebase after #1653.

github-actions bot added the merge-conflicts label and removed the ready (read for review) label on Jul 28, 2025
github-actions bot commented:

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: whx-sjtu <2952154980@qq.com>
@jianzs merged commit b6a7f07 into vllm-project:main on Jul 29, 2025
25 checks passed
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Jul 30, 2025
…t#1891)

This PR designs the shared expert multi-stream parallelism of
w8a8-dynamic-quantized MoE stage in more detail to achieve better
performance.

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@2cc5711

Signed-off-by: whx-sjtu <2952154980@qq.com>
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Jul 30, 2025 (same commit message as above, with an additional Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>)
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025 (same commit message as above)
@whx-sjtu deleted the moe_ms_main branch on October 20, 2025 at 11:50
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025 (same commit message as above)
