[Refactor] Pre-transpose MoE weights for improved performance #2025

yiz-liu · 2025-07-26T01:54:44Z

What this PR does / why we need it?

Pre-transpose MoE weights for improved performance
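To illustrate the idea, here is a minimal pure-Python sketch, not the code from this PR. It assumes each expert weight is stored as `[out_features, in_features]` and that the forward pass computes `x @ W^T`; the class names `ExpertLazy` and `ExpertPreTransposed` are hypothetical. Transposing once at load time gives the same result as transposing on every forward call while removing the per-call transpose.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    """Return the transpose of a row-major matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply a (n x k) by b (k x m)."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


class ExpertLazy:
    """Hypothetical expert that transposes its weight on every forward call."""

    def __init__(self, w: Matrix) -> None:
        self.w = w  # assumed layout: [out_features, in_features]

    def forward(self, x: Matrix) -> Matrix:
        return matmul(x, transpose(self.w))


class ExpertPreTransposed:
    """Hypothetical expert that transposes its weight once, at load time."""

    def __init__(self, w: Matrix) -> None:
        self.w_t = transpose(w)  # stored as [in_features, out_features]

    def forward(self, x: Matrix) -> Matrix:
        return matmul(x, self.w_t)


w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # out_features=2, in_features=3
x = [[1.0, 0.0, -1.0]]                  # one token with three features

lazy_out = ExpertLazy(w).forward(x)
pre_out = ExpertPreTransposed(w).forward(x)
```

Both experts produce identical outputs; the difference is only in where the transpose cost is paid. In a real tensor library the pre-transposed weight would typically also be made contiguous so the kernel reads it in the layout it expects.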

Does this PR introduce any user-facing change?

None.

How was this patch tested?

No further test needed.

vLLM version: v0.9.2
vLLM main: vllm-project/vllm@97349fe

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

github-actions · 2025-08-02T01:51:20Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

wangxiyuan · 2025-08-19T02:56:31Z

please rebase to fix the merge conflict if this PR is still needed.

yiz-liu · 2025-08-28T12:13:12Z

Implemented in #2614.

… Graph (#2614)

### What this PR does / why we need it?

* **Unify execution paths:** Consolidates the quantized and non-quantized execution paths into a single `fused_experts` function, removing duplicated logic and making the control flow clearer and easier to maintain.
* **W8A8 dynamic quantization:** Adds support for W8A8 dynamic quantization inside the unified MoE kernel. Communication routines are updated to correctly handle dynamic quantization scales for activations.
* **Weight pre-processing:** Pre-transposes the `w13` and `w2` weight matrices (as implemented in PR #2025) so that quantized and non-quantized models follow the same code path for the MoE gating, up-projection, and down-projection operations.
* **All-to-all communication:** Adds an `all-to-all` collective communication pattern. For large token counts on modern hardware, `all-to-all` is more efficient than the previous `all-gather` strategy. However, `all-to-all` cannot be captured and replayed in a graph, because it involves multiple D2H operations that trigger synchronization and therefore raise errors during graph capture. `all-to-all` is used only when falling back to `compiled_graph_for_general_shape`.
* **Dynamic communication selection:** The model runner now selects the optimal MoE communication method (`mc2`, `allgather`, or `alltoall`) at runtime based on token count and the Ascend SoC version.
* **Limitation:** `all-gather` is not yet supported for quantized models, so some work remains on A2.

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

No further test cases needed.

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@d660c98

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
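The runtime selection described in the commit message above could look roughly like the following sketch. This is not the model runner's actual logic: the function name, the `small_batch_limit` threshold, the `soc_prefers_alltoall` flag (a stand-in for the Ascend SoC version check), and the choice of `mc2` for small batches are all assumptions made for illustration. Only two constraints come from the commit message: `all-to-all` is not used while capturing a graph, and `all-gather` is not supported for quantized models.

```python
from enum import Enum


class MoECommMethod(Enum):
    MC2 = "mc2"
    ALLGATHER = "allgather"
    ALLTOALL = "alltoall"


def select_moe_comm_method(
    num_tokens: int,
    *,
    soc_prefers_alltoall: bool,    # hypothetical stand-in for the SoC version check
    capturing_graph: bool,
    quantized: bool,
    small_batch_limit: int = 256,  # hypothetical threshold, not from the source
) -> MoECommMethod:
    """Pick a MoE communication method under the assumptions stated above."""
    # Assumption: small batches use mc2.
    if num_tokens <= small_batch_limit:
        return MoECommMethod.MC2
    # From the source: all-to-all cannot be captured and replayed in a graph.
    if soc_prefers_alltoall and not capturing_graph:
        return MoECommMethod.ALLTOALL
    # From the source: all-gather is not yet supported for quantized models.
    if quantized:
        raise NotImplementedError("all-gather is not supported for quantized models")
    return MoECommMethod.ALLGATHER
```

Keeping the decision in one pure function makes the two documented constraints easy to unit-test independently of the hardware.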

refactor: Pre-transpose MoE weights for improved performance

5472742

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

github-actions bot added the module:ops label Jul 26, 2025

github-actions bot added the merge-conflicts label Aug 2, 2025

yiz-liu closed this Aug 28, 2025

yiz-liu mentioned this pull request Aug 28, 2025

[3/N][Feat][Graph] Support all-to-all and quantized models with ACL Graph #2614

Merged

yiz-liu deleted the pre-transpose branch August 28, 2025 12:13
