[main][Prefill Perf] Optimize Quantized MoE Performance by Reducing All2All Communication #2195
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@           Coverage Diff            @@
##             main    #2195      +/-  ##
==========================================
+ Coverage   76.41%   76.72%   +0.30%
==========================================
  Files         113      111       -2
  Lines       12553    12516      -37
==========================================
+ Hits         9593     9603      +10
+ Misses       2960     2913      -47
==========================================
[main][Prefill Perf] Optimize Quantized MoE Performance by Reducing All2All Communication (vllm-project#2195)

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@7175817

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
This PR significantly optimizes performance for quantized Mixture of Experts (MoE) layers by changing the order of quantization and communication operations.
In the previous implementation, the `all2all` operation was performed on the unquantized `hidden_states` (in FP16/BF16) *before* quantization, resulting in substantial communication overhead. By performing quantization on each EP rank **first** and then sending the much smaller quantized data, we reduce the communication volume by nearly 50%.
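For illustration, here is a minimal sketch of the idea using plain `torch.distributed` collectives. The helper name `dispatch_quantized`, the `ep_group` handle, and the per-token int8 scheme are assumptions made for this example, not the actual vllm-ascend dispatch path (which uses Ascend/HCCL kernels and explicit per-rank split sizes):

```python
import torch
import torch.distributed as dist


def dispatch_quantized(hidden_states: torch.Tensor, ep_group) -> tuple[torch.Tensor, torch.Tensor]:
    """Quantize locally, then all2all the int8 payload (hypothetical helper).

    Before: all_to_all on BF16 activations -> 2 bytes per element on the wire.
    After:  all_to_all on int8 activations plus one FP32 scale per token
            -> roughly half the traffic for typical hidden sizes.
    """
    # Per-token dynamic quantization on the local EP rank.
    scale = hidden_states.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q_hidden = torch.round(hidden_states / scale).clamp(-128, 127).to(torch.int8)

    # Exchange the small quantized payload and its scales instead of BF16 data.
    # (Assumes dim 0 is evenly divisible by the EP world size; real code would
    # pass explicit input/output split sizes per rank.)
    q_out = torch.empty_like(q_hidden)
    scale_out = torch.empty_like(scale)
    dist.all_to_all_single(q_out, q_hidden, group=ep_group)
    dist.all_to_all_single(scale_out, scale, group=ep_group)
    return q_out, scale_out
```

Sending the scales as a separate, tiny collective keeps the sketch simple; a production kernel can fuse payload and scales into a single buffer.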
Additionally, this PR includes a minor optimization to cast `int` inputs to `float` for the `argsort` operation, forcing it to run on a faster NPU core instead of the AICPU.

These changes lead to a clear and significant performance gain in MoE quantization scenarios.
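A rough sketch of the `argsort` tweak follows; the helper name and the `topk_ids` input are illustrative, not the project's actual code:

```python
import torch


def sort_tokens_by_expert(topk_ids: torch.Tensor) -> torch.Tensor:
    """Return the permutation that groups token slots by routed expert id.

    An integer argsort may be dispatched to the slow AICPU path on Ascend
    devices; casting to float32 first keeps the sort on the faster NPU core.
    Expert ids are far below 2**24, so the float32 cast is exact and the
    resulting ordering is identical to sorting the integers directly.
    """
    return torch.argsort(topk_ids.view(-1).to(torch.float32))
```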