Conversation

@SlightwindSec SlightwindSec commented Aug 4, 2025

This PR significantly optimizes performance for quantized Mixture of Experts (MoE) layers by changing the order of quantization and communication operations.

In the previous implementation, the `all2all` operation was performed on unquantized `hidden_states` (in FP16/BF16) before quantization, resulting in substantial communication overhead. By performing quantization on each EP rank first and then sending the much smaller quantized data, we reduce the communication volume by nearly 50%.
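
For illustration only, a minimal sketch of the reordering (not the actual `w8a8_dynamic.py` code): `dynamic_quant_per_token` is a hypothetical stand-in for the NPU dynamic int8 quantization kernel, and plain `torch.distributed.all_to_all_single` stands in for the Ascend dispatch ops. It assumes an initialized process group and equal splits per rank.

```python
import torch
import torch.distributed as dist


def dynamic_quant_per_token(x: torch.Tensor):
    """Per-token symmetric int8 quantization (illustrative stand-in)."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(x / scale).clamp(-128, 127).to(torch.int8)
    return q, scale.to(torch.float32)


def dispatch_then_quant(hidden_states: torch.Tensor):
    # Previous order: exchange 2-byte FP16/BF16 activations across EP ranks...
    recv = torch.empty_like(hidden_states)
    dist.all_to_all_single(recv, hidden_states)
    # ...then quantize on the receiving rank.
    return dynamic_quant_per_token(recv)


def quant_then_dispatch(hidden_states: torch.Tensor):
    # New order: quantize locally first, then exchange 1-byte int8 data plus
    # the small per-token scales, roughly halving the all2all payload.
    q, scale = dynamic_quant_per_token(hidden_states)
    recv_q, recv_scale = torch.empty_like(q), torch.empty_like(scale)
    dist.all_to_all_single(recv_q, q)
    dist.all_to_all_single(recv_scale, scale)
    return recv_q, recv_scale
```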

Additionally, this PR includes a minor optimization that casts `int` inputs to `float` for the `argsort` operation, forcing it to run on a faster NPU core instead of the AICPU.
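
A minimal sketch of that cast (the function and tensor names are hypothetical, not the helpers actually touched by this PR):

```python
import torch


def sort_tokens_by_expert(expert_ids: torch.Tensor) -> torch.Tensor:
    # Integer argsort may be dispatched to the AICPU on Ascend devices;
    # casting the indices to float keeps the sort on the faster NPU core.
    return torch.argsort(expert_ids.float())
```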

These changes lead to a clear and significant performance gain in MoE quantization scenarios.

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
github-actions bot commented Aug 4, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description, to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

codecov bot commented Aug 5, 2025

Codecov Report

❌ Patch coverage is 96.87500% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.72%. Comparing base (ad366bf) to head (f55f351).
⚠️ Report is 615 commits behind head on main.

Files with missing lines                   Patch %   Lines
vllm_ascend/quantization/w8a8_dynamic.py   93.75%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2195      +/-   ##
==========================================
+ Coverage   76.41%   76.72%   +0.30%     
==========================================
  Files         113      111       -2     
  Lines       12553    12516      -37     
==========================================
+ Hits         9593     9603      +10     
+ Misses       2960     2913      -47     
Flag        Coverage Δ
unittests   76.72% <96.87%> (+0.30%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@ganyi1996ppo ganyi1996ppo merged commit f3b50c5 into vllm-project:main Aug 5, 2025
30 of 32 checks passed
zzhx1 pushed a commit to lidenghui1110/vllm-ascend that referenced this pull request Aug 11, 2025
…ll2All Communication (vllm-project#2195)

This PR significantly optimizes performance for quantized Mixture of
Experts (MoE) layers by changing the order of quantization and
communication operations.

In the previous implementation, the `all2all` operation was performed on
unquantized `hidden_states` (in FP16/BF16) *before* quantization,
resulting in substantial communication overhead. By performing
quantization on each EP rank **first** and then sending the much smaller
quantized data, we reduce the communication volume by nearly 50%.

Additionally, this PR includes a minor optimization to cast `int` inputs
to `float` for the `argsort` operation, forcing it to run on a faster
NPU core instead of the AICPU.

These changes lead to a clear and significant performance gain in MoE
quantization scenarios.

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@7175817

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
@SlightwindSec SlightwindSec deleted the upstream_main_moe branch October 13, 2025 01:42
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025