
Conversation

@SlightwindSec
Contributor

@SlightwindSec SlightwindSec commented Jul 30, 2025

  1. Shared Expert Sharding Strategy Update: Switched from TP-aligned to pure DP for shared experts, enabling more efficient execution.
  2. O_Proj AllReduce → ReduceScatter: Reduced communication overhead by using ReduceScatter, made possible by pure DP sharding.
  3. AllGather Postponed: Delayed until after the QKV down projection to reduce synchronization impact during prefill (a communication sketch for items 2 and 3 follows below).
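
A minimal sketch of what items 2 and 3 amount to, using plain `torch.distributed` rather than the actual vllm-ascend kernels. The function and tensor names are placeholders, the TP process group is assumed to be already initialized with a backend that supports reduce-scatter (e.g. HCCL/NCCL), and the token count is assumed divisible by the TP size:

```python
# Sketch only: plain torch.distributed, not the vllm-ascend implementation.
import torch
import torch.distributed as dist


def o_proj_finish_allreduce(partial_out: torch.Tensor, tp_group) -> torch.Tensor:
    """Before: with TP-aligned shared experts every rank needs the full hidden
    states, so the row-parallel o_proj output is combined with an all-reduce."""
    dist.all_reduce(partial_out, op=dist.ReduceOp.SUM, group=tp_group)
    return partial_out  # [num_tokens, hidden] on every rank


def o_proj_finish_reduce_scatter(partial_out: torch.Tensor, tp_group) -> torch.Tensor:
    """After: with pure-DP shared experts each rank only needs its own token
    shard next, so a reduce-scatter (sum + split along the token dim) suffices,
    roughly halving the traffic of a ring all-reduce."""
    tp_size = dist.get_world_size(group=tp_group)
    num_tokens, hidden = partial_out.shape
    shard = torch.empty(num_tokens // tp_size, hidden,
                        dtype=partial_out.dtype, device=partial_out.device)
    dist.reduce_scatter_tensor(shard, partial_out,
                               op=dist.ReduceOp.SUM, group=tp_group)
    return shard  # [num_tokens // tp_size, hidden] on each rank


def postponed_all_gather(local_latent: torch.Tensor, tp_group) -> torch.Tensor:
    """Item 3: the all-gather that restores full token coverage is delayed until
    after the QKV down projection, so during prefill it runs on the smaller
    latent tensor instead of the full hidden states."""
    tp_size = dist.get_world_size(group=tp_group)
    local_tokens, latent_dim = local_latent.shape
    full = torch.empty(local_tokens * tp_size, latent_dim,
                       dtype=local_latent.dtype, device=local_latent.device)
    dist.all_gather_into_tensor(full, local_latent, group=tp_group)
    return full  # [num_tokens, latent_dim] on every rank
```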

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

ganyi1996ppo pushed a commit that referenced this pull request Jul 30, 2025
(#2090)

### What this PR does / why we need it?

Prefill optimization compatible with torchair graph mode.
The prefill optimization points are divided into two parts: those that support graph mode and those that do not. This way, in graph mode, decode performance is unaffected while most of the prefill optimization benefits are still obtained.

This PR will be cherry-picked to main in
[PR #2110](#2110).
### Does this PR introduce _any_ user-facing change?


### How was this patch tested?

Online example:
```shell
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=800 
python -m vllm.entrypoints.openai.api_server --model=`model_path` --trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002 --max-model-len 5120 --max-num-batched-tokens 16384 --disable-log-requests --additional_config='{"enable_prefill_optimizations": true,"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true},"chunked_prefill_for_mla":true}'
```

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
@jianzs
Collaborator

jianzs commented Jul 31, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@SlightwindSec take a look at this hint

@codecov

codecov bot commented Jul 31, 2025

Codecov Report

❌ Patch coverage is 16.96429% with 93 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.26%. Comparing base (8cf97d8) to head (204da87).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| vllm_ascend/models/deepseek_v2.py | 21.73% | 36 Missing ⚠️ |
| vllm_ascend/quantization/w8a8_dynamic.py | 3.12% | 31 Missing ⚠️ |
| vllm_ascend/attention/mla_v1.py | 8.33% | 22 Missing ⚠️ |
| vllm_ascend/ops/fused_moe.py | 55.55% | 4 Missing ⚠️ |

❌ Your patch check has failed because the patch coverage (16.96%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #2110      +/-   ##
==========================================
- Coverage   76.67%   76.26%   -0.42%
==========================================
  Files         107      107
  Lines       11968    12035      +67
==========================================
+ Hits         9177     9179       +2
- Misses       2791     2856      +65
```

| Flag | Coverage Δ |
|---|---|
| unittests | 76.26% <16.96%> (-0.42%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@github-actions

github-actions bot commented Aug 1, 2025

This pull request has conflicts; please resolve them before we can evaluate the pull request.

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
@SlightwindSec
Contributor Author

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@SlightwindSec take a look at this hint

OK

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
ganyi1996ppo pushed a commit that referenced this pull request Aug 1, 2025
### What this PR does / why we need it?
1. Delete the fused_experts_with_all2all_v2 method and merge the prefill optimization points into the alltoall method.
2. Integrate D2H & initRoutingQuantV2.

This PR will be cherry-picked to main in
#2110.

### Does this PR introduce _any_ user-facing change?


### How was this patch tested?

Tested initRoutingQuantV2 with the PTA PoC package.
PTA PoC package link:
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com:443/pta/personal/cache/pytorch/v7.1.0-pytorch2.5.1-vllm/pytorchv7.1.0-pytorch2.5.1-vllm_3.11_aarch64.tar.gz

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
@github-actions

github-actions bot commented Aug 2, 2025

This pull request has conflicts; please resolve them before we can evaluate the pull request.

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Collaborator

@jianzs jianzs left a comment


This PR won’t be merged. Please split the features into separate PRs and avoid adding on/off flags like enable_prefill_optimization.

@SlightwindSec
Contributor Author

SlightwindSec commented Aug 4, 2025

This PR won’t be merged. Please split the features into separate PRs and avoid adding on/off flags like enable_prefill_optimization.

Thanks for the feedback. I can split the PR into separate parts as suggested. However, I believe the enable_prefill_optimization flag is necessary, as it allows users to choose between higher performance with increased memory usage (shared experts in DP) and lower memory usage with slightly lower performance (shared experts in TP). It's a trade-off between memory and speed that some users may want to control explicitly.
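
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch; the dimensions below are placeholders for illustration, not the real model config:

```python
# Illustrative only: placeholder dimensions, bf16 weights, activations ignored.
hidden_size = 5120          # hypothetical hidden dim
moe_intermediate = 1536     # hypothetical shared-expert intermediate dim
n_shared_experts = 2
bytes_per_param = 2         # bf16
tp_size = 8

# gate_proj + up_proj + down_proj of one shared expert
params_per_expert = 3 * hidden_size * moe_intermediate
total_bytes = n_shared_experts * params_per_expert * bytes_per_param

per_rank_tp = total_bytes / tp_size  # TP: shared-expert weights split across the TP group
per_rank_dp = total_bytes            # pure DP: every rank holds a full replica

print(f"shared-expert weights per rank, TP-{tp_size}: {per_rank_tp / 2**20:.1f} MiB")
print(f"shared-expert weights per rank, pure DP:      {per_rank_dp / 2**20:.1f} MiB")
```

In exchange for replicating the shared-expert weights on every rank, the TP communication around the shared experts is avoided, which is what enables the ReduceScatter change described in the PR summary.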

@SlightwindSec
Contributor Author

This PR won’t be merged. Please split the features into separate PRs and avoid adding on/off flags like enable_prefill_optimization.

I've split out the MoE-related optimization (without the flag) into a separate PR: #2195 for easier review. Please take a look when you have time. Thanks!

@kunpengW-code
Contributor

This PR won’t be merged. Please split the features into separate PRs and avoid adding on/off flags like enable_prefill_optimization.

I've split out the MoE-related optimization (without the flag) into a separate PR: #2195 for easier review. Please take a look when you have time. Thanks!

And we split out the parallel-strategy optimizations (with the flag) into a separate PR: #2198. Changing the shared experts from TP sharding to DP sharding improves performance but also increases memory usage, so we added an enable_shared_expert_dp flag that is on by default; users who are more sensitive to memory can turn it off manually. Please take a look when you have time. Thanks! @jianzs
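
If the key lands as described, a memory-sensitive deployment could turn the switch off through --additional_config, mirroring the launch command earlier in this thread (the exact key name follows PR #2198 and is an assumption here):

```shell
# Hypothetical example: disable DP shared experts to trade some prefill speed for memory
python -m vllm.entrypoints.openai.api_server --model=`model_path` --trust-remote-code \
  -tp 8 -dp 2 --enable_expert_parallel \
  --additional_config='{"enable_shared_expert_dp": false}'
```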

@SlightwindSec SlightwindSec deleted the upstream_main_dev branch October 13, 2025 01:42