
Conversation

@SlightwindSec
Contributor

@SlightwindSec SlightwindSec commented Jul 30, 2025

  1. Shared Expert Sharding Strategy Update: Switched from TP-aligned to pure DP for shared experts, enabling more efficient execution.
  2. O_Proj AllReduce → ReduceScatter: Reduced communication overhead by using ReduceScatter, made possible by pure DP sharding.
  3. AllGather Postponed: Delayed until after the QKV down projection to reduce synchronization impact during prefill (a communication sketch for items 2 and 3 follows below).
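
A minimal sketch of what items 2 and 3 amount to, using plain `torch.distributed` rather than the actual vllm-ascend kernels. The function and tensor names are placeholders, the TP process group is assumed to be already initialized with a backend that supports reduce-scatter (e.g. HCCL/NCCL), and the token count is assumed divisible by the TP size:

```python
# Sketch only: plain torch.distributed, not the vllm-ascend implementation.
import torch
import torch.distributed as dist


def o_proj_finish_allreduce(partial_out: torch.Tensor, tp_group) -> torch.Tensor:
    """Before: with TP-aligned shared experts every rank needs the full hidden
    states, so the row-parallel o_proj output is combined with an all-reduce."""
    dist.all_reduce(partial_out, op=dist.ReduceOp.SUM, group=tp_group)
    return partial_out  # [num_tokens, hidden] on every rank


def o_proj_finish_reduce_scatter(partial_out: torch.Tensor, tp_group) -> torch.Tensor:
    """After: with pure-DP shared experts each rank only needs its own token
    shard next, so a reduce-scatter (sum + split along the token dim) suffices,
    roughly halving the traffic of a ring all-reduce."""
    tp_size = dist.get_world_size(group=tp_group)
    num_tokens, hidden = partial_out.shape
    shard = torch.empty(num_tokens // tp_size, hidden,
                        dtype=partial_out.dtype, device=partial_out.device)
    dist.reduce_scatter_tensor(shard, partial_out,
                               op=dist.ReduceOp.SUM, group=tp_group)
    return shard  # [num_tokens // tp_size, hidden] on each rank


def postponed_all_gather(local_latent: torch.Tensor, tp_group) -> torch.Tensor:
    """Item 3: the all-gather that restores full token coverage is delayed until
    after the QKV down projection, so during prefill it runs on the smaller
    latent tensor instead of the full hidden states."""
    tp_size = dist.get_world_size(group=tp_group)
    local_tokens, latent_dim = local_latent.shape
    full = torch.empty(local_tokens * tp_size, latent_dim,
                       dtype=local_latent.dtype, device=local_latent.device)
    dist.all_gather_into_tensor(full, local_latent, group=tp_group)
    return full  # [num_tokens, latent_dim] on every rank
```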

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

ganyi1996ppo pushed a commit that referenced this pull request Jul 30, 2025
(#2090)

### What this PR does / why we need it?

Prefill optimization compatible with torchair graph mode.
The prefill optimization points are divided into two parts: those that support graph mode and those that do not. This way, in graph mode, decode performance is unaffected while most of the prefill optimization benefits are still obtained.

This PR will be cherry-picked to main in
[PR #2110](#2110).
### Does this PR introduce _any_ user-facing change?


### How was this patch tested?

Online example:
```shell
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=800 
python -m vllm.entrypoints.openai.api_server --model=`model_path` --trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002 --max-model-len 5120 --max-num-batched-tokens 16384 --disable-log-requests --additional_config='{"enable_prefill_optimizations": true,"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true},"chunked_prefill_for_mla":true}'
```

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
@jianzs
Collaborator

jianzs commented Jul 31, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@SlightwindSec take a look at this hint

@codecov

codecov bot commented Jul 31, 2025

Codecov Report

❌ Patch coverage is 16.96429% with 93 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.26%. Comparing base (8cf97d8) to head (204da87).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| vllm_ascend/models/deepseek_v2.py | 21.73% | 36 Missing ⚠️ |
| vllm_ascend/quantization/w8a8_dynamic.py | 3.12% | 31 Missing ⚠️ |
| vllm_ascend/attention/mla_v1.py | 8.33% | 22 Missing ⚠️ |
| vllm_ascend/ops/fused_moe.py | 55.55% | 4 Missing ⚠️ |

❌ Your patch check has failed because the patch coverage (16.96%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #2110      +/-   ##
==========================================
- Coverage   76.67%   76.26%   -0.42%
==========================================
  Files         107      107
  Lines       11968    12035      +67
==========================================
+ Hits         9177     9179       +2
- Misses       2791     2856      +65
```

| Flag | Coverage Δ |
|---|---|
| unittests | 76.26% <16.96%> (-0.42%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@github-actions

github-actions bot commented Aug 1, 2025

This pull request has conflicts; please resolve them before we can evaluate the pull request.

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
@SlightwindSec
Contributor Author

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@SlightwindSec take a look at this hint

OK

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
ganyi1996ppo pushed a commit that referenced this pull request Aug 1, 2025
### What this PR does / why we need it?
1. Delete the fused_experts_with_all2all_v2 method and merge the prefill optimization points into the alltoall method.
2. Integrate D2H & initRoutingQuantV2.

This PR will be cherry-picked to main in
#2110.

### Does this PR introduce _any_ user-facing change?


### How was this patch tested?

Tested initRoutingQuantV2 with the PTA PoC package.
PTA PoC package link:
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com:443/pta/personal/cache/pytorch/v7.1.0-pytorch2.5.1-vllm/pytorchv7.1.0-pytorch2.5.1-vllm_3.11_aarch64.tar.gz

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
@github-actions

github-actions bot commented Aug 2, 2025

This pull request has conflicts; please resolve them before we can evaluate the pull request.

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Collaborator

@jianzs jianzs left a comment


This PR won’t be merged. Please split the features into separate PRs and avoid adding on/off flags like enable_prefill_optimization.

@SlightwindSec
Contributor Author

SlightwindSec commented Aug 4, 2025

This PR won’t be merged. Please split the features into separate PRs and avoid adding on/off flags like enable_prefill_optimization.

Thanks for the feedback. I can split the PR into separate parts as suggested. However, I believe the enable_prefill_optimization flag is necessary, as it allows users to choose between higher performance with increased memory usage (shared experts in DP) and lower memory usage with slightly lower performance (shared experts in TP). It's a trade-off between memory and speed that some users may want to control explicitly.
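
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch; the dimensions below are placeholders for illustration, not the real model config:

```python
# Illustrative only: placeholder dimensions, bf16 weights, activations ignored.
hidden_size = 5120          # hypothetical hidden dim
moe_intermediate = 1536     # hypothetical shared-expert intermediate dim
n_shared_experts = 2
bytes_per_param = 2         # bf16
tp_size = 8

# gate_proj + up_proj + down_proj of one shared expert
params_per_expert = 3 * hidden_size * moe_intermediate
total_bytes = n_shared_experts * params_per_expert * bytes_per_param

per_rank_tp = total_bytes / tp_size  # TP: shared-expert weights split across the TP group
per_rank_dp = total_bytes            # pure DP: every rank holds a full replica

print(f"shared-expert weights per rank, TP-{tp_size}: {per_rank_tp / 2**20:.1f} MiB")
print(f"shared-expert weights per rank, pure DP:      {per_rank_dp / 2**20:.1f} MiB")
```

In exchange for replicating the shared-expert weights on every rank, the TP communication around the shared experts is avoided, which is what enables the ReduceScatter change described in the PR summary.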

@SlightwindSec
Contributor Author

This PR won’t be merged. Please split the features into separate PRs and avoid adding on/off flags like enable_prefill_optimization.

I've split out the MoE-related optimization (without the flag) into a separate PR: #2195 for easier review. Please take a look when you have time. Thanks!

@kunpengW-code
Contributor

This PR won’t be merged. Please split the features into separate PRs and avoid adding on/off flags like enable_prefill_optimization.

I've split out the MoE-related optimization (without the flag) into a separate PR: #2195 for easier review. Please take a look when you have time. Thanks!

And we split out the parallel-strategy optimizations (with the flag) into a separate PR: #2198. Changing the shared experts from TP sharding to DP sharding improves performance but also increases memory usage, so we added an enable_shared_expert_dp flag that is on by default; users who are more sensitive to memory can turn it off manually. Please take a look when you have time. Thanks! @jianzs
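
If the key lands as described, a memory-sensitive deployment could turn the switch off through --additional_config, mirroring the launch command earlier in this thread (the exact key name follows PR #2198 and is an assumption here):

```shell
# Hypothetical example: disable DP shared experts to trade some prefill speed for memory
python -m vllm.entrypoints.openai.api_server --model=`model_path` --trust-remote-code \
  -tp 8 -dp 2 --enable_expert_parallel \
  --additional_config='{"enable_shared_expert_dp": false}'
```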

@SlightwindSec SlightwindSec deleted the upstream_main_dev branch October 13, 2025 01:42