[WIP][main][Prefill Perf] Parallel Strategy Optimizations #2110
Conversation
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
From #2090:

### What this PR does / why we need it?
Prefill optimization compatible with torchair graph mode.

Divide the prefill optimization points into two parts: those that support graph mode and those that do not. This way, in graph mode, decode performance is not affected while most of the prefill optimization benefits are still obtained. This PR will be cherry-picked to main in [pr2110](#2110).

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Online example:

```shell
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=800
python -m vllm.entrypoints.openai.api_server \
  --model=`model_path` \
  --trust-remote-code \
  -tp 8 -dp 2 \
  --enable_expert_parallel \
  --port 8002 \
  --max-model-len 5120 \
  --max-num-batched-tokens 16384 \
  --disable-log-requests \
  --additional_config='{"enable_prefill_optimizations": true,"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true},"chunked_prefill_for_mla":true}'
```

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
@SlightwindSec take a look at this hint
Codecov Report

❌ Patch coverage is 16.96%, which is below the target coverage of 100.00%. You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #2110      +/-   ##
==========================================
- Coverage   76.67%   76.26%   -0.42%
==========================================
  Files         107      107
  Lines       11968    12035      +67
==========================================
+ Hits         9177     9179       +2
- Misses       2791     2856      +65
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
OK
### What this PR does / why we need it?
1. Delete the fused_experts_with_all2all_v2 method and combine its prefill optimization points into the alltoall method.
2. Add access to D2H & initRoutingQuantV2.

This PR will be cherry-picked to main in #2110.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Used initRoutingQuantV2 with the PTA PoC package. PTA PoC package link:
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com:443/pta/personal/cache/pytorch/v7.1.0-pytorch2.5.1-vllm/pytorchv7.1.0-pytorch2.5.1-vllm_3.11_aarch64.tar.gz

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
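For reviewers unfamiliar with the alltoall path, here is a minimal, illustrative sketch of the generic all-to-all token dispatch pattern that expert-parallel MoE layers use, written with plain torch.distributed. It is not the code in this PR (which targets Ascend NPU ops such as initRoutingQuantV2), and the helper name `alltoall_dispatch` is made up for the example:

```python
# Illustrative only: generic all-to-all token dispatch for expert parallelism.
# Not the vllm-ascend implementation; alltoall_dispatch is a hypothetical helper.
import torch
import torch.distributed as dist

def alltoall_dispatch(hidden_states: torch.Tensor,
                      tokens_per_rank: torch.Tensor,
                      group=None) -> torch.Tensor:
    """Send each rank the tokens routed to its experts; return the tokens we receive.

    hidden_states: [num_tokens, hidden], already sorted by destination rank.
    tokens_per_rank: int64 [world_size], how many of our tokens go to each rank.
    """
    # 1) Exchange split sizes so every rank knows how many tokens it will receive.
    recv_per_rank = torch.empty_like(tokens_per_rank)
    dist.all_to_all_single(recv_per_rank, tokens_per_rank, group=group)

    # 2) .tolist() is a device-to-host copy; a D2H sync of this kind is typically
    #    needed here to size the communication buffers.
    input_splits = tokens_per_rank.tolist()
    output_splits = recv_per_rank.tolist()

    # 3) Exchange the token embeddings themselves.
    recv_buf = hidden_states.new_empty((sum(output_splits), hidden_states.shape[-1]))
    dist.all_to_all_single(recv_buf, hidden_states,
                           output_split_sizes=output_splits,
                           input_split_sizes=input_splits,
                           group=group)
    # Local experts now process recv_buf; a mirrored all-to-all sends results back.
    return recv_buf
```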
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
This PR won’t be merged. Please split the features into separate PRs and avoid adding on/off flags like enable_prefill_optimization.
Thanks for the feedback. I can split the PR into separate parts as suggested. However, I believe the
I've split out the MoE-related optimization (without the flag) into a separate PR, #2195, for easier review. Please take a look when you have time. Thanks!
And we split out the parallel strategy optimizations (with the flag) into a separate PR: #2198. Changing the shared experts from TP splitting to DP splitting improves performance but also increases memory usage, so we added an enable_shared_expert_dp flag that is turned on by default. When the user is more sensitive to memory, this switch can be turned off manually. Please take a look when you have time. Thanks! @jianzs
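As a concrete illustration (not taken from #2198, so the exact key name and default may differ there), the switch would presumably be toggled through --additional_config like the other options in this thread, trading some prefill performance back for memory:

```shell
# Hypothetical example: disable shared-expert DP splitting when memory is tight.
# Assumes enable_shared_expert_dp is read from --additional_config like the flags above;
# check #2198 for the actual key name and default.
python -m vllm.entrypoints.openai.api_server \
  --model=`model_path` --trust-remote-code -tp 8 -dp 2 --enable_expert_parallel \
  --additional_config='{"enable_shared_expert_dp": false, "torchair_graph_config": {"enabled": true}}'
```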