[2/N][Refactor] Refactor V1 attention for better extensibility #1995
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1995 +/- ##
==========================================
- Coverage 76.31% 76.28% -0.03%
==========================================
Files 116 117 +1
Lines 13238 13278 +40
==========================================
+ Hits 10102 10129 +27
- Misses 3136 3149 +13
Force-pushed from e5eff76 to 8915a39.
Let's wait for #1979 to be merged first.
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
@wangxiyuan The CI has passed. |
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
Signed-off-by: shen-shanshan <467638484@qq.com>
@wangxiyuan I have adjusted this PR and kept only the methods that need to be overridden by child classes. I will handle the rest in later PRs.
@wangxiyuan The CI has passed. Can this be merged?
… MoE layers (#3)

* feat(performance): support `GroupedMatmulSwigluQuant` in `W8A8_DYNAMIC` quantized MoE layers
* fix(lint): fix lint
* fix(bug): fix bug
* feat(ops): enable grouped_matmul_swiglu_quant by default
* fix(test): fix broken test
* fix(test): temporarily skip broken test due to OOM
* fix(test): change bias1 to tensor
* fix(bug): update group_list handling and weight scale in dynamic methods
* feat(ops): replace all split gmm and swiglu
* feat(quantization): split w4a8 and w8a8 apply
* fix(test): replace w8a8 function in apply
* feat(cumsum): add cumsum_group_list function for group list processing
* [Doc] Add container image save/load FAQ for offline environments (vllm-project#2347)
  Adds a Docker export/import guide for air-gapped environments. No user-facing change; not tested (documentation only). vLLM version: v0.10.0; vLLM main: vllm-project/vllm@d16aa3d
  Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
* [Bugfix] fix the OOM when chunked prefill is used with long contexts such as 64k (vllm-project#2319)
  The attention mask was declared in mla.py; the splitfuse mask is not needed for MLA chunked prefill, and it causes memory problems with long contexts such as 64k or 128k. vLLM version: v0.10.0; vLLM main: vllm-project/vllm@14a5d90
  Signed-off-by: haojiangzheng <justineric096@gmail.com>
* [Quickfix] Add the missing `apply_router_weight_on_input` in FusedMoE init (vllm-project#2348)
  Quick fix on vllm-project#2268 (comment). No user-facing change; CI passed with existing tests. vLLM version: v0.10.0; vLLM main: vllm-project/vllm@6807af8
  Signed-off-by: MengqingCao <cmq0113@163.com>
* [2/N][Refactor] Refactor V1 attention for better extensibility (vllm-project#1995)
  Refactor V1 attention for better extensibility (in preparation for the torchair attention refactor). Main change: move each kind of forward into its own method, e.g. `_forward_prefill_no_cache()`, `_forward_prefill_cache_hit()`, `_forward_decode_only()`, `_forward_v1_style()`. No user-facing change. vLLM version: v0.10.0; vLLM main: vllm-project/vllm@14a5d90
  Signed-off-by: shen-shanshan <467638484@qq.com>
* [Misc] Remove redundant imported `envs`, using `envs_ascend` instead (vllm-project#2193)

  ```python
  import vllm.envs as envs_vllm
  import vllm_ascend.envs as envs_ascend
  ```

  vLLM version: v0.10.0; vLLM main: vllm-project/vllm@71683ca
  Signed-off-by: shen-shanshan <467638484@qq.com>
* feat(torchair): consider not using gmmswigluquant when torchair is enabled
* fix(dtype): unify `w1_scale` dtype

---------

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Signed-off-by: haojiangzheng <justineric096@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: jack <QwertyJack@users.noreply.github.com>
Co-authored-by: zhenghaojiang <zhjoneson@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
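As a rough illustration of what the `GroupedMatmulSwigluQuant` fusion in the commits above computes, here is a hypothetical, unfused reference in plain NumPy: a per-expert (grouped) matmul, a SwiGLU activation, then per-token dynamic int8 quantization. The function name, shapes, and quantization scheme here are assumptions for illustration, not the contract of the actual NPU operator.

```python
import numpy as np

def swiglu(x):
    # Split the projection into gate and up halves; apply SiLU to the gate.
    gate, up = np.split(x, 2, axis=-1)
    return up * (gate / (1.0 + np.exp(-gate)))  # SiLU(gate) * up

def grouped_matmul_swiglu_quant(tokens, weights, group_list):
    """Unfused sketch: tokens [N, D], per-expert weights [E, D, 2H],
    group_list = cumulative token counts per expert (tokens sorted by expert)."""
    outs, scales, start = [], [], 0
    for expert, end in enumerate(group_list):
        h = swiglu(tokens[start:end] @ weights[expert])    # grouped matmul + activation
        s = np.abs(h).max(axis=-1, keepdims=True) / 127.0  # per-token dynamic scale
        outs.append(np.round(h / np.maximum(s, 1e-8)).astype(np.int8))
        scales.append(s)
        start = end
    return np.concatenate(outs), np.concatenate(scales)
```

A fused kernel would perform these three steps in one pass per expert group, avoiding the intermediate fp tensors that this reference materializes.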
What this PR does / why we need it?
Refactor V1 Attention for better extensibility (prepared for torchair attention refactor).
Main changes:
- Move each kind of forward into its own method, e.g. `_forward_prefill_no_cache()`, `_forward_prefill_cache_hit()`, `_forward_decode_only()`, `_forward_v1_style()`.
Does this PR introduce any user-facing change?
No.
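The split described above can be sketched as a dispatch pattern: `forward()` only routes on the attention state, and each case lives in its own overridable method. This is a minimal, hypothetical sketch; the `_forward_*` names come from the PR, but `AttnState`, the class names, and the string return values are illustrative stand-ins, not the actual vllm-ascend code.

```python
from enum import Enum, auto
from types import SimpleNamespace

class AttnState(Enum):
    PREFILL_NO_CACHE = auto()
    PREFILL_CACHE_HIT = auto()
    DECODE_ONLY = auto()
    V1_STYLE = auto()

class AttentionImpl:
    """forward() only dispatches; each attention case has its own method."""
    def forward(self, q, k, v, meta):
        handlers = {
            AttnState.PREFILL_NO_CACHE: self._forward_prefill_no_cache,
            AttnState.PREFILL_CACHE_HIT: self._forward_prefill_cache_hit,
            AttnState.DECODE_ONLY: self._forward_decode_only,
        }
        return handlers.get(meta.attn_state, self._forward_v1_style)(q, k, v, meta)

    def _forward_prefill_no_cache(self, q, k, v, meta):
        return "prefill_no_cache"
    def _forward_prefill_cache_hit(self, q, k, v, meta):
        return "prefill_cache_hit"
    def _forward_decode_only(self, q, k, v, meta):
        return "decode_only"
    def _forward_v1_style(self, q, k, v, meta):
        return "v1_style"

class TorchairAttentionImpl(AttentionImpl):
    # A child class overrides only the branch it needs to change.
    def _forward_decode_only(self, q, k, v, meta):
        return "torchair_decode_only"
```

This is why extensibility improves: a torchair backend can subclass and replace a single `_forward_*` method instead of re-implementing one monolithic forward.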
How was this patch tested?
unit test:
e2e test: