Conversation

@shen-shanshan shen-shanshan commented Jul 24, 2025

What this PR does / why we need it?

Refactor V1 Attention for better extensibility (prepared for torchair attention refactor).

Main changes:

  • Move each kind of forward pass into its own method, e.g., _forward_prefill_no_cache(), _forward_prefill_cache_hit(), _forward_decode_only(), _forward_v1_style().

Does this PR introduce any user-facing change?

No.

How was this patch tested?

unit test:

pytest tests/ut/attention/test_attention_v1.py::TestAscendAttentionBackendImpl
========================================================================== test session starts ===========================================================================
platform linux -- Python 3.10.18, pytest-8.4.1, pluggy-1.6.0
rootdir: /home/sss/github/vllm-ascend
configfile: pyproject.toml
plugins: cov-6.2.1, asyncio-1.1.0, anyio-4.9.0, mock-3.14.1
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 10 items                                                                                                                                                       

tests/ut/attention/test_attention_v1.py ..........                                                                                                                 [100%]

============================================================================ warnings summary ============================================================================
../../miniconda3/envs/vllm/lib/python3.10/site-packages/torch_npu/dynamo/torchair/__init__.py:8
  /home/sss/miniconda3/envs/vllm/lib/python3.10/site-packages/torch_npu/dynamo/torchair/__init__.py:8: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================== 10 passed, 1 warning in 0.43s ======================================================================

e2e test:

python examples/offline_inference_npu.py
Outputs:
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 75.07it/s]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.05s/it, est. speed input: 5.26 toks/s, output: 95.57 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Dr. David M. Bader, and I am a board-certified orthopedic surgeon with a subspecialty in sports medicine. I am a member of the American Academy of Orthopedic Surgeons, the American Orthopedic Society for Sports Medicine, and the American Medical Society for Sports Medicine. I am also a member of the American College of Sports Medicine and the American College of Surgeons. I am a fellow of the American Academy of Orthopedic Surgeons and the American Orth'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States. The president directs the executive branch of the federal government and is the commander-in-chief of the United States Armed Forces. The president is further empowered to appoint federal judges, including members of the Supreme Court, subject to Senate approval. The president is also responsible for the enforcement of federal law and may grant federal pardons and reprieves. The president is further empowered to make treaties, subject to Senate ratification, and to receive foreign ambassadors'
Prompt: 'The capital of France is', Generated text: " Paris. Which of the following statements is true?\nA. Paris is the capital of France.\nB. Paris is not the capital of France.\nC. Paris is the capital of Germany.\nD. Paris is the capital of Italy.\nTo determine which statement is true, let's analyze each option step by step:\n\nA. Paris is the capital of France.\n- This statement is true. Paris is indeed the capital of France.\n\nB. Paris is not the capital of France.\n- This statement is"
Prompt: 'The future of AI is', Generated text: ' here. It’s not just a buzzword or a distant dream. It’s a reality that’s transforming the way we live, work, and interact with technology. From self-driving cars to virtual assistants, AI is revolutionizing industries and creating new opportunities. But with great power comes great responsibility. As AI becomes more advanced, it’s crucial to consider the ethical implications and ensure that it’s used for the betterment of society. In this article, we’ll explore the current state of AI, its'


codecov bot commented Jul 24, 2025

Codecov Report

❌ Patch coverage is 86.27451% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.28%. Comparing base (1ab1541) to head (e55c60f).
⚠️ Report is 16 commits behind head on main.

Files with missing lines Patch % Lines
vllm_ascend/attention/attention_v1.py 86.27% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1995      +/-   ##
==========================================
- Coverage   76.31%   76.28%   -0.03%     
==========================================
  Files         116      117       +1     
  Lines       13238    13278      +40     
==========================================
+ Hits        10102    10129      +27     
- Misses       3136     3149      +13     
Flag Coverage Δ
unittests 76.28% <86.27%> (-0.03%) ⬇️

@Yikun Yikun added accuracy-test enable all accuracy test for PR ready-for-test start test by label for PR labels Jul 24, 2025
@shen-shanshan shen-shanshan force-pushed the air branch 3 times, most recently from e5eff76 to 8915a39 on July 25, 2025 07:12
@shen-shanshan shen-shanshan requested a review from wangxiyuan July 25, 2025 08:11
@wangxiyuan
Collaborator

Let's wait for PR #1979 to be merged first.

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@shen-shanshan
Collaborator Author

@wangxiyuan The CI has passed.

@github-actions

github-actions bot commented Aug 1, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions

github-actions bot commented Aug 7, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@shen-shanshan shen-shanshan changed the title [Misc] Refactor V1 Attention for Better Extensibility [2/N][Refactor] Refactor V1 Attention for Better Extensibility Aug 10, 2025
@shen-shanshan shen-shanshan changed the title [2/N][Refactor] Refactor V1 Attention for Better Extensibility [2/N][Refactor] Refactor v1 attention for better extensibility Aug 10, 2025
@shen-shanshan shen-shanshan changed the title [2/N][Refactor] Refactor v1 attention for better extensibility [2/N][Refactor] Refactor V1 attention for better extensibility Aug 10, 2025
Signed-off-by: shen-shanshan <467638484@qq.com>
@shen-shanshan
Collaborator Author

@wangxiyuan I have adjusted this PR and kept only the methods that need to be overridden by child classes. I will handle the rest in later PRs.
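Keeping only the overridable methods matters because a child backend then replaces just the paths it changes. A minimal, hypothetical sketch (the torchair class name, base-class shape, and return values are all illustrative assumptions, not the real vllm-ascend classes):

```python
class AscendAttentionBackendImpl:
    """Stand-in for the base implementation in
    vllm_ascend/attention/attention_v1.py (shape assumed)."""

    def _forward_decode_only(self, query):
        # Default decode-only path.
        return ("base_decode", query)

class TorchairAttentionBackendImpl(AscendAttentionBackendImpl):
    """A hypothetical torchair backend: it overrides only the decode
    path and inherits every other forward path unchanged."""

    def _forward_decode_only(self, query):
        return ("torchair_decode", query)
```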

@shen-shanshan
Collaborator Author

@wangxiyuan The CI has passed. Can this be merged?

@shen-shanshan shen-shanshan added ready-for-test start test by label for PR accuracy-test enable all accuracy test for PR and removed ready-for-test start test by label for PR accuracy-test enable all accuracy test for PR labels Aug 13, 2025
@wangxiyuan wangxiyuan merged commit 55d0790 into vllm-project:main Aug 14, 2025
39 of 40 checks passed
zhoux77899 added a commit to zhoux77899/vllm-ascend that referenced this pull request Aug 14, 2025
… MoE layers (#3)

* feat(performance): support `GroupedMatmulSwigluQuant` in `W8A8_DYNAMIC` quantized MoE layers

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(lint): fix lint

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(bug): fix bug

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* feat(ops): enable grouped_matmul_swiglu_quant by default

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(lint): fix lint

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(test): fix broken test

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(lint): fix lint

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(test): temporarily skip broken test due to OOM

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(test): change bias1 to tensor

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(bug): update group_list handling and weight scale in dynamic methods

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(lint): fix lint

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(lint): fix lint

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* feat(ops): replace all split gmm and swiglu

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(lint): fix lint

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* feat(quantization): split w4a8 and w8a8 apply

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(test): replace w8a8 function in apply

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* feat(cumsum): add cumsum_group_list function for group list processing

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(lint): fix lint

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(lint): fix lint

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* [Doc] Add container image save/load FAQ for offline environments (vllm-project#2347)

### What this PR does / why we need it?

Add Docker export/import guide for air-gapped environments

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

NA

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@d16aa3d

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

* [Bugfix] fix the oom when chunkprefill with long context like 64k (vllm-project#2319)

The attn mask was declared in the mla.py,we don't need the splitfuse
mask when mla chunkprefill, and this mask will cause memory problem when
long context like 64k or 128k

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@14a5d90

---------

Signed-off-by: haojiangzheng <justineric096@gmail.com>

* [Quickfix] Add the missing `apply_router_weight_on_input` in FusedMoE init (vllm-project#2348)

### What this PR does / why we need it?
Add the missing `apply_router_weight_on_input` in FusedMoE init
Quick fix on
vllm-project#2268 (comment)

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@6807af8

Signed-off-by: MengqingCao <cmq0113@163.com>

* [2/N][Refactor] Refactor V1 attention for better extensibility (vllm-project#1995)

### What this PR does / why we need it?

Refactor V1 Attention for better extensibility (prepared for torchair
attention refactor).

**Main changes:**
- Move each kind of forward pass into its own method, e.g.,
`_forward_prefill_no_cache()`, `_forward_prefill_cache_hit()`,
`_forward_decode_only()`, `_forward_v1_style()`.

### Does this PR introduce _any_ user-facing change?

No.

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@14a5d90

Signed-off-by: shen-shanshan <467638484@qq.com>

* [Misc] Remove redundant imported `envs`, using `envs_ascend` instead (vllm-project#2193)

### What this PR does / why we need it?
Remove redundant imported `envs`, using `envs_ascend` instead.

```python
import vllm.envs as envs_vllm
import vllm_ascend.envs as envs_ascend
```

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@71683ca

---------

Signed-off-by: shen-shanshan <467638484@qq.com>

* feat(torchair): consider not using gmmswigluquant when torchair enabled

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(lint): fix lint

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(dtype): unify `w1_scale` dtype

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(lint): fix lint

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

* fix(lint): fix lint

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>

---------

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Signed-off-by: haojiangzheng <justineric096@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: jack <QwertyJack@users.noreply.github.com>
Co-authored-by: zhenghaojiang <zhjoneson@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
…project#1995)
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
…project#1995)