
Conversation

Contributor

@taoxudonghaha taoxudonghaha commented Jul 19, 2025

What this PR does / why we need it?

Add two custom kernels (bgmv_shrink and bgmv_expand) to improve LoRA performance.
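For context, a minimal NumPy sketch of what the two kernels compute (reference semantics only, not the actual AscendC implementation; shapes and argument names are assumptions based on the standard batched-grouped-matrix-vector (bgmv) LoRA formulation, and the real ops write in place rather than returning):

```python
import numpy as np

def bgmv_shrink(x, lora_a, indices, scale):
    # x: [num_tokens, hidden], lora_a: [num_loras, hidden, rank]
    # Each token is projected down by the LoRA-A matrix selected by its index.
    y = np.empty((x.shape[0], lora_a.shape[2]), dtype=x.dtype)
    for i, idx in enumerate(indices):
        y[i] = scale * (x[i] @ lora_a[idx])
    return y

def bgmv_expand(x, lora_b, indices, y):
    # x: [num_tokens, rank], lora_b: [num_loras, rank, hidden]
    # Each token is projected back up and accumulated into the base output y.
    for i, idx in enumerate(indices):
        y[i] += x[i] @ lora_b[idx]
    return y
```

The kernels fuse this per-token gather-and-matmul, which is what makes a batched custom op faster than looping over LoRA adapters on the host.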

Does this PR introduce any user-facing change?

no user-facing change

How was this patch tested?

We added unit tests for the custom AscendC kernels; see vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_shrink.py and vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py.
Based on an end-to-end test of the Qwen2.5 7B model on vllm-ascend v0.9.2rc1, TTFT, TPOT, and throughput all improved by about 70%.

Signed-off-by: taoxudonghaha <justsheldon@163.com>

codecov bot commented Jul 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.12%. Comparing base (ae560f7) to head (2678c2b).
⚠️ Report is 634 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1884      +/-   ##
==========================================
+ Coverage   72.35%   73.12%   +0.76%     
==========================================
  Files          88       90       +2     
  Lines        9666     9956     +290     
==========================================
+ Hits         6994     7280     +286     
- Misses       2672     2676       +4     
Flag Coverage Δ
unittests 73.12% <ø> (+0.76%) ⬆️

Flags with carried forward coverage won't be shown.


" int added_vocab_end_index) -> (Tensor masked_input, Tensor mask)");
ops.impl("get_masked_input_and_mask", torch::kPrivateUse1, &vllm_ascend::get_masked_input_and_mask);

ops.def("bgmv_shrink(Tensor! x, Tensor! weight, Tensor! indices, Tensor! y, float scale) -> ()");
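The `Tensor!` annotations and `-> ()` return in this schema mean the op mutates its arguments in place and returns nothing. A hedged Python sketch of that calling convention (NumPy stand-in, names assumed; the real op is dispatched through the PyTorch registry):

```python
import numpy as np

def bgmv_shrink_(x, weight, indices, y, scale):
    # Writes the shrunk projection into the caller-provided y and returns
    # nothing, mirroring the `Tensor!` (mutated) arguments and `-> ()` return.
    for i, idx in enumerate(indices):
        y[i] = scale * (x[i] @ weight[idx])
```

Registering the out-parameter form avoids an allocation per call, which matters on the decode path where the op runs every step.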
Collaborator
Once bgmv_shrink and bgmv_expand are exposed, I think we should use them to replace the common implementations as well. For example here: https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/lora/punica_wrapper/punica_npu.py#L51

Contributor Author

I have already replaced the relevant interface calls in punica_npu.py.
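The replacement in punica_npu.py presumably follows the usual pattern of preferring the custom NPU op and keeping the generic bgmv as a fallback. A hedged sketch of that dispatch (function names hypothetical, not the actual vllm-ascend code; in the real wrapper the custom kernel would be a `torch.ops.*` call):

```python
def add_shrink(y, x, weight, indices, scale, custom_op=None, fallback=None):
    # Prefer the custom kernel when it is available; otherwise fall back
    # to the common (generic) bgmv implementation.
    op = custom_op if custom_op is not None else fallback
    return op(y, x, weight, indices, scale)
```

Keeping the fallback path means the wrapper still works on builds where the custom AscendC kernels are not compiled in.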

@wangxiyuan wangxiyuan mentioned this pull request Jul 23, 2025
@wangxiyuan wangxiyuan merged commit 540336e into vllm-project:main Jul 29, 2025
25 checks passed
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Jul 30, 2025
- vLLM version: v0.9.2
- vLLM main: vllm-project/vllm@40d86ee
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Jul 30, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
