Add Custom Kernels For LoRA Performance #1884
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #1884      +/-   ##
==========================================
+ Coverage   72.35%   73.12%    +0.76%
==========================================
  Files           88       90        +2
  Lines         9666     9956      +290
==========================================
+ Hits          6994     7280      +286
- Misses        2672     2676        +4
| " int added_vocab_end_index) -> (Tensor masked_input, Tensor mask)"); | ||
| ops.impl("get_masked_input_and_mask", torch::kPrivateUse1, &vllm_ascend::get_masked_input_and_mask); | ||
|
|
||
| ops.def("bgmv_shrink(Tensor! x, Tensor! weight, Tensor! indices, Tensor! y, float scale) -> ()"); |
Once bgmv_shrink and bgmv_expand are exposed, I think we should use them to replace the common implementation as well. For example here: https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/lora/punica_wrapper/punica_npu.py#L51
I have already replaced the relevant interface calls in punica_npu.py.
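For reference, a minimal sketch of what that replacement in punica_npu.py could look like. The op namespace (torch.ops._C) is an assumption, not taken from this diff; only the bgmv_shrink signature follows the ops.def shown above.

```python
import torch

def shrink(y: torch.Tensor,
           x: torch.Tensor,
           w_t_all: torch.Tensor,
           lora_indices: torch.Tensor,
           scale: float) -> None:
    # Sketch only: dispatch to the custom AscendC kernel instead of the
    # generic bgmv_shrink path. The signature mirrors the ops.def above:
    #   bgmv_shrink(Tensor! x, Tensor! weight, Tensor! indices, Tensor! y, float scale)
    # The "_C" namespace is assumed for illustration.
    torch.ops._C.bgmv_shrink(x, w_t_all, lora_indices, y, scale)
```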
What this PR does / why we need it?
Add two custom kernels (bgmv_shrink and bgmv_expand) to improve LoRA performance.
Does this PR introduce any user-facing change?
No user-facing change.
How was this patch tested?
We added unit tests for the custom AscendC kernels; see vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py.

In an actual test of the Qwen2.5 7B model with vllm-ascend v0.9.2.rc1, TTFT, TPOT, and throughput all improved by about 70%.

- vLLM version: v0.9.2
- vLLM main: vllm-project/vllm@40d86ee
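As a rough illustration of the kind of check such a unit test performs, here is a minimal sketch that compares the shrink kernel against a pure-PyTorch reference. The op namespace (torch.ops._C), the weight layout [num_loras, rank, hidden], and the shapes are assumptions for illustration, not copied from the actual test files.

```python
import torch

def ref_bgmv_shrink(x, weight, indices, scale):
    # Reference: each token is multiplied by the LoRA A matrix selected
    # by its index, then scaled.
    # x: [num_tokens, hidden], weight: [num_loras, rank, hidden]
    selected = weight[indices]                    # [num_tokens, rank, hidden]
    return scale * torch.einsum("th,trh->tr", x, selected)

num_tokens, hidden, rank, num_loras = 8, 128, 16, 4
x = torch.randn(num_tokens, hidden)
weight = torch.randn(num_loras, rank, hidden)
indices = torch.randint(0, num_loras, (num_tokens,))
y = torch.zeros(num_tokens, rank)

expected = ref_bgmv_shrink(x, weight, indices, 0.5)

# On an NPU build, the custom kernel would be exercised and compared
# against the reference with a small tolerance, e.g.:
# torch.ops._C.bgmv_shrink(x.npu(), weight.npu(), indices.npu(), y.npu(), 0.5)
# torch.testing.assert_close(y.cpu(), expected, rtol=1e-2, atol=1e-2)
```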