Add Custom Kernels For LoRA Performance #1884
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #1884      +/-   ##
==========================================
+ Coverage   72.35%   73.12%    +0.76%
==========================================
  Files           88       90        +2
  Lines         9666     9956      +290
==========================================
+ Hits          6994     7280      +286
- Misses        2672     2676        +4
| " int added_vocab_end_index) -> (Tensor masked_input, Tensor mask)"); | ||
| ops.impl("get_masked_input_and_mask", torch::kPrivateUse1, &vllm_ascend::get_masked_input_and_mask); | ||
|
|
||
| ops.def("bgmv_shrink(Tensor! x, Tensor! weight, Tensor! indices, Tensor! y, float scale) -> ()"); |
Once bgmv_shrink and bgmv_expand are exposed, I think we should use them to replace the common implementation as well. For example here: https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/lora/punica_wrapper/punica_npu.py#L51
I have already replaced the relevant interface calls in punica_npu.py.
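For reference, a minimal sketch of what that replacement in punica_npu.py could look like. The op namespace (torch.ops._C) is an assumption, not taken from this diff; only the bgmv_shrink signature follows the ops.def shown above.

```python
import torch

def shrink(y: torch.Tensor,
           x: torch.Tensor,
           w_t_all: torch.Tensor,
           lora_indices: torch.Tensor,
           scale: float) -> None:
    # Sketch only: dispatch to the custom AscendC kernel instead of the
    # generic bgmv_shrink path. The signature mirrors the ops.def above:
    #   bgmv_shrink(Tensor! x, Tensor! weight, Tensor! indices, Tensor! y, float scale)
    # The "_C" namespace is assumed for illustration.
    torch.ops._C.bgmv_shrink(x, w_t_all, lora_indices, y, scale)
```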
What this PR does / why we need it?
Add two custom kernels (bgmv_shrink and bgmv_expand) to improve LoRA performance.
Does this PR introduce any user-facing change?
No user-facing change.
How was this patch tested?
We added unit tests for the custom AscendC kernels; see vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py.

In an actual test of the Qwen2.5 7B model with vllm-ascend v0.9.2.rc1, TTFT, TPOT, and throughput all improved by about 70%.

- vLLM version: v0.9.2
- vLLM main: vllm-project/vllm@40d86ee
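As a rough illustration of the kind of check such a unit test performs, here is a minimal sketch that compares the shrink kernel against a pure-PyTorch reference. The op namespace (torch.ops._C), the weight layout [num_loras, rank, hidden], and the shapes are assumptions for illustration, not copied from the actual test files.

```python
import torch

def ref_bgmv_shrink(x, weight, indices, scale):
    # Reference: each token is multiplied by the LoRA A matrix selected
    # by its index, then scaled.
    # x: [num_tokens, hidden], weight: [num_loras, rank, hidden]
    selected = weight[indices]                    # [num_tokens, rank, hidden]
    return scale * torch.einsum("th,trh->tr", x, selected)

num_tokens, hidden, rank, num_loras = 8, 128, 16, 4
x = torch.randn(num_tokens, hidden)
weight = torch.randn(num_loras, rank, hidden)
indices = torch.randint(0, num_loras, (num_tokens,))
y = torch.zeros(num_tokens, rank)

expected = ref_bgmv_shrink(x, weight, indices, 0.5)

# On an NPU build, the custom kernel would be exercised and compared
# against the reference with a small tolerance, e.g.:
# torch.ops._C.bgmv_shrink(x.npu(), weight.npu(), indices.npu(), y.npu(), 0.5)
# torch.testing.assert_close(y.cpu(), expected, rtol=1e-2, atol=1e-2)
```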