Enable kvcache_nz for the decode process in torchair graph mode #1098
Conversation
Signed-off-by: chenwaner <861645847@qq.com>

Set the env variable VLLM_ENABLE_KV_NZ to enable kvcache_nz, so that the kvcache layout is NZ during the decode process in graph mode. With the optimization disabled, the kvcache layout is ND by default.

Date: Fri Jun 6 11:19:12 2025 +0800
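For context, a minimal sketch of the environment-variable gate this commit describes; the helper name and default value are assumptions for illustration, not the PR's actual code (and the thread below moves the switch into the ascend config):

```python
import os

def kv_nz_enabled() -> bool:
    # Hypothetical helper: the NZ kvcache layout is opt-in via
    # VLLM_ENABLE_KV_NZ; when unset, the layout stays ND by default.
    return os.environ.get("VLLM_ENABLE_KV_NZ", "0") == "1"
```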
Have you tested the accuracy against a dataset? If possible, can you paste the performance boost rate from your test?

Use #1101 as the baseline.
Review context (diff excerpt): the `if self.enable_kv_nz:` branch around the call to `torch.ops.npu.npu_fused_infer_attention_score(...)`, near `self.qk_rope_head_dim`.
I just realized that this feature relies on the torchair graph, right? So please move this flag to the ascend config (i.e. additional_config) instead of an env variable.
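A sketch of what moving the flag into the ascend config could look like; the dataclass and class names here are assumptions modeled on the additional_config.torchair_graph_config.enable_kv_nz knob this PR eventually adopts, not the repository's actual plumbing:

```python
from dataclasses import dataclass

@dataclass
class TorchairGraphConfig:
    # Hypothetical mirror of the torchair graph section of additional_config;
    # the NZ kvcache layout for decode is off by default.
    enabled: bool = False
    enable_kv_nz: bool = False

class AscendAttentionImpl:
    def __init__(self, torchair_graph_config: TorchairGraphConfig):
        # Read the flag from the ascend config rather than an env variable,
        # since the feature only applies in torchair graph mode.
        self.enable_kv_nz = (torchair_graph_config.enabled
                             and torchair_graph_config.enable_kv_nz)
```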
Force-pushed from 029afcf to aa7ea20.
@ttanzhiqiang do you have accuracy test results for this PR?

This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from 8acb0f5 to 0b5e300.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from cabcca0 to aa5b9e4.
Signed-off-by: chenwaner <861645847@qq.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: chenwaner <861645847@qq.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
### What this PR does / why we need it?
W_UV/W_UK_T cannot be converted to NZ, because this position will be fused into transposebatchmatmul, which does not support NZ; the weights are actually converted back to ND in each run.
### Does this PR introduce _any_ user-facing change?
Using #1098 as the baseline, p90 TPOT improves from 90.79ms to 88.58ms, a gain of about 2ms.
### How was this patch tested?
Using #1101.
Signed-off-by: ttanzhiqiang <389825161@qq.com>
What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode, which reduces the time consumed by FA for long sequences.
Does this PR introduce any user-facing change?
To enable kvcache_nz, set additional_config.torchair_graph_config.enable_kv_nz=True.
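For illustration, a hedged offline-inference sketch of turning this knob on; the model name and the `enabled` key are assumptions about the usual vllm-ascend `additional_config` layout, with only `enable_kv_nz` confirmed by this PR:

```python
from vllm import LLM

# Sketch: enable the NZ kvcache layout for decode in torchair graph mode.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # assumed example model
    additional_config={
        "torchair_graph_config": {
            "enabled": True,        # assumed: torchair graph mode must be on
            "enable_kv_nz": True,   # the flag added by this PR
        },
    },
)
```

When the flag is left at its default (False), the kvcache keeps the ND layout described above.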
How was this patch tested?
1. Tested on a deepseek model: with batch size 64 and seq_len 1k+3k, the total FA time across the 61 layers improves from 20.80ms to 19.76ms.
2. Operator precision test: aclnnFusedInferAttentionScoreV3_result.csv
3. TPOT test from @ttanzhiqiang; one curl result is normal: #1098 (comment), #1098 (comment)