
Conversation

@chenwaner
Contributor

@chenwaner chenwaner commented Jun 6, 2025

What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode, which reduces the time consumed by FA in long sequences.

Does this PR introduce any user-facing change?
To enable kvcache_nz, set additional_config.torchair_graph_config.enable_kv_nz=True.
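For illustration, a minimal sketch of passing this option when constructing the engine; it assumes vLLM's `additional_config` engine argument reaches the Ascend plugin, and the model path is a placeholder:

```python
from vllm import LLM

# Minimal sketch, assuming additional_config is forwarded to the Ascend
# platform; the model path is a placeholder and the surrounding settings
# are illustrative, not taken from this PR.
llm = LLM(
    model="/path/to/DeepSeek-R1",  # placeholder
    additional_config={
        "torchair_graph_config": {
            "enabled": True,       # kvcache_nz only takes effect in torchair graph mode
            "enable_kv_nz": True,  # keep the decode KV cache in NZ layout
        }
    },
)
```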

How was this patch tested?

  1. Tested with the DeepSeek model:
    with batch size 64 and seq_len 1k+3k, total FA time across 61 layers improves from 20.80 ms to 19.76 ms
  2. Operator precision test:
    aclnnFusedInferAttentionScoreV3_result.csv
  3. TPOT test from @ttanzhiqiang; a single curl result is normal
    Enable kvcache_nz for the decode process in torchair graph mode #1098 (comment)
    Enable kvcache_nz for the decode process in torchair graph mode #1098 (comment)

Signed-off-by: chenwaner <861645847@qq.com>

Setting the env var VLLM_ENABLE_KV_NZ enables kvcache_nz, so that the kvcache layout is NZ during the decode process in graph mode. With the optimization disabled, the kvcache layout defaults to ND.

Date:      Fri Jun 6 11:19:12 2025 +0800
@ganyi1996ppo
Collaborator

Have you tested the accuracy on a dataset? And if possible, can you paste the performance improvement rate from your test?

@ttanzhiqiang
Contributor

Use #1101 as the baseline
(screenshot: 2025-06-06 16:41:36)
TPOT improves by about 1 ms.

self.qk_rope_head_dim)
    ...
attn_output, _ = torch.ops.npu.npu_fused_infer_attention_score(
    ...
if self.enable_kv_nz:
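For orientation, a hedged sketch of the ND/NZ dispatch around this call; apart from the op name taken from the diff, all parameter values and layout tags below are assumptions, not the PR's exact code:

```python
import torch
import torch_npu  # noqa: F401  # registers the torch.ops.npu.* custom ops


def decode_attention(query, key_cache, value_cache, block_table, block_size,
                     num_heads, scale, enable_kv_nz):
    # When the KV cache is kept in NZ layout, the fused kernel can skip an
    # internal ND->NZ conversion on every decode step; "BSND"/"BNSD" are
    # placeholder layout tags, not necessarily what the PR passes.
    input_layout = "BSND" if enable_kv_nz else "BNSD"
    attn_output, _ = torch.ops.npu.npu_fused_infer_attention_score(
        query, key_cache, value_cache,
        num_heads=num_heads,
        input_layout=input_layout,
        scale=scale,
        block_table=block_table,
        block_size=block_size)
    return attn_output
```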
Collaborator

I just realized that this feature relies on the torchair graph, right? So please move this flag to the ascend config (i.e. additional_config) instead of an env var.
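A rough sketch of the plumbing being requested, reading the flag from the ascend config instead of the environment; the module path and accessor below are assumptions inferred from the option name in the PR description:

```python
# Hypothetical: inside the attention backend's __init__, replace the
# os.environ lookup with the torchair graph section of the ascend config.
from vllm_ascend.ascend_config import get_ascend_config  # assumed location

ascend_config = get_ascend_config()
enable_kv_nz = ascend_config.torchair_graph_config.enable_kv_nz
```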

@chenwaner chenwaner force-pushed the main branch 2 times, most recently from 029afcf to aa7ea20 on June 6, 2025 at 10:04
@chenwaner
Contributor Author

Use #1101 as the baseline (screenshot: 2025-06-06 16:41:36); TPOT improves by about 1 ms.

@ttanzhiqiang do you have accuracy test results for this PR?

@github-actions

github-actions bot commented Jun 7, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@ttanzhiqiang
Contributor

Use #1101 as the baseline (screenshot: 2025-06-06 16:41:36); TPOT improves by about 1 ms.

@ttanzhiqiang do you have accuracy test results for this PR?
curl --location 'http://127.0.0.1:8006/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "top_p": 1,
    "model": "/mnt/deepseek/DeepSeek-R1-W8A8-VLLM",
    "ignore_eos": true,
    "stream": false,
    "max_tokens": 100,
    "stop": "None",
    "top_k": -1,
    "temperature": 0.5,
    "messages": [
        {
            "role": "system",
            "content": "who are you"
        }
    ]
}'
A single curl result is normal.

@github-actions

github-actions bot commented Jun 9, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions github-actions bot added the documentation label and removed the merge-conflicts label Jun 9, 2025
@github-actions

github-actions bot commented Jun 9, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@chenwaner chenwaner force-pushed the main branch 3 times, most recently from cabcca0 to aa5b9e4 on June 9, 2025 at 09:18
Signed-off-by: chenwaner <861645847@qq.com>
@github-actions

github-actions bot commented Jun 9, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: chenwaner <861645847@qq.com>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@chenwaner chenwaner changed the title from "kvcache nz" to "Enable kvcache_nz for the decode process in torchair graph mode" Jun 10, 2025
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@wangxiyuan wangxiyuan merged commit e46dc14 into vllm-project:main Jun 11, 2025
17 of 18 checks passed
jianzs pushed a commit that referenced this pull request Jun 15, 2025
### What this PR does / why we need it?
W_UV/W_UK_T cannot be converted to NZ, because at this position they are fused into TransposeBatchMatMul, which does not support NZ; the weights would effectively be converted back to ND on every run.
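To make the constraint concrete, a small hedged sketch of the cast being avoided; the format id is the standard ACL value for FRACTAL_NZ, but the weight name and shape are placeholders:

```python
import torch
import torch_npu

ACL_FORMAT_FRACTAL_NZ = 29  # standard ACL format id for the NZ fractal layout

w_uk_t = torch.randn(128, 512).npu()  # placeholder for W_UK_T
# Casting this weight to NZ would be wasted work: the fused
# TransposeBatchMatMul that consumes it only accepts ND, so the runtime
# would convert it back to ND on every forward pass.
w_uk_t_nz = torch_npu.npu_format_cast(w_uk_t, ACL_FORMAT_FRACTAL_NZ)
```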

### Does this PR introduce _any_ user-facing change?
Using #1098 as the baseline, p90 TPOT improves from 90.79 ms to 88.58 ms, about 2 ms.

### How was this patch tested?
Tested using #1101.

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
…-project#1098)

momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
…-project#1098)

momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
…-project#1098)

shiyuan680 pushed a commit to raindaywhu/vllm-ascend that referenced this pull request Jul 7, 2025
…-project#1098)

shiyuan680 pushed a commit to raindaywhu/vllm-ascend that referenced this pull request Jul 7, 2025
…t#1131)

chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
…-project#1098)

chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
…t#1131)

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
…-project#1098)

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
…t#1131)
