Enable kvcache_nz for the decode process in torchair graph mode #1098
Conversation
Signed-off-by: chenwaner <861645847@qq.com>

Set the env variable VLLM_ENABLE_KV_NZ to enable kvcache_nz, so that the kvcache layout is NZ during the decode process in graph mode. With the optimization disabled, the kvcache layout is ND by default.

Date: Fri Jun 6 11:19:12 2025 +0800
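For context, a minimal sketch of the environment-variable gate this commit describes; the helper name and default value are assumptions for illustration, not the PR's actual code (and the thread below moves the switch into the ascend config):

```python
import os

def kv_nz_enabled() -> bool:
    # Hypothetical helper: the NZ kvcache layout is opt-in via
    # VLLM_ENABLE_KV_NZ; when unset, the layout stays ND by default.
    return os.environ.get("VLLM_ENABLE_KV_NZ", "0") == "1"
```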
Have you tested the accuracy against a dataset? If possible, can you paste the performance boost rate from your test?

Use #1101 as the baseline.
Review context (diff excerpt): the `if self.enable_kv_nz:` branch around the call to `torch.ops.npu.npu_fused_infer_attention_score(...)`, near `self.qk_rope_head_dim`.
I just realized that this feature relies on the torchair graph, right? So please move this flag to the ascend config (i.e. additional_config) instead of an env variable.
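A sketch of what moving the flag into the ascend config could look like; the dataclass and class names here are assumptions modeled on the additional_config.torchair_graph_config.enable_kv_nz knob this PR eventually adopts, not the repository's actual plumbing:

```python
from dataclasses import dataclass

@dataclass
class TorchairGraphConfig:
    # Hypothetical mirror of the torchair graph section of additional_config;
    # the NZ kvcache layout for decode is off by default.
    enabled: bool = False
    enable_kv_nz: bool = False

class AscendAttentionImpl:
    def __init__(self, torchair_graph_config: TorchairGraphConfig):
        # Read the flag from the ascend config rather than an env variable,
        # since the feature only applies in torchair graph mode.
        self.enable_kv_nz = (torchair_graph_config.enabled
                             and torchair_graph_config.enable_kv_nz)
```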
Force-pushed from 029afcf to aa7ea20.
@ttanzhiqiang do you have accuracy test results for this PR?

This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from 8acb0f5 to 0b5e300.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from cabcca0 to aa5b9e4.
Signed-off-by: chenwaner <861645847@qq.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: chenwaner <861645847@qq.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
### What this PR does / why we need it?
W_UV/W_UK_T cannot be converted to NZ, because this position will be fused into transposebatchmatmul, which does not support NZ; the weights are actually converted back to ND in each run.
### Does this PR introduce _any_ user-facing change?
Using #1098 as the baseline, p90 TPOT improves from 90.79ms to 88.58ms, a gain of about 2ms.
### How was this patch tested?
Using #1101.
Signed-off-by: ttanzhiqiang <389825161@qq.com>
What this PR does / why we need it?
Enable kvcache_nz for the decode process in torchair graph mode, which reduces the time consumed by FA for long sequences.
Does this PR introduce any user-facing change?
To enable kvcache_nz, set additional_config.torchair_graph_config.enable_kv_nz=True.
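For illustration, a hedged offline-inference sketch of turning this knob on; the model name and the `enabled` key are assumptions about the usual vllm-ascend `additional_config` layout, with only `enable_kv_nz` confirmed by this PR:

```python
from vllm import LLM

# Sketch: enable the NZ kvcache layout for decode in torchair graph mode.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # assumed example model
    additional_config={
        "torchair_graph_config": {
            "enabled": True,        # assumed: torchair graph mode must be on
            "enable_kv_nz": True,   # the flag added by this PR
        },
    },
)
```

When the flag is left at its default (False), the kvcache keeps the ND layout described above.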
How was this patch tested?
1. Tested on a deepseek model: with batch size 64 and seq_len 1k+3k, the total FA time across the 61 layers improves from 20.80ms to 19.76ms.
2. Operator precision test: aclnnFusedInferAttentionScoreV3_result.csv
3. TPOT test from @ttanzhiqiang; one curl result is normal: #1098 (comment), #1098 (comment)