Conversation

chenwaner (Contributor) commented Jun 3, 2025

What this PR does / why we need it?

Enable kvcache_nz for the decode process in graph mode, which reduces the time spent in FA (fused attention) on long sequences.
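
For context, "NZ" refers to Ascend's FRACTAL_NZ memory format. A minimal sketch of what enabling it means for the KV cache, assuming the standard torch_npu format-cast API (the format id constant and the tensor shape below are illustrative, not taken from this PR):

```python
import torch
import torch_npu  # Ascend adapter for PyTorch

# Ascend's fractal NZ memory format id (an ACL convention, not from this PR).
ACL_FORMAT_FRACTAL_NZ = 29

# Illustrative decode-time KV cache: (batch, heads, seq_len, head_dim).
kv_cache = torch.empty(64, 8, 4096, 128, dtype=torch.float16, device="npu")

# Casting to NZ lets the fused-attention op read the cache without an internal
# layout conversion, which is where the long-sequence savings would come from.
kv_cache_nz = torch_npu.npu_format_cast(kv_cache, ACL_FORMAT_FRACTAL_NZ)
```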

Does this PR introduce any user-facing change?

To enable kvcache_nz, set the environment variable VLLM_ENABLE_KV_NZ to "1".
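
For example (the reading side is a sketch of how the backend would plausibly parse the flag; only the variable name comes from this PR):

```python
import os

# Set before vLLM initializes its attention backend.
os.environ["VLLM_ENABLE_KV_NZ"] = "1"

# Plausible parsing on the backend side (illustrative, not the PR's code).
kv_nz_enabled = os.getenv("VLLM_ENABLE_KV_NZ", "0") == "1"
```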

How was this patch tested?

Tested on the DeepSeek model with batch size 64 and seq_len 1k+3k: total FA time across all 61 layers dropped from 20.80 ms to 19.76 ms (a ~5% reduction).

wangxiyuan mentioned this pull request Jun 4, 2025
wangxiyuan changed the title from "[Change Notes] kvcache nz" to "kvcache nz" Jun 4, 2025
chenwaner changed the title from "kvcache nz" to "[WIP]kvcache nz" Jun 4, 2025
github-actions bot commented Jun 4, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

chenwaner changed the title from "[WIP]kvcache nz" to "kvcache nz" Jun 5, 2025
realliujiaxu (Contributor) commented:

torch_npu.npu_fused_infer_attention_score only supports K/V in ND format (https://www.hiascend.com/document/detail/zh/Pytorch/700/apiref/apilist/ptaoplist_001232.html). Does this PR require a newer version of torch_npu and CANN?

Fall back to layout=BNSD in npu_fused_infer_attention_score when KV_NZ is disabled
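
A minimal sketch of that fallback, assuming the backend picks the layout string it passes to npu_fused_infer_attention_score; the NZ-branch tag is hypothetical, since this thread only confirms the BNSD fallback and the env-var gate:

```python
import os

def select_infer_attention_layout() -> str:
    """Pick the KV layout for npu_fused_infer_attention_score (sketch).

    "BNSD" is the ND-format layout the op is documented to accept; the
    NZ branch stands in for whatever format tag the PR actually uses.
    """
    if os.getenv("VLLM_ENABLE_KV_NZ", "0") == "1":
        return "BNSD_NZ"  # hypothetical tag for the NZ-format path
    return "BNSD"  # fallback from the commit above when KV_NZ is disabled
```
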
github-actions bot commented Jun 5, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions bot commented Jun 6, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

chenwaner closed this by deleting the head repository Jun 6, 2025