
Commit 4e39ecd

[Bugfix] Fix OOM when chunked prefill runs with long contexts (e.g., 64k)
Signed-off-by: haojiangzheng <justineric096@gmail.com>
1 parent 844a676

File tree

1 file changed (+1, -1)


vllm_ascend/worker/model_runner_v1.py

Lines changed: 1 addition & 1 deletion
@@ -829,7 +829,7 @@ def get_supported_tasks(self) -> "tuple[SupportedTask, ...]":
     def _make_attention_mask(self, seq_lens, query_lens, position,
                              attn_state) -> torch.Tensor:
         # Chunk Prefill situation.
-        if attn_state == AscendAttentionState.ChunkedPrefill:
+        if attn_state == AscendAttentionState.ChunkedPrefill and not self.vllm_config.model_config.use_mla:
             return self.attn_mask_builder.get_splitfuse_attn_mask(
                 seq_lens, query_lens, position, self.dtype, self.device)
         # Prefill without cache situation.
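
For a rough sense of the OOM this fixes: if the splitfuse attention mask is materialized as a dense [query_len, seq_len] tensor (an assumption for illustration, not shown in this diff), its size grows with both the prefill chunk and the total context, so a 64k-token context can cost on the order of a gibibyte per mask in fp16. With the added check, MLA models skip this allocation and fall through to the later branches of _make_attention_mask. The helper name mask_bytes and the sizes below are illustrative only.

# Illustrative estimate, assuming a dense [query_len, seq_len] mask in fp16/bf16
# (2 bytes per element); the real builder may use a different layout or dtype.
def mask_bytes(query_len: int, seq_len: int, dtype_bytes: int = 2) -> int:
    return query_len * seq_len * dtype_bytes

# Example: an 8k-token prefill chunk against a 64k-token context.
print(mask_bytes(8 * 1024, 64 * 1024) / 1024 ** 2, "MiB")  # 1024.0 MiB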
