Commit 084a9da

[Bugfix] Disable FlexAttention direct block mask building for encoder-only models (#27344)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Parent: c9461e0

1 file changed: 4 additions, 1 deletion

vllm/v1/attention/backends/flex_attention.py

@@ -658,7 +658,10 @@ def build(
             total_cache_tokens=total_cache_tokens,
             decode_offset=offset_tensor,
             num_blocks_per_seq=num_blocks_per_seq,
-            direct_build=self.direct_build,
+            # FIXME(Isotr0py): direct build has issue to build bidirectional
+            # attention block mask for encoder-only models, disable it temporarily.
+            # see: https://github.com/vllm-project/vllm/pull/27329#issuecomment-3431484053
+            direct_build=(self.direct_build and common_attn_metadata.causal),
             q_block_size=self.q_block_size,
             kv_block_size=self.kv_block_size,
         )
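The guard makes sense given FlexAttention's mask convention: a causal mask is lower-triangular, while encoder-only models need a fully bidirectional mask, so a builder specialized for the causal structure cannot be reused as-is. A minimal standalone sketch (not vLLM code; all names here are illustrative) of the two mask shapes and the guard expression added by this commit:

```python
# Illustrative sketch only: mirrors the FlexAttention mask_mod convention,
# where mask(q_idx, kv_idx) -> True means "query q may attend to key kv".
def causal_mask(q_idx: int, kv_idx: int) -> bool:
    # Decoder-style: a token attends only to itself and earlier tokens.
    return kv_idx <= q_idx

def bidirectional_mask(q_idx: int, kv_idx: int) -> bool:
    # Encoder-only style: every token attends to every token.
    return True

def build_dense_mask(mask_fn, n: int) -> list[list[bool]]:
    # Materialize the n x n boolean attention mask for inspection.
    return [[mask_fn(q, k) for k in range(n)] for q in range(n)]

def choose_direct_build(direct_build: bool, causal: bool) -> bool:
    # The guard from this commit: the fast direct block-mask builder is
    # only taken when the mask is causal; bidirectional (encoder-only)
    # runs fall back to the generic, slower but correct path.
    return direct_build and causal
```

For a 3-token sequence, `build_dense_mask(causal_mask, 3)` yields a lower-triangular mask, whereas the bidirectional variant is all-True, which is why only the causal case keeps `direct_build` enabled.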
