atom/model_ops/attention_mha.py (4 changes: 3 additions & 1 deletion)
@@ -405,4 +405,6 @@ def dispatch_backend(self, fwd_args: ForwardContext):
             return self.paged_attention_triton
         else:
             # Qwen only uses gluon pa decode when bs=64
-            return self.paged_attention_triton if ctx.batch_size == 64 else self.paged_attention_asm
+            if ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION:
+                return self.paged_attention_triton if ctx.batch_size == 64 else self.paged_attention_asm
+            return self.paged_attention_asm
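For readability, here is a minimal sketch of how the tail of dispatch_backend reads after this change. It assumes ctx is derived from fwd_args and that ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION is a module-level flag; the branches above line 405 (including the condition discussed at line 404) are elided and stand in as a hypothetical earlier_triton_condition helper.

```python
# Sketch only, not the full method. `earlier_triton_condition` is a placeholder
# for the branches above line 405, which are not shown in this hunk.
def dispatch_backend(self, fwd_args: ForwardContext):
    ctx = fwd_args  # assumption: the hunk refers to the forward context as `ctx`
    if earlier_triton_condition(ctx):
        return self.paged_attention_triton
    else:
        # Qwen only uses gluon pa decode when bs=64, and only when the fused
        # QK-norm/RoPE/cache-quant path is enabled; otherwise fall back to asm.
        if ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION:
            return self.paged_attention_triton if ctx.batch_size == 64 else self.paged_attention_asm
        return self.paged_attention_asm
```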
Contributor:
Should we also make a similar change for paged_attention_triton at line 405?

Contributor (Author):
llama and qwen will not trigger the if condition at line 404; check line 132. One problem is that some of the if conditions here may not be orthogonal enough across the different models.
The accuracy of llama is back to normal with this PR:
[screenshot of accuracy results]

Contributor (@scxiao, Jan 8, 2026):
But the Qwen3 model failed for the gluon attn here: https://github.com/ROCm/ATOM/actions/runs/20798013041/job/59736412001?pr=56. Do you know why it failed?

Contributor (Author):
[screenshot]
The gluon pa API has been changed on the aiter side, but the integration remains unchanged in ATOM. bernard_ps_pa_upstream gave a fix, which is waiting to be merged.
