GPU Pallas decode_attention improvements #24602 (Merged, +243 −135)
Changes:
- Added `start_idx` and `kv_seq_len` arguments to support fixed-size cache decode and sliding-window decode attention.
- Fixed an out-of-bound indexing bug when `q_heads // kv_heads > block_h`.
- Changed the default `sm_scale` to `1 / math.sqrt(q.shape[-1])` to match the `jax.nn.dot_product_attention` default.
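For context, here is a minimal plain-JAX reference sketch of the semantics the new `start_idx` / `kv_seq_len` parameters and the `sm_scale` default are intended to provide. This is not the Pallas kernel itself; the function name, argument layout, and shapes below are illustrative assumptions, not the kernel's actual signature.

```python
import math
import jax
import jax.numpy as jnp


def reference_decode_attention(q, k, v, start_idx, kv_seq_len, sm_scale=None):
    """Illustrative reference semantics only (hypothetical helper, not the kernel API).

    Assumed shapes:
      q:    [num_q_heads, head_dim]               -- single decode step
      k, v: [max_seq_len, num_kv_heads, head_dim] -- fixed-size KV cache
    KV positions outside [start_idx, kv_seq_len) are masked out, which covers
    both a partially filled fixed-size cache and a sliding window.
    """
    if sm_scale is None:
        # Matches the jax.nn.dot_product_attention default scaling.
        sm_scale = 1.0 / math.sqrt(q.shape[-1])

    max_seq_len, num_kv_heads, head_dim = k.shape
    num_q_heads = q.shape[0]
    group = num_q_heads // num_kv_heads  # GQA: several q heads share one kv head
    q = q.reshape(num_kv_heads, group, head_dim)

    logits = jnp.einsum("hgd,shd->hgs", q, k) * sm_scale
    pos = jnp.arange(max_seq_len)
    valid = (pos >= start_idx) & (pos < kv_seq_len)
    logits = jnp.where(valid[None, None, :], logits, -jnp.inf)

    weights = jax.nn.softmax(logits, axis=-1)
    out = jnp.einsum("hgs,shd->hgd", weights, v)
    return out.reshape(num_q_heads, head_dim)
```

The GQA grouping above (`num_q_heads // num_kv_heads` query heads per KV head) is also the dimension where the out-of-bound indexing described in the second bullet can arise once the group size exceeds the kernel's head block size.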