perf: accelerate gqa performance #356

Merged
2 commits merged into main on Jul 4, 2024
Conversation

@yzh119 (Collaborator) commented Jul 3, 2024

Changes:

  1. Prefetch page indices (we had already applied this optimization to the decode kernels, but not to the append/prefill kernels, which are used for GQA).
  2. Unlock the 1x4 warp layout from #322 (perf: use 1x4 warp layout for small query length). We had not enabled it previously because it made the binary size too large; we should further reduce some unnecessary template arguments.
  3. Optimize `threadblock_sync_mdo_states` for efficiently merging the attention states of multiple warps in a threadblock. Our previous implementation assumed a small shared memory size and interleaved shared memory reads/writes with computation, which is less efficient than bulk shared memory access.
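For context on item 3, the "mdo" state of a warp is the (max, denominator, output) triple of online softmax; two partial attention states over disjoint key ranges can be merged exactly. A minimal Python sketch of that merge rule (the math the kernel implements per-thread; function names here are illustrative, not the actual flashinfer API):

```python
import math

def partial_state(scores, values):
    # Attention state (m, d, o) for one chunk of keys:
    # m = running max logit, d = softmax denominator, o = normalized output.
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    d = sum(w)
    dim = len(values[0])
    o = [sum(w[i] * values[i][j] for i in range(len(w))) / d for j in range(dim)]
    return (m, d, o)

def merge_states(s1, s2):
    # Merge two partial states into the state over the union of their keys.
    m1, d1, o1 = s1
    m2, d2, o2 = s2
    m = max(m1, m2)
    c1, c2 = math.exp(m1 - m), math.exp(m2 - m)
    d = d1 * c1 + d2 * c2
    o = [(x1 * d1 * c1 + x2 * d2 * c2) / d for x1, x2 in zip(o1, o2)]
    return (m, d, o)

# Merging chunk states reproduces full-sequence softmax attention.
scores = [0.5, 2.0, -1.0, 1.5]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 1.0]]
full = partial_state(scores, values)
merged = merge_states(partial_state(scores[:2], values[:2]),
                      partial_state(scores[2:], values[2:]))
```

Because the merge is associative, each warp can compute its state independently and the threadblock can combine them in one pass; the PR's change is about staging those states through shared memory in bulk rather than interleaving accesses with computation.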

After this PR, GQA kernel execution time on H100 (batch_size=128, seq_len=1024, num_qo_heads=32, num_kv_heads=4, head_dim=128) improved from 133us to 103us.
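In the benchmarked setting, each KV head is shared by a group of query heads, which is why kernels built for short query lengths (the append/prefill path and the 1x4 warp layout) apply to GQA decode. A small sketch of that head mapping (the helper name is hypothetical, used only to illustrate the grouping):

```python
# GQA head grouping for the benchmarked configuration.
num_qo_heads, num_kv_heads = 32, 4
group_size = num_qo_heads // num_kv_heads  # 8 query heads share each KV head

def kv_head_of(q_head: int) -> int:
    # Contiguous groups of `group_size` query heads read the same KV head,
    # so the group can be processed like a query of length `group_size`.
    return q_head // group_size

mapping = [kv_head_of(q) for q in range(num_qo_heads)]
```

Treating the 8-head group as a length-8 "query" is what makes the small-query-length warp layout from #322 a good fit here.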

@yzh119 yzh119 merged commit e56ddad into main Jul 4, 2024
@yzh119 yzh119 deleted the accelerate-gqa branch July 5, 2024 22:57
yzh119 added a commit that referenced this pull request Jul 12, 2024
🤖 I have created a release *beep* *boop*
---


## [0.0.9](v0.0.8...v0.0.9) (2024-07-12)

### Bug Fixes

* fix the decode kernel segfault in cudagraph mode
([#368](https://github.com/flashinfer-ai/flashinfer/pull/368))([c69cfa](https://github.com/flashinfer-ai/flashinfer/commit/c69cfabc540e4a7edd991713df10d575ff3b0c21))
* fix decode kernels output for empty kv cache
([#363](https://github.com/flashinfer-ai/flashinfer/pull/363))([ac72b1](https://github.com/flashinfer-ai/flashinfer/commit/ac72b1cc14a6474d601f371c8d69e2600ac28d2f))
* check gpu id in PyTorch APIs and use input tensor's gpu default stream
([#361](https://github.com/flashinfer-ai/flashinfer/pull/361))([1b84fa](https://github.com/flashinfer-ai/flashinfer/commit/1b84fab3e4f53fb4fa26952fdb46fa8018634057))

### Performance Improvements

* accelerate alibi
([#365](#365))
([4f0a9f9](4f0a9f9))
* accelerate gqa performance
([#356](#356))
([e56ddad](e56ddad))
* Optimize tensor conversions in C++ code to avoid unnecessary copies
([#366](#366))
([1116237](1116237))

### Acknowledgement

We thank [@Yard1](https://github.com/Yard1),
[@Ying1123](https://github.com/Ying1123) and
[@zhyncs](https://github.com/zhyncs) for their contributions.

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>