# CUDA torch functions for LLM

For study purposes.

## Implemented attention variants

- Naive Attention (see the PyTorch reference sketch below)
- Attention with KV cache
- Attention with non-contiguous memory
- Single Query Attention with a non-contiguous KV cache (PagedAttention with block size 1)
- Multi Query Attention with a non-contiguous KV cache (for Speculative Decoding)
- Rotary Embedding
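
As a rough guide to what these kernels compute, here is a minimal pure-PyTorch sketch of naive attention and rotary embedding. The function names and shapes are illustrative assumptions, not this repo's actual API; the CUDA kernels can be checked against outputs like these.

```python
import torch

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    # Plain scaled dot-product attention, no KV cache or paging.
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)

def apply_rotary_embedding(x, theta=10000.0):
    # x: (batch, heads, seq_len, head_dim); head_dim must be even.
    # Rotates each consecutive (even, odd) pair of channels by a
    # position-dependent angle.
    b, h, s, d = x.shape
    freqs = theta ** (-torch.arange(0, d, 2, device=x.device).float() / d)
    angles = torch.arange(s, device=x.device).float()[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (seq_len, head_dim // 2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```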