This is the CUDA kernel implementation of MoA (*Mixture of Sparse Attention for Automatic Large Language Model Compression*).
We test our kernel with CUDA 12.4 and PyTorch 2.4. Install the required environment for MoA before installing the kernel.
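As a quick sanity check (our own suggestion, not part of the official setup), you can confirm that the PyTorch build in your environment matches these versions:

```python
# Environment check: print the PyTorch version and the CUDA toolkit it
# was built against. Expected here: PyTorch 2.4.x with CUDA 12.4.
import torch

print("PyTorch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
```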
```bash
cd python
FLASHINFER_LOGITS_POST_HOOKS=0 FLASHINFER_HEAD_DIMS=64,128 FLASHINFER_POS_ENCODING_MODES=0 python setup.py install
```

The `FLASHINFER_*` variables restrict which FlashInfer kernel variants are compiled, which keeps build time manageable. After installation, run the accuracy test to verify the kernel:

```bash
python accuracy_test.py
```
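An accuracy test of this kind typically compares the sparse kernel's output against a plain PyTorch attention computed under the same mask. The sketch below illustrates only that comparison pattern; it is not taken from `accuracy_test.py`, and it stands in `torch.nn.functional.scaled_dot_product_attention` for the custom kernel:

```python
# Sketch: banded (sliding-window) attention computed two ways, then
# compared with allclose. Shapes and tolerances are illustrative only.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, N, D, window = 1, 4, 128, 64, 32
q, k, v = (torch.randn(B, H, N, D) for _ in range(3))

# Banded causal mask: token i attends to tokens [i - window + 1, i].
idx = torch.arange(N)
mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)

# Reference: explicit masked softmax attention.
scores = (q @ k.transpose(-1, -2)) / D**0.5
ref = scores.masked_fill(~mask, float("-inf")).softmax(-1) @ v

# Stand-in for the kernel under test: fused SDPA with the same mask.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

assert torch.allclose(out, ref, atol=1e-5), "kernel output mismatch"
print("max abs err:", (out - ref).abs().max().item())
```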
Our kernel is built upon the FlashInfer project.

TODO:
- support batch size > 1
- support multi-GPU inference
- support GQA (see the sketch below)
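For the GQA item, grouped-query attention shares each KV head across a group of query heads, so the kernel must map query head `h` to KV head `h // group_size`. A minimal PyTorch illustration of that mapping, with made-up shapes (our sketch, not the kernel's implementation):

```python
# GQA head mapping: Hq query heads share Hkv KV heads (Hq % Hkv == 0).
import torch

B, N, D = 1, 16, 64
Hq, Hkv = 8, 2                         # 4 query heads per KV head
q = torch.randn(B, Hq, N, D)
k = torch.randn(B, Hkv, N, D)
v = torch.randn(B, Hkv, N, D)

# Expand each KV head across its query-head group before attention.
group = Hq // Hkv
k = k.repeat_interleave(group, dim=1)  # (B, Hq, N, D)
v = v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k.transpose(-1, -2) / D**0.5, dim=-1) @ v
print(attn.shape)  # torch.Size([1, 8, 16, 64])
```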