This is the CUDA kernel implementation of MoA (*Mixture of Sparse Attention for Automatic Large Language Model Compression*).
We test our kernel with CUDA 12.4 and PyTorch 2.4. Install the required environment for MoA before installing the kernel.
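As a quick sanity check (our own suggestion, not part of the official setup), you can confirm that the PyTorch build in your environment matches these versions:

```python
# Environment check: print the PyTorch version and the CUDA toolkit it
# was built against. Expected here: PyTorch 2.4.x with CUDA 12.4.
import torch

print("PyTorch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
```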
```bash
cd python
FLASHINFER_LOGITS_POST_HOOKS=0 FLASHINFER_HEAD_DIMS=64,128 FLASHINFER_POS_ENCODING_MODES=0 python setup.py install
```

The `FLASHINFER_*` variables restrict which FlashInfer kernel variants are compiled, which keeps build time manageable. After installation, run the accuracy test to verify the kernel:

```bash
python accuracy_test.py
```
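An accuracy test of this kind typically compares the sparse kernel's output against a plain PyTorch attention computed under the same mask. The sketch below illustrates only that comparison pattern; it is not taken from `accuracy_test.py`, and it stands in `torch.nn.functional.scaled_dot_product_attention` for the custom kernel:

```python
# Sketch: banded (sliding-window) attention computed two ways, then
# compared with allclose. Shapes and tolerances are illustrative only.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, N, D, window = 1, 4, 128, 64, 32
q, k, v = (torch.randn(B, H, N, D) for _ in range(3))

# Banded causal mask: token i attends to tokens [i - window + 1, i].
idx = torch.arange(N)
mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)

# Reference: explicit masked softmax attention.
scores = (q @ k.transpose(-1, -2)) / D**0.5
ref = scores.masked_fill(~mask, float("-inf")).softmax(-1) @ v

# Stand-in for the kernel under test: fused SDPA with the same mask.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

assert torch.allclose(out, ref, atol=1e-5), "kernel output mismatch"
print("max abs err:", (out - ref).abs().max().item())
```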
Our kernel is built upon the FlashInfer project.

TODO:
- support batch size > 1
- support multi-GPU inference
- support GQA (see the sketch below)
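For the GQA item, grouped-query attention shares each KV head across a group of query heads, so the kernel must map query head `h` to KV head `h // group_size`. A minimal PyTorch illustration of that mapping, with made-up shapes (our sketch, not the kernel's implementation):

```python
# GQA head mapping: Hq query heads share Hkv KV heads (Hq % Hkv == 0).
import torch

B, N, D = 1, 16, 64
Hq, Hkv = 8, 2                         # 4 query heads per KV head
q = torch.randn(B, Hq, N, D)
k = torch.randn(B, Hkv, N, D)
v = torch.randn(B, Hkv, N, D)

# Expand each KV head across its query-head group before attention.
group = Hq // Hkv
k = k.repeat_interleave(group, dim=1)  # (B, Hq, N, D)
v = v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k.transpose(-1, -2) / D**0.5, dim=-1) @ v
print(attn.shape)  # torch.Size([1, 8, 16, 64])
```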