
FlashMLA

FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequence serving.

Currently released:

  • BF16 and FP16 data types
  • Paged KV cache with a block size of 64 (a layout sketch follows this list)
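
With a paged KV cache, each sequence's cache lives in fixed-size blocks drawn from one shared pool and is addressed through a per-sequence block table. A minimal layout sketch; the shapes and head counts below are illustrative assumptions, not guaranteed constants of the kernel's API:

import torch

block_size = 64          # FlashMLA's paged KV cache block size
num_blocks = 1024        # total pool of cache blocks (assumption)
h_kv, d = 1, 576         # KV head count and head dim (illustrative assumptions)

# One shared pool of blocks, allocated once and reused across sequences.
kv_cache = torch.empty(num_blocks, block_size, h_kv, d,
                       dtype=torch.bfloat16, device="cuda")

# block_table[b, j] holds the index into kv_cache of the j-th block of
# sequence b, so a logical sequence can be scattered across the pool.
batch, max_blocks_per_seq = 4, 32
block_table = torch.zeros(batch, max_blocks_per_seq,
                          dtype=torch.int32, device="cuda")

# A sequence of length L occupies ceil(L / block_size) blocks; the actual
# lengths are tracked separately (cache_seqlens in the usage below).
cache_seqlens = torch.tensor([100, 37, 64, 200], dtype=torch.int32, device="cuda")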

Quick start

Install

python setup.py install

Benchmark

python tests/test_flash_mla.py

FlashMLA achieves up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations on an H800 SXM5 GPU with CUDA 12.6.
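
The memory-bound figure is simply bytes moved divided by elapsed time. A hedged sketch of how such a number can be derived from a kernel timing (this is not the repository's benchmark script, which is tests/test_flash_mla.py above):

import torch

def achieved_bandwidth_gbps(bytes_moved: int, elapsed_ms: float) -> float:
    # Bandwidth = bytes moved / elapsed time, reported in GB/s (1 GB = 1e9 B).
    return bytes_moved / (elapsed_ms / 1e3) / 1e9

start, end = (torch.cuda.Event(enable_timing=True) for _ in range(2))
start.record()
# ... launch the kernel under test here ...
end.record()
torch.cuda.synchronize()
elapsed_ms = start.elapsed_time(end)  # elapsed_time returns milliseconds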

Usage

from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Scheduling metadata depends only on sequence lengths and head counts
# (s_q: query length, h_q / h_kv: query / KV head counts), so it is
# computed once and shared by all layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

for i in range(num_layers):
    ...
    # dv is the head dimension of the values; the call returns the attention
    # output and the log-sum-exp of the softmax for each query.
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
    ...
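
A self-contained sketch that fills in the inputs above with dummy tensors. The shapes and the head-dimension split below are assumptions chosen to illustrate the call, not guaranteed constants of the API:

import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

batch, s_q = 4, 1            # decoding: one new query token per sequence
h_q, h_kv = 128, 1           # MLA: many query heads share one latent KV head
d, dv = 576, 512             # total head dim and value head dim (assumptions)
block_size = 64
num_layers = 2               # illustrative

cache_seqlens = torch.full((batch,), 1024, dtype=torch.int32, device="cuda")
max_blocks = (1024 + block_size - 1) // block_size
block_table = torch.arange(batch * max_blocks, dtype=torch.int32,
                           device="cuda").view(batch, max_blocks)
q = torch.randn(batch, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
kvcache = torch.randn(batch * max_blocks, block_size, h_kv, d,
                      dtype=torch.bfloat16, device="cuda")

tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)

for i in range(num_layers):
    o_i, lse_i = flash_mla_with_kvcache(
        q, kvcache, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
print(o_i.shape)  # expected: (batch, s_q, h_q, dv)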

Requirements

  • Hopper GPUs
  • CUDA 12.3 and above
  • PyTorch 2.0 and above
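
A quick sanity check for these requirements before building, as a minimal sketch:

import torch

# Hopper GPUs report compute capability 9.x (sm90).
major, _ = torch.cuda.get_device_capability()
assert major == 9, "FlashMLA targets Hopper GPUs"

# PyTorch 2.0+ built against CUDA 12.3+.
assert tuple(int(v) for v in torch.__version__.split(".")[:2]) >= (2, 0)
assert torch.version.cuda is not None and \
    tuple(int(v) for v in torch.version.cuda.split(".")) >= (12, 3)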

Acknowledgement

FlashMLA is inspired by the FlashAttention 2&3 and CUTLASS projects.

Citation

@misc{flashmla2025,
      title={FlashMLA: Efficient MLA decoding kernel},
      author={Jiashi Li},
      year={2025},
      publisher={GitHub},
      howpublished={\url{https://github.com/deepseek-ai/FlashMLA}},
}
