We introduce SnapKV, an innovative, out-of-the-box KV cache compression method.
Currently tested with `transformers==4.37.0`; compatibility with newer versions has not yet been verified.
```
transformers>=4.36
flash-attn==2.4.0
```
```shell
git clone git@github.com:FasterDecoding/SnapKV.git
cd SnapKV
pip install -e .
```
For example:

```python
from snapkv.monkeypatch.monkeypatch import replace_mistral

replace_mistral()  # use monkey patches to enable SnapKV
```
Check the example notebook.
SnapKV can be easily integrated with other models.
You can follow the comments marked with `[SnapKV]`
in the existing model implementations to adapt your own models. (Currently we support the Llama family, Mistral, and Mixtral.)
The detailed algorithm of SnapKV is implemented in `snapkv_utils.py`.
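To make the idea concrete, here is a minimal, hedged sketch of SnapKV's selection step in NumPy: attention from a small "observation window" of the last prompt queries votes for important prefix positions, the votes are smoothed with 1D pooling, and only the top-scoring positions are kept in the cache. The function name `snapkv_select` and the defaults are illustrative, not the official API; see `snapkv_utils.py` for the real implementation.

```python
import numpy as np

def snapkv_select(attn_weights, window_size, max_capacity, kernel_size=5):
    """Illustrative sketch of SnapKV-style KV selection (not the official code).

    attn_weights: array of shape (num_heads, window_size, seq_len) holding
        attention from the last `window_size` observation queries over all keys.
    max_capacity: total KV budget per head, including the observation window.
    Returns per-head sorted indices of prefix positions to keep in the cache.
    """
    prefix_len = attn_weights.shape[-1] - window_size
    # Vote: total attention mass each prefix position receives from the window.
    votes = attn_weights[:, :, :prefix_len].sum(axis=1)  # (heads, prefix_len)
    # 1D pooling clusters neighboring important positions instead of
    # selecting isolated tokens.
    pad = kernel_size // 2
    padded = np.pad(votes, ((0, 0), (pad, pad)), mode="edge")
    pooled = np.stack(
        [padded[:, i:i + prefix_len] for i in range(kernel_size)]
    ).mean(axis=0)  # (heads, prefix_len)
    # Keep the top-(budget - window) prefix positions per head; the
    # observation window itself is always retained.
    k = max_capacity - window_size
    top = np.argsort(pooled, axis=-1)[:, -k:]
    return np.sort(top, axis=-1)
```

The selected indices would then be used to gather the corresponding key/value entries per head, shrinking the cache to `max_capacity` entries before decoding begins.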
- Add observation experiments for reproduction.
- Add LongBench for reproduction.
- Explore prompt-phase compression.
If you find this project helpful, please consider citing our report 😊
```bibtex
@article{li2024snapkv,
  title={SnapKV: LLM Knows What You are Looking for Before Generation},
  author={Li, Yuhong and Huang, Yingbing and Yang, Bowen and Venkitesh, Bharat and Locatelli, Acyr and Ye, Hanchen and Cai, Tianle and Lewis, Patrick and Chen, Deming},
  journal={arXiv preprint arXiv:2404.14469},
  year={2024}
}
```