This is the official repository of "FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation".
- FastKV introduces a novel Token-Selective Propagation (TSP) approach, which retains full-context information in early layers while selectively propagating only critical tokens to later layers.
- This method significantly reduces KV cache size while maintaining accuracy, leading to improved latency and efficiency in long-context processing of LLMs.
- FastKV integrates GQA-aware KV cache compression, further optimizing memory and computation while leveraging grouped-query attention.
- Experimental results demonstrate that FastKV achieves up to 1.97× speedup in Time-To-First-Token (TTFT) and 5.07× higher throughput compared to full-context inference with 128k input tokens, all while preserving long-context accuracy.
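As a rough illustration of the token-selection idea behind TSP, the sketch below ranks context tokens by pooled attention mass and keeps only the top fraction for propagation to later layers. This is a hypothetical simplification, not the repository's implementation; the function name, `keep_ratio`, and `window` parameters are illustrative assumptions.

```python
import numpy as np

def select_tokens(attn_scores, keep_ratio=0.25, window=8):
    """Hypothetical sketch of TSP-style token selection.

    attn_scores: (num_heads, seq_len) attention weights that the last
    `window` query positions assign to each context token, already
    averaged over those queries.
    Returns the indices of the tokens to propagate, in original order.
    """
    seq_len = attn_scores.shape[-1]
    scores = attn_scores.mean(axis=0)            # pool importance over heads
    keep = max(window, int(seq_len * keep_ratio))
    idx = np.argsort(scores)[-keep:]             # top-scoring (critical) tokens
    return np.sort(idx)                          # keep positional order
```

Later layers would then attend only over the selected positions, shrinking both the KV cache and the attention computation for the remaining prefill.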
For more details, please check out our paper.
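The GQA-aware part can be pictured as pooling token-importance scores per KV-head group before deciding which cache entries to keep, since grouped-query attention shares one KV head across several query heads. A minimal sketch, assuming max-pooling within each group (the pooling choice and function name are assumptions for illustration):

```python
import numpy as np

def gqa_pool_scores(head_scores, num_kv_heads):
    """Pool per-query-head token scores down to one score per KV head.

    head_scores: (num_q_heads, seq_len) token-importance scores.
    With GQA, each KV head serves num_q_heads // num_kv_heads query
    heads, so selection must be made per KV group, not per query head.
    """
    num_q_heads, seq_len = head_scores.shape
    group = num_q_heads // num_kv_heads
    # Reshape into (kv_head, group, seq_len) and take the max over each group.
    return head_scores.reshape(num_kv_heads, group, seq_len).max(axis=1)
```

Selecting KV entries at the granularity of shared KV heads avoids keeping a token for one query head while evicting it for a sibling head that reads the same cache slot.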
Install FastKV and its dependencies:
```bash
conda create -n fastkv python=3.9
conda activate fastkv
cd FastKV
pip install -r requirements.txt
pip install flash-attn==2.6.3

# For AdaKV and HeadKV
cd baseline/adakv
make i
```
Run inference with FastKV and evaluate on LongBench, Needle-in-a-Haystack, and the speedup benchmarks:
```bash
# Run LongBench Evaluation
./scripts/run_longbench.sh

# Run Needle-in-a-Haystack Evaluation
./scripts/run_needle.sh

# Run TTFT Benchmark
./scripts/run_ttft.sh

# Run Throughput Benchmark
./scripts/run_throughput.sh
```
| Model | FastKV | GemFilter | SnapKV | AdaKV | HeadKV |
|---|---|---|---|---|---|
| LLaMA | O | O | O | O | O |
| Mistral | O | O | O | O | O |
Our implementation of FastKV is based on code from the SnapKV repository.
We have integrated the baseline methods (SnapKV, AdaKV, HeadKV, GemFilter) for experiments and evaluation; we thank the authors for their open-source contributions.
If you use the FastKV approach in your research, please consider citing:
```bibtex
@article{fastkv,
  title={FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation},
  author={Jo, Dongwon and Song, Jiwon and Kim, Yulhwa and Kim, Jae-Joon},
  journal={arXiv preprint arXiv:2502.01068},
  year={2025}
}
```