
L2 Norm-Based KV Cache Compression

Repository for A Simple and Effective L2 Norm-Based Method for KV Cache Compression, presented at EMNLP 2024.

[Figure: L2 Norm KV Cache Compression]

TL;DR Tokens with low $L_2$ norm in their key embeddings correlate strongly with high attention scores (see figure). By selectively pruning the KV Cache to retain these important low-norm tokens, we reduce memory usage while maintaining model performance during inference.
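To make this concrete, here is a minimal PyTorch sketch of the selection idea. It is not the repository's implementation: the function name and the tensor shapes (batch, heads, sequence, head dim) are illustrative assumptions.

# Illustrative sketch: keep the fraction of cached tokens whose key vectors
# have the lowest L2 norm (these tend to receive high attention).
import torch

def select_low_norm_tokens(keys: torch.Tensor, values: torch.Tensor, keep_ratio: float):
    # keys, values: [batch, num_heads, seq_len, head_dim]
    seq_len = keys.shape[2]
    n_keep = max(1, int(seq_len * keep_ratio))
    key_norms = keys.norm(dim=-1)  # L2 norm per key vector: [batch, num_heads, seq_len]
    # Indices of the n_keep lowest-norm keys per head, restored to original order
    keep_idx = key_norms.topk(n_keep, dim=-1, largest=False).indices.sort(dim=-1).values
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])
    return keys.gather(2, gather_idx), values.gather(2, gather_idx)

Sorting the kept indices restores the original token order, so the retained keys and values remain in sequence order after pruning.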

Installation

After cloning the repo, install dependencies with:

pip install -r requirements.txt

Quickstart

You can use $L_2$ norm-based compression with any model from the Hugging Face Model Hub! Below is an example that demonstrates how to prune the KV cache after a forward pass.

from transformers import AutoModelForCausalLM

from cache import l2_compress

# Load a pre-trained language model
model = AutoModelForCausalLM.from_pretrained("your_model_id")

# Forward pass with cache enabled
outputs = model(some_input_ids, use_cache=True)

# Compress the KV cache, keeping the 90% of tokens whose keys have the lowest L2 norm
compressed_cache = l2_compress(
    outputs.past_key_values,  # original KV cache
    keep_ratio=0.9,           # fraction of the cache to retain; set to 1 for no compression
    prune_after=1048,         # prune the KV cache only if it holds more than this many tokens
    skip_layers=[0, 1]        # skip compression for layers 0 and 1
)

# Use the compressed KV cache in a subsequent forward pass
outputs = model(
    some_other_input_ids,                   
    past_key_values=compressed_cache,  
    use_cache=True               
)

Check notebooks/basic_example.ipynb for an example!
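For a fuller end-to-end flow, something along these lines should work; the model id, tokenizer handling, and prompts are illustrative choices, and l2_compress is called exactly as in the quickstart above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from cache import l2_compress

model_id = "meta-llama/Meta-Llama-3-8B"  # any causal LM from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Prefill a long prompt with the cache enabled
prompt = "A long document we want the model to keep in context ..."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    outputs = model(input_ids, use_cache=True)

# Compress the prefilled cache, then continue from it
compressed_cache = l2_compress(outputs.past_key_values, keep_ratio=0.9,
                               prune_after=1048, skip_layers=[0, 1])
next_ids = tokenizer(" And the question is:", return_tensors="pt",
                     add_special_tokens=False).input_ids.to(model.device)  # no extra BOS
with torch.no_grad():
    outputs = model(next_ids, past_key_values=compressed_cache, use_cache=True)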

Quick Language Modeling Experiment

To evaluate the model with a compressed KV cache on a subset of Wikipedia data, run:

python eval_lm_quick.py \
    --model_id=meta-llama/Meta-Llama-3-8B \
    --chunk_size=3000 \
    --keep_ratio=0.98 \
    --prune_after=1000 \
    --output_dir=<your_local_output_dir>

Here --chunk_size is the size of the input dataset, --keep_ratio is the fraction of the cache to retain (set to 1 for no compression), and --prune_after prunes the KV cache only once it holds more than that many tokens.

This script will compute and save scores to the specified output directory.
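Conceptually, the evaluation feeds the text to the model in chunks, prunes the KV cache between chunks, and accumulates the language-modeling loss. Below is a rough sketch of such a loop under that assumption; it is illustrative only and not the actual eval_lm_quick.py (the helper name chunked_nll and the loss accounting are made up for the example).

import torch
import torch.nn.functional as F
from cache import l2_compress

def chunked_nll(model, input_ids, chunk_size=3000, keep_ratio=0.98, prune_after=1000):
    # input_ids: [1, total_len]; returns average negative log-likelihood per token
    past, total_nll, total_tokens = None, 0.0, 0
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        with torch.no_grad():
            out = model(chunk, past_key_values=past, use_cache=True)
        logits, targets = out.logits[:, :-1], chunk[:, 1:]   # within-chunk next-token loss
        total_nll += F.cross_entropy(logits.reshape(-1, logits.size(-1)).float(),
                                     targets.reshape(-1), reduction="sum").item()
        total_tokens += targets.numel()
        # Prune the cache before processing the next chunk
        past = l2_compress(out.past_key_values, keep_ratio=keep_ratio,
                           prune_after=prune_after, skip_layers=[0, 1])
    return total_nll / total_tokens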

Reproducing Experiments

For full experiments across various tasks, please refer to:

  • Language Modeling: eval_lm.py
  • Needle-in-a-Haystack Retrieval: eval_needle.py
  • Passkey Retrieval: eval_passkey.py

Each script includes command-line arguments to adjust parameters and model configurations.

Citation

If you find this code or our work helpful, please consider citing:

@inproceedings{devoto2024simpleeffective,
    title={A Simple and Effective {$L_2$} Norm-Based Strategy for {KV} Cache Compression},
    author={Devoto, Alessio and Zhao, Yu and Scardapane, Simone and Minervini, Pasquale},
    booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year={2024},
    url={https://arxiv.org/abs/2406.11430}
}

Contact

For more information, feel free to reach out to Alessio Devoto or Yu Zhao!
