Xiaohuan Pei¹, Tao Huang¹, Chang Xu¹
¹ University of Sydney
We propose decomposing attention scores into intra-modality attention (within the same modality) and inter-modality attention (across modalities), enabling more precise KV cache pruning by independently managing these distinct attention types. Additionally, we introduce an n-softmax function to counteract distribution shifts caused by pruning, preserving the original smoothness of attention scores and ensuring stable performance. Our final training-free method, Cross-Self Pruning (CSP), achieves competitive performance compared to models with full KV caches while significantly outperforming previous pruning methods. Extensive evaluations on MileBench, a benchmark encompassing 29 multimodal datasets, demonstrate CSP's effectiveness, achieving up to a 41% performance improvement on challenging tasks like conversational embodied dialogue while reducing the KV cache budget by 13.6%.
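As a rough illustration of the decomposition idea only (not the repository's implementation), the sketch below scores cached KV positions separately by cross-modal and intra-modal attention mass and keeps an independent top-k budget for each group. All function and variable names are hypothetical, and the proposed n-softmax re-normalization is omitted.

```python
# Conceptual sketch of modality-decomposed KV pruning; names are hypothetical
# and this is NOT the authors' implementation (the n-softmax step is omitted).
import torch

def prune_kv_by_modality(attn, key_is_vision, vision_budget, text_budget):
    """Select KV positions to keep, with a separate budget per key modality.

    attn:           [num_heads, q_len, kv_len] attention weights of the
                    current decoding window (queries assumed to be text tokens).
    key_is_vision:  [kv_len] bool tensor, True where the cached key/value
                    comes from a vision token, False for text tokens.
    Returns a 1-D tensor of KV indices to keep.
    """
    # Aggregate the attention mass each cached position receives.
    scores = attn.mean(dim=(0, 1))                          # [kv_len]

    # Inter-modality scores: text queries attending to vision keys.
    inter = scores.masked_fill(~key_is_vision, float("-inf"))
    # Intra-modality scores: text queries attending to text keys.
    intra = scores.masked_fill(key_is_vision, float("-inf"))

    k_vis = min(vision_budget, int(key_is_vision.sum()))
    k_txt = min(text_budget, int((~key_is_vision).sum()))

    keep = torch.cat([inter.topk(k_vis).indices, intra.topk(k_txt).indices])
    return keep.sort().values
```

In the actual method the intra/inter split is applied to the attention scores themselves and the retained entries are re-normalized with the proposed n-softmax; the sketch above only illustrates the per-modality budgeting.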
The environment setup is consistent with MileBench and LOOK-M:
conda create -n CSP
conda activate CSP
pip install -r requirements.txt
bash ./scripts/new_eval.sh
To view the results clearly, you can run
python eval_score.py
The generated eval_score.md reports the scores for each dataset.
Our experiments were conducted with LLaVA-v1.5-7B on RTX 4090 GPUs (flash-attn 2.4.3.post1) and LLaVA-v1.5-13B on A100 GPUs (flash-attn 2.6.3).
Our code structure is based on MileBench [code] and LOOK-M [code]. Many thanks for their excellent work.
@article{pei2024cross,
  title={Cross-Self KV Cache Pruning for Efficient Vision-Language Inference},
  author={Pei, Xiaohuan and Huang, Tao and Xu, Chang},
  journal={arXiv preprint arXiv:2412.04652},
  year={2024}
}