Xiaohuan Pei¹, Tao Huang¹, Chang Xu¹
¹ University of Sydney
We propose decomposing attention scores into intra-modality attention (within the same modality) and inter-modality attention (across modalities), enabling more precise KV cache pruning by independently managing these distinct attention types. Additionally, we introduce an n-softmax function to counteract distribution shifts caused by pruning, preserving the original smoothness of attention scores and ensuring stable performance. Our final training-free method, Cross-Self Pruning (CSP), achieves competitive performance compared to models with full KV caches while significantly outperforming previous pruning methods. Extensive evaluations on MileBench, a benchmark encompassing 29 multimodal datasets, demonstrate CSP's effectiveness, achieving up to a 41% performance improvement on challenging tasks like conversational embodied dialogue while reducing the KV cache budget by 13.6%.
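As a rough illustration of the decomposition idea only (not the repository's implementation), the sketch below scores cached KV positions separately by cross-modal and intra-modal attention mass and keeps an independent top-k budget for each group. All function and variable names are hypothetical, and the proposed n-softmax re-normalization is omitted.

```python
# Conceptual sketch of modality-decomposed KV pruning; names are hypothetical
# and this is NOT the authors' implementation (the n-softmax step is omitted).
import torch

def prune_kv_by_modality(attn, key_is_vision, vision_budget, text_budget):
    """Select KV positions to keep, with a separate budget per key modality.

    attn:           [num_heads, q_len, kv_len] attention weights of the
                    current decoding window (queries assumed to be text tokens).
    key_is_vision:  [kv_len] bool tensor, True where the cached key/value
                    comes from a vision token, False for text tokens.
    Returns a 1-D tensor of KV indices to keep.
    """
    # Aggregate the attention mass each cached position receives.
    scores = attn.mean(dim=(0, 1))                          # [kv_len]

    # Inter-modality scores: text queries attending to vision keys.
    inter = scores.masked_fill(~key_is_vision, float("-inf"))
    # Intra-modality scores: text queries attending to text keys.
    intra = scores.masked_fill(key_is_vision, float("-inf"))

    k_vis = min(vision_budget, int(key_is_vision.sum()))
    k_txt = min(text_budget, int((~key_is_vision).sum()))

    keep = torch.cat([inter.topk(k_vis).indices, intra.topk(k_txt).indices])
    return keep.sort().values
```

In the actual method the intra/inter split is applied to the attention scores themselves and the retained entries are re-normalized with the proposed n-softmax; the sketch above only illustrates the per-modality budgeting.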
The environment setup is consistent with MileBench and LOOK-M:
conda create -n CSP
conda activate CSP
pip install -r requirements.txt
bash ./scripts/new_eval.sh
To view the results clearly, you can run
python eval_score.py
The generated eval_score.md reports the scores for each dataset.
Our experiments were conducted with LLaVA-v1.5-7B on RTX 4090 GPUs (flash-attn 2.4.3.post1) and LLaVA-v1.5-13B on A100 GPUs (flash-attn 2.6.3).
Our code structure is based on MileBench [code] and LOOK-M [code]. Many thanks for their excellent work.
@article{pei2024cross,
  title={Cross-Self KV Cache Pruning for Efficient Vision-Language Inference},
  author={Pei, Xiaohuan and Huang, Tao and Xu, Chang},
  journal={arXiv preprint arXiv:2412.04652},
  year={2024}
}