The official repository for the paper "PruneVid: Visual Token Pruning for Efficient Video Large Language Models".
Xiaohu Huang, Hao Zhou, Kai Han
We present PruneVid, a training-free visual token pruning method that enhances efficiency in multi-modal video understanding. By merging spatial-temporal tokens to reduce video redundancy and leveraging attention mechanisms within LLMs to retain only the visual tokens relevant to questions, PruneVid ensures high performance while reducing computational overhead.
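To make the second idea concrete, below is a minimal, self-contained sketch of question-guided visual token selection. It is an illustration only, not PruneVid's actual implementation: the attention layout, the `keep_ratio` argument, and the function `keep_question_relevant_tokens` are all assumptions made for the example.

```python
# Illustrative sketch (assumed names/layout): keep the visual tokens that
# receive the most attention from question tokens in one LLM layer.
import torch

def keep_question_relevant_tokens(attn, num_visual, keep_ratio=0.2):
    # attn: (num_heads, seq_len, seq_len) attention weights, where positions
    # [0, num_visual) are visual tokens and the remainder are question tokens.
    text_to_visual = attn[:, num_visual:, :num_visual].mean(dim=(0, 1))  # (num_visual,)
    k = max(1, int(keep_ratio * num_visual))
    # Indices of the visual tokens to retain, kept in their original order.
    return text_to_visual.topk(k).indices.sort().values

# Toy example: 8 heads, 96 visual tokens followed by 32 question tokens.
attn = torch.rand(8, 128, 128).softmax(dim=-1)
print(keep_question_relevant_tokens(attn, num_visual=96, keep_ratio=0.2))
```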
- Code release of PruneVid with PLLaVA.
- Code release of PruneVid with LLaVA-OneVision.
- Code release of PruneVid with ST-LLM.
PruneVid is released under the CC BY-NC-SA 4.0 license.
We conduct experiments on three video LLMs (PLLaVA, ST-LLM, and LLaVA-OneVision) across four benchmarks: MVBench, VideoMME, EgoSchema, and VideoChatGPT-Bench (VCG-Bench). For VCG-Bench, the table reports the five sub-scores (TU, CU, CO, DO, CI) and their average (Avg).
Method | Retained Ratio | FLOPs (×) | MVBench | VideoMME | EgoSchema Subset / Fullset | TU | CU | CO | DO | CI | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|
PLLaVA | 100.0% | 1.00× | 46.6 | 44.4 | 47.8 / 42.6 | 2.33 | 3.62 | 2.93 | 2.86 | 3.21 | 2.99 |
PLLaVA w/ FastV | 30.0% | 0.33× | 46.1 | 43.6 | 46.2 / 41.0 | 2.38 | 3.49 | 2.89 | 2.76 | 3.14 | 2.93 |
PLLaVA w/ Prumerge | 55.7% | 0.53× | 45.6 | 43.8 | 45.2 / 40.4 | 2.34 | 3.52 | 2.90 | 2.76 | 3.15 | 2.93 |
PLLaVA w/ Look-M | 20.0% | 1.00× | 46.6 | 44.3 | 47.0 / 42.3 | 2.28 | 3.41 | 2.75 | 2.65 | 3.00 | 2.82 |
PLLaVA w/ Ours | 16.2% | 0.23× | 47.6 | 45.3 | 49.0 / 42.6 | 2.44 | 3.51 | 2.99 | 2.78 | 3.20 | 2.98 |
ST-LLM | 100.0% | 1.00× | 54.9 | 42.0 | 56.2 / 45.6 | 2.46 | 3.46 | 2.66 | 2.63 | 3.08 | 2.86 |
ST-LLM w/ FastV | 30.0% | 0.37× | 42.9 | 34.5 | 48.0 / 38.5 | 2.01 | 2.23 | 1.55 | 1.94 | 1.69 | 1.88 |
ST-LLM w/ Look-M | 20.0% | 1.00× | 54.0 | 40.6 | 54.0 / 44.5 | 2.35 | 3.41 | 2.60 | 2.51 | 3.01 | 2.78 |
ST-LLM w/ Ours | 15.1% | 0.26× | 54.3 | 41.4 | 54.6 / 44.7 | 2.40 | 3.43 | 2.63 | 2.60 | 3.04 | 2.82 |
LLaVA-OneVision | 100.0% | 1.00× | 58.0 | 58.2 | 62.0 / 60.0 | 2.75 | 3.70 | 3.39 | 2.97 | 3.50 | 3.26 |
LLaVA-OneVision w/ FastV | 30.0% | 0.30× | 57.2 | 57.6 | 62.6 / 60.0 | 2.65 | 3.61 | 3.28 | 2.85 | 3.39 | 3.16 |
LLaVA-OneVision w/ Prumerge | 55.2% | 0.49× | 52.9 | 56.7 | 62.2 / 60.0 | 2.72 | 3.64 | 3.32 | 2.94 | 3.44 | 3.21 |
LLaVA-OneVision w/ Look-M | 20.0% | 1.00× | 57.0 | 58.0 | 62.0 / 59.8 | 2.71 | 3.70 | 3.29 | 2.89 | 3.44 | 3.21 |
LLaVA-OneVision w/ Ours | 17.0% | 0.20× | 57.5 | 58.6 | 62.6 / 59.5 | 2.73 | 3.72 | 3.28 | 2.94 | 3.51 | 3.24 |
All four benchmarks can be downloaded from the Hugging Face website: MVBench, VideoMME, EgoSchema, and VideoChatGPT-Bench.
After downloading the datasets, please put them into the DATAS folder and organize the source videos and annotations in the following format:
DATAS/
├── ego_schema/
│ ├── json/
│ └── videos/
├── MVBench/
│ ├── json/
│ └── video/
├── VCGBench/
│ ├── Videos/
│ └── Zero_Shot_QA/
└── Video-MME/
├── data/
└── json/
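As an optional sanity check before running evaluation, the small helper below (our own snippet, not part of the repository) verifies that the DATAS layout above is in place; adjust the paths if your structure differs.

```python
# Verify the expected DATAS folder layout described above.
import os

expected = [
    "DATAS/ego_schema/json", "DATAS/ego_schema/videos",
    "DATAS/MVBench/json", "DATAS/MVBench/video",
    "DATAS/VCGBench/Videos", "DATAS/VCGBench/Zero_Shot_QA",
    "DATAS/Video-MME/data", "DATAS/Video-MME/json",
]
missing = [p for p in expected if not os.path.isdir(p)]
print("All benchmark folders found." if not missing else f"Missing folders: {missing}")
```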
The pretrained models can be found in their respective repositories: PLLaVA, ST-LLM, and LLaVA-OneVision.
After downloading the models, please put them into the MODELS folder:
MODELS/
├── pllava-7b/
We follow the environment installation guidelines of PLLaVA.
- First of all, the environment setup below is for Python 3.10. If you choose to use conda, we recommend creating the virtual environment with:
conda create -n pllava python=3.10
- Then, install PyTorch from the official website. The code runs with torch 2.2.1 (cu118 or cu122). Select the version that suits your driver version.
torch 2.2.1+cu118
torchaudio 2.2.1+cu118
torchvision 0.17.1+cu118
If your driver version is higher than cu121, you can try installing with the following command:
pip install -r requirements.txt
Otherwise, you would need to install a torch for your server first, then install the other packages:
pip install -r requirements.torch.txt # choose the torch requirements for your setup (this file is for cu11), or install torch directly following the official website
pip install -r requirements.no_torch.txt # then install the remaining (non-torch) packages
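Before moving on, a quick check (our own snippet) can confirm that the installed torch build matches your CUDA driver:

```python
# Sanity-check the PyTorch installation and GPU visibility.
import torch

print("torch version:", torch.__version__)          # e.g. 2.2.1+cu118
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```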
As PruneVid is a training-free method, we can directly apply it to the pre-trained models.
The script for evaluating model performance is provided in scripts/eval.sh. Below is the script for evaluating performance on MVBench, where you can adjust the hyper-parameters as needed. The default settings are the ones used in our paper.
lora_alpha=14
selected_layers=(10)
alphas=(0.4)
taus=(0.8)
temporal_segment_ratios=(0.25)
cluster_ratios=(0.5)
for alpha in "${alphas[@]}"; do
for selected_layer in "${selected_layers[@]}"; do
for tau in "${taus[@]}"; do
for temporal_segment_ratio in "${temporal_segment_ratios[@]}"; do
for cluster_ratio in "${cluster_ratios[@]}"; do
# run the evaluation command
SAVE_DIR=test_results/pllava-7b-lora${lora_alpha}-threshold${tau}-layer${selected_layer}-alpha${alpha}-temporal-segment-ratio-${temporal_segment_ratio}-cluster-ratio-${cluster_ratio}
mkdir -p "${SAVE_DIR}"
conv_mode=eval_mvbench
python -m tasks.eval.mvbench.pllava_eval_mvbench \
--pretrained_model_name_or_path ${model_dir} \
--save_path ${SAVE_DIR}/mvbench \
--num_frames ${num_frames} \
--use_lora \
--lora_alpha ${lora_alpha} \
--top_p 1.0 \
--temperature 1.0 \
--weight_dir ${weight_dir} \
--pooling_shape 16-12-12 \
--conv_mode ${conv_mode} \
--selected_layer ${selected_layer} \
--alpha ${alpha} \
--tau ${tau} \
--temporal_segment_ratio ${temporal_segment_ratio} \
--cluster_ratio ${cluster_ratio}
done
done
done
done
done
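Note that `model_dir`, `weight_dir`, and `num_frames` are referenced but not set in the snippet above; they are expected to be defined earlier in scripts/eval.sh (or exported in your shell) before launching the evaluation with `bash scripts/eval.sh`. Results are written under the `test_results/` directory named after the chosen hyper-parameters.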
As for EgoSchema, which requires an external service to evaluate model performance, we run evaluate_egoschema_result.py for evaluation. Before executing the file, change the root_dir variable to point to your own folder.
python evaluate_egoschema_result.py
This repository is built upon PLLaVA, ST-LLM, and LLaVA-OneVision. Thanks to the authors for those well-organized codebases.
@inproceedings{
huang2024prunevid,
title={PruneVid: Visual Token Pruning for Efficient Video Large Language Models},
author={Xiaohu Huang and Hao Zhou and Kai Han},
booktitle={arXiv preprint},
year={2024}
}