The official repository for the paper "PruneVid: Visual Token Pruning for Efficient Video Large Language Models".
Xiaohu Huang, Hao Zhou, Kai Han
We present PruneVid, a training-free visual token pruning method that enhances efficiency in multi-modal video understanding. By merging spatial-temporal tokens to reduce video redundancy and leveraging attention mechanisms within LLMs to retain only the visual tokens relevant to questions, PruneVid ensures high performance while reducing computational overhead.
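To make the second idea concrete, below is a minimal, self-contained sketch of question-guided visual token selection. It is an illustration only, not PruneVid's actual implementation: the attention layout, the `keep_ratio` argument, and the function `keep_question_relevant_tokens` are all assumptions made for the example.

```python
# Illustrative sketch (assumed names/layout): keep the visual tokens that
# receive the most attention from question tokens in one LLM layer.
import torch

def keep_question_relevant_tokens(attn, num_visual, keep_ratio=0.2):
    # attn: (num_heads, seq_len, seq_len) attention weights, where positions
    # [0, num_visual) are visual tokens and the remainder are question tokens.
    text_to_visual = attn[:, num_visual:, :num_visual].mean(dim=(0, 1))  # (num_visual,)
    k = max(1, int(keep_ratio * num_visual))
    # Indices of the visual tokens to retain, kept in their original order.
    return text_to_visual.topk(k).indices.sort().values

# Toy example: 8 heads, 96 visual tokens followed by 32 question tokens.
attn = torch.rand(8, 128, 128).softmax(dim=-1)
print(keep_question_relevant_tokens(attn, num_visual=96, keep_ratio=0.2))
```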
- Code release of PruneVid with PLLaVA.
- Code release of PruneVid with LLaVA-OneVision.
- Code release of PruneVid with ST-LLM.
PruneVid is released under the CC BY-NC-SA 4.0 license.
We conduct experiments on three video LLMs (PLLaVA, ST-LLM, and LLaVA-OneVision) across four benchmarks: MVBench, VideoMME, EgoSchema, and VideoChatGPT-Bench (VCG-Bench). For VCG-Bench, the table reports the five sub-scores (TU, CU, CO, DO, CI) and their average (Avg).
Method | Retained Ratio | FLOPs (×) | MVBench | VideoMME | EgoSchema Subset / Fullset | TU | CU | CO | DO | CI | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|
PLLaVA | 100.0% | 1.00× | 46.6 | 44.4 | 47.8 / 42.6 | 2.33 | 3.62 | 2.93 | 2.86 | 3.21 | 2.99 |
PLLaVA w/ FastV | 30.0% | 0.33× | 46.1 | 43.6 | 46.2 / 41.0 | 2.38 | 3.49 | 2.89 | 2.76 | 3.14 | 2.93 |
PLLaVA w/ Prumerge | 55.7% | 0.53× | 45.6 | 43.8 | 45.2 / 40.4 | 2.34 | 3.52 | 2.90 | 2.76 | 3.15 | 2.93 |
PLLaVA w/ Look-M | 20.0% | 1.00× | 46.6 | 44.3 | 47.0 / 42.3 | 2.28 | 3.41 | 2.75 | 2.65 | 3.00 | 2.82 |
PLLaVA w/ Ours | 16.2% | 0.23× | 47.6 | 45.3 | 49.0 / 42.6 | 2.44 | 3.51 | 2.99 | 2.78 | 3.20 | 2.98 |
ST-LLM | 100.0% | 1.00× | 54.9 | 42.0 | 56.2 / 45.6 | 2.46 | 3.46 | 2.66 | 2.63 | 3.08 | 2.86 |
ST-LLM w/ FastV | 30.0% | 0.37× | 42.9 | 34.5 | 48.0 / 38.5 | 2.01 | 2.23 | 1.55 | 1.94 | 1.69 | 1.88 |
ST-LLM w/ Look-M | 20.0% | 1.00× | 54.0 | 40.6 | 54.0 / 44.5 | 2.35 | 3.41 | 2.60 | 2.51 | 3.01 | 2.78 |
ST-LLM w/ Ours | 15.1% | 0.26× | 54.3 | 41.4 | 54.6 / 44.7 | 2.40 | 3.43 | 2.63 | 2.60 | 3.04 | 2.82 |
LLaVA-OneVision | 100.0% | 1.00× | 58.0 | 58.2 | 62.0 / 60.0 | 2.75 | 3.70 | 3.39 | 2.97 | 3.50 | 3.26 |
LLaVA-OneVision w/ FastV | 30.0% | 0.30× | 57.2 | 57.6 | 62.6 / 60.0 | 2.65 | 3.61 | 3.28 | 2.85 | 3.39 | 3.16 |
LLaVA-OneVision w/ Prumerge | 55.2% | 0.49× | 52.9 | 56.7 | 62.2 / 60.0 | 2.72 | 3.64 | 3.32 | 2.94 | 3.44 | 3.21 |
LLaVA-OneVision w/ Look-M | 20.0% | 1.00× | 57.0 | 58.0 | 62.0 / 59.8 | 2.71 | 3.70 | 3.29 | 2.89 | 3.44 | 3.21 |
LLaVA-OneVision w/ Ours | 17.0% | 0.20× | 57.5 | 58.6 | 62.6 / 59.5 | 2.73 | 3.72 | 3.28 | 2.94 | 3.51 | 3.24 |
All four benchmarks can be downloaded from the Hugging Face website: MVBench, VideoMME, EgoSchema, and VideoChatGPT-Bench.
After downloading the datasets, please put them into the DATAS folder and organize the source videos and annotations in the following format:
DATAS/
├── ego_schema/
│ ├── json/
│ └── videos/
├── MVBench/
│ ├── json/
│ └── video/
├── VCGBench/
│ ├── Videos/
│ └── Zero_Shot_QA/
└── Video-MME/
├── data/
└── json/
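As an optional sanity check before running evaluation, the small helper below (our own snippet, not part of the repository) verifies that the DATAS layout above is in place; adjust the paths if your structure differs.

```python
# Verify the expected DATAS folder layout described above.
import os

expected = [
    "DATAS/ego_schema/json", "DATAS/ego_schema/videos",
    "DATAS/MVBench/json", "DATAS/MVBench/video",
    "DATAS/VCGBench/Videos", "DATAS/VCGBench/Zero_Shot_QA",
    "DATAS/Video-MME/data", "DATAS/Video-MME/json",
]
missing = [p for p in expected if not os.path.isdir(p)]
print("All benchmark folders found." if not missing else f"Missing folders: {missing}")
```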
The pretrained models can be found in their respective repositories: PLLaVA, ST-LLM, and LLaVA-OneVision.
After downloading the models, please put them into the MODELS folder:
MODELS/
├── pllava-7b/
We follow the environment installation guidelines of PLLaVA.
- First of all, the environment setup below is for Python 3.10. If you choose to use conda, we recommend creating the virtual environment with:
conda create -n pllava python=3.10
- Then, install PyTorch from the official website. The code runs with torch 2.2.1 (cu118 or cu122). Select the version that suits your driver version.
torch 2.2.1+cu118
torchaudio 2.2.1+cu118
torchvision 0.17.1+cu118
If your driver version is higher than cu121, you can try installing with the following command:
pip install -r requirements.txt
Otherwise, you would need to install a torch for your server first, then install the other packages:
pip install -r requirements.torch.txt # choose the torch requirements for your setup (this file is for cu11), or install torch directly following the official website
pip install -r requirements.no_torch.txt # then install the remaining (non-torch) packages
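Before moving on, a quick check (our own snippet) can confirm that the installed torch build matches your CUDA driver:

```python
# Sanity-check the PyTorch installation and GPU visibility.
import torch

print("torch version:", torch.__version__)          # e.g. 2.2.1+cu118
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```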
As PruneVid is a training-free method, we can directly apply it to the pre-trained models.
The script for evaluating model performance is provided in scripts/eval.sh. Below is the script for evaluating performance on MVBench, where you can adjust the hyper-parameters as needed. The default settings are the ones used in our paper.
lora_alpha=14
selected_layers=(10)
alphas=(0.4)
taus=(0.8)
temporal_segment_ratios=(0.25)
cluster_ratios=(0.5)
for alpha in "${alphas[@]}"; do
for selected_layer in "${selected_layers[@]}"; do
for tau in "${taus[@]}"; do
for temporal_segment_ratio in "${temporal_segment_ratios[@]}"; do
for cluster_ratio in "${cluster_ratios[@]}"; do
# run the evaluation command
SAVE_DIR=test_results/pllava-7b-lora${lora_alpha}-threshold${tau}-layer${selected_layer}-alpha${alpha}-temporal-segment-ratio-${temporal_segment_ratio}-cluster-ratio-${cluster_ratio}
mkdir -p "${SAVE_DIR}"
conv_mode=eval_mvbench
python -m tasks.eval.mvbench.pllava_eval_mvbench \
--pretrained_model_name_or_path ${model_dir} \
--save_path ${SAVE_DIR}/mvbench \
--num_frames ${num_frames} \
--use_lora \
--lora_alpha ${lora_alpha} \
--top_p 1.0 \
--temperature 1.0 \
--weight_dir ${weight_dir} \
--pooling_shape 16-12-12 \
--conv_mode ${conv_mode} \
--selected_layer ${selected_layer} \
--alpha ${alpha} \
--tau ${tau} \
--temporal_segment_ratio ${temporal_segment_ratio} \
--cluster_ratio ${cluster_ratio}
done
done
done
done
done
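Note that `model_dir`, `weight_dir`, and `num_frames` are referenced but not set in the snippet above; they are expected to be defined earlier in scripts/eval.sh (or exported in your shell) before launching the evaluation with `bash scripts/eval.sh`. Results are written under the `test_results/` directory named after the chosen hyper-parameters.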
As for EgoSchema, which requires an external service to evaluate model performance, we run evaluate_egoschema_result.py for evaluation. Before executing the file, change the root_dir variable to point to your own folder.
python evaluate_egoschema_result.py
This repository is built upon PLLaVA, ST-LLM, and LLaVA-OneVision. Thanks to the authors for those well-organized codebases.
@inproceedings{
huang2024prunevid,
title={PruneVid: Visual Token Pruning for Efficient Video Large Language Models},
author={Xiaohu Huang and Hao Zhou and Kai Han},
booktitle={arXiv preprint},
year={2024}
}