[2024/12/09] 🔥 Our paper is out! We have released it on arXiv; please refer to the paper for more details.
Our method achieves the following rankings with only a 7B-size model:
Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module that transforms arbitrary well-trained image-based LLMs into video-LLMs (after training on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment, and representative information condensation from redundant video content. Guided by these principles, we propose the Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Blip-3, Molmo, Mipha, InternVL2, Qwen2-VL and Aquila, showcasing the high compatibility of LinVT. Extensive experiments illustrate the effectiveness of LinVT in multi-modal video understanding while preserving the original image-comprehension capabilities.
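To make the two design principles concrete, below is a minimal, illustrative PyTorch sketch (not the official LinVT implementation; the class name, token counts and dimensions are assumptions for illustration). It condenses the per-frame visual tokens of a video into a small, fixed number of tokens, where each output token is a weighted combination of the input tokens, so the condensed tokens stay in the same embedding space the image-LLM was originally aligned to.

```python
import torch
import torch.nn as nn

class LinearVideoTokenizerSketch(nn.Module):
    """Illustrative sketch only (not the official LinVT code):
    condense many frames' visual tokens into a few video tokens,
    each formed as a convex combination of the input tokens."""

    def __init__(self, dim: int, num_out_tokens: int = 64):
        super().__init__()
        # Learnable scoring queries, one per condensed output token.
        self.queries = nn.Parameter(torch.randn(num_out_tokens, dim) / dim ** 0.5)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames * tokens_per_frame, dim)
        # Score every visual token against every query ...
        scores = torch.einsum("qd,bnd->bqn", self.queries, frame_tokens)
        weights = scores.softmax(dim=-1)  # (batch, num_out_tokens, num_tokens)
        # ... and build each output token as a weighted sum of the inputs,
        # keeping outputs in the span of the original image tokens.
        return torch.einsum("bqn,bnd->bqd", weights, frame_tokens)

# Example: 32 frames x 196 tokens of dim 1024 -> 64 video tokens.
tokens = torch.randn(2, 32 * 196, 1024)
video_tokens = LinearVideoTokenizerSketch(dim=1024)(tokens)
print(video_tokens.shape)  # torch.Size([2, 64, 1024])
```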
Install the required packages:
conda create -n LinVT python=3.10.13
conda activate LinVT
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 -c pytorch -c conda-forge -y
pip install -r requirements.txt
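After installation, a quick sanity check such as the following (a suggested snippet, not part of the repository) can confirm that PyTorch and CUDA are available in the new environment:

```python
import torch

# Print the installed PyTorch version and whether a CUDA device is visible.
print(torch.__version__, torch.cuda.is_available())
```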
Coming soon.
Coming soon.
Coming soon.
If you find this repository useful, please consider giving it a star ⭐ and citing our work:
@article{gao2024linvt,
title={LinVT: Empower Your Image-level Large Language Model to Understand Videos},
author={Gao, Lishuai and Zhong, Yujie and Zeng, Yingsen and Tan, Haoxian and Li, Dengjie and Zhao, Zheng},
journal={arXiv preprint arXiv:2412.05185},
year={2024}
}