Zhende Song · Chenchen Wang · Jiamu Sheng · Chi Zhang† · Jiayuan Fan✦ · Tao Chen
( † Project Leader, ✦ Corresponding Author )
From Fudan University and Tencent PCG
- [2024.03.03]: Released inference code, evaluation code and model weights.
- [2024.03.13]: Released raw data; check it out here.
- [2024.07.02]: All generation code will be released after the work is accepted.
This repository contains the data generation code, training code and video evaluation code. It is built on LLaMA-VID. We plan to release our model, inference code and evaluation code first, followed by the rest.
For a better understanding of our training and evaluation process, we suggest working through the LLaMA-VID code first.
Please follow the instructions below to install the required packages. Our training process is mainly based on LLaMA-VID, and our short video evaluation is mainly based on quantitative_evaluation from Video-ChatGPT.
- Clone this repository
git clone https://github.com/Deaddawn/MovieLLM-code.git
- Clone LLaMA-VID repository
cd MovieLLM-code
git clone https://github.com/dvlab-research/LLaMA-VID.git
mv eval_movie_qa.py calculate.py LLaMA-VID
mv run_llamavid_movie_answer.py LLaMA-VID/llamavid/serve
- Install Package
conda create -n MovieLLM python=3.10 -y
conda activate MovieLLM
cd LLaMA-VID
pip install -e .
- Install additional packages for video training
pip install ninja
pip install flash-attn --no-build-isolation
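After installation, a quick import check can confirm that the packages are visible to the active environment. The following is a minimal sketch, not part of the repository; the module names `llamavid` and `flash_attn` are assumptions about what `pip install -e .` and `pip install flash-attn` expose for import.

```python
import importlib.util

# Assumed import names (not stated in this README):
#   llamavid   -> installed by `pip install -e .` inside LLaMA-VID
#   ninja      -> installed by `pip install ninja`
#   flash_attn -> installed by `pip install flash-attn`
REQUIRED_MODULES = ["llamavid", "ninja", "flash_attn"]


def missing_modules(names):
    """Return the module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

Run it inside the `MovieLLM` conda environment so that the check reflects the packages installed above.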
We provide our baseline model and the model trained on our generated dataset. Both models are trained with stage-3 of LLaMA-VID. For more details, please refer to LLaMA-VID-model.
Type | Max Token | Base LLM | Finetuning Data | Finetuning schedule | Download |
---|---|---|---|---|---|
Long video | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct + LongVideoQA | full_ft-1e | ckpt |
Long video | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct + LongVideoQA+MovieLLMQA | full_ft-1e | ckpt |
This section shows how to set up the data and model environment for LLaMA-VID. Again, we suggest first working through the original LLaMA-VID-preparation; this section follows it with some alterations.
We provide the raw dataset generated by our pipeline as well as the related training data based on LLaMA-VID.
The data generated by our pipeline consists of key frame images with their corresponding QAs and dialogues. You can download it here: MovieLLM-Data.
To run the LLaMA-VID stage-3 training process, you need processed video data and the corresponding QA pairs:
We first preprocess the raw data from MovieNet (used in the original LLaMA-VID paper) and the raw data generated by our pipeline.
For the MovieNet data, please first download the long video data from MovieNet and the shot detection results from here. Place the shot detection results under LLaMA-VID-Finetune/movienet/files before preprocessing, then follow preprocess-instruct to preprocess your data.
For our processed data, please download it here: MovieLLM-feat (coming soon).
For the corresponding QA pairs, please download them here:
Data file name | Size |
---|---|
long_videoqa_base.json | 240MB |
long_videoqa_ours.json | 245MB |
Please download the pretrained weights from the following links: EVA-ViT-G, QFormer-7b.
Please organize the video data, QA pairs and weights in the following structure:
LLaMA-VID
├── llamavid
├── scripts
├── work_dirs
│ ├── llama-vid
│ │ ├── llama-vid-7b-full-224-long-video-MovieLLM
│ │ ├── llama-vid-7b-full-224-long-video-baseline
├── model_zoo
│ ├── LAVIS
│ │ ├── eva_vit_g.pth
│ │ ├── instruct_blip_vicuna7b_trimmed.pth
├── data
│ ├── LLaMA-VID-Finetune
│ │ ├── long_videoqa_base.json
│ │ ├── long_videoqa_ours.json
│ │ ├── movienet
│ │ ├── story_feat
│ ├── LLaMA-VID-Eval
│ │ ├── MSRVTT-QA
│ │ ├── MSVD-QA
│ │ ├── video-chatgpt
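Before training or evaluation, it can help to verify that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper `missing_paths` is hypothetical, and the path list is copied from the structure shown above.

```python
from pathlib import Path

# Paths copied from the directory structure above, relative to LLaMA-VID.
EXPECTED_PATHS = [
    "work_dirs/llama-vid/llama-vid-7b-full-224-long-video-MovieLLM",
    "work_dirs/llama-vid/llama-vid-7b-full-224-long-video-baseline",
    "model_zoo/LAVIS/eva_vit_g.pth",
    "model_zoo/LAVIS/instruct_blip_vicuna7b_trimmed.pth",
    "data/LLaMA-VID-Finetune/long_videoqa_base.json",
    "data/LLaMA-VID-Finetune/long_videoqa_ours.json",
    "data/LLaMA-VID-Finetune/movienet",
    "data/LLaMA-VID-Finetune/story_feat",
    "data/LLaMA-VID-Eval/MSRVTT-QA",
    "data/LLaMA-VID-Eval/MSVD-QA",
    "data/LLaMA-VID-Eval/video-chatgpt",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from inside the LLaMA-VID directory.
    for rel in missing_paths("."):
        print("missing:", rel)
```

An empty output means every listed file and folder was found; it does not check the contents of those files.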
Coming soon.
For long-video inference on LLaMA-VID, please follow LLaMA-VID-Long-video-preprocess to process your video. Then run the following for long video inference:
cd LLaMA-VID
python llamavid/serve/run_llamavid_movie.py \
--model-path work_dirs/llama-vid/llama-vid-7b-full-224-long-video \
--video-file <path to your processed video file> \
--load-4bit
We evaluate on both short and long videos.
For short video evaluation, please download the evaluation data following Preparation and organize them as in Structure.
Model | MSVD-QA | MSVD-QA Score | MSRVTT-QA | MSRVTT-QA Score | Correctness | Detail | Context | Temporal | Consistency |
---|---|---|---|---|---|---|---|---|---|
Baseline | 49.3 | 3.169 | 43.5 | 2.865 | 1.94 | 2.431 | 2.701 | 1.585 | 1.699 |
Ours | 56.7 | 3.46 | 51.3 | 3.141 | 2.154 | 2.549 | 2.88 | 1.832 | 1.976 |
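To read the table as per-metric improvements, the differences between the two rows can be computed directly. The following is a small sketch that uses only the numbers from the table above; the helper name `absolute_gains` is illustrative.

```python
# Values copied from the results table above, in column order.
METRICS = [
    "MSVD-QA", "MSVD-QA Score", "MSRVTT-QA", "MSRVTT-QA Score",
    "Correctness", "Detail", "Context", "Temporal", "Consistency",
]
BASELINE = [49.3, 3.169, 43.5, 2.865, 1.94, 2.431, 2.701, 1.585, 1.699]
OURS = [56.7, 3.46, 51.3, 3.141, 2.154, 2.549, 2.88, 1.832, 1.976]


def absolute_gains(baseline, ours):
    """Return ours minus baseline for each metric, rounded to 3 decimals."""
    return [round(o - b, 3) for b, o in zip(baseline, ours)]


if __name__ == "__main__":
    for name, gain in zip(METRICS, absolute_gains(BASELINE, OURS)):
        print(f"{name}: {gain:+.3f}")
```

With these numbers, the accuracy gains are 7.4 points on MSVD-QA and 7.8 points on MSRVTT-QA, and every metric in the table improves over the baseline.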
For MSVD-QA evaluation:
bash scripts/video/eval/msvd_eval.sh
For MSRVTT-QA evaluation:
bash scripts/video/eval/msrvtt_eval.sh
To run long video evaluation, please first download the corresponding test data and QAs.
Then run the following to generate answers for the two models (our evaluation method compares the two answers against a reference answer):
python llamavid/serve/run_llamavid_movie_answer.py --model-path <your-model-path> --video-file <test-data-path> --output_path <path-for-saving-answers> --load-4bit --meta_path <QA-path>
Note that in the paper, we run the above for both the baseline model and the model trained on our data, so you should end up with two answer folders, one per model.
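Since the same command is run once per model, the two invocations can be generated from one template. The following is a minimal sketch, not part of the repository: it only builds the command lines using the flags shown above, and it assumes the two checkpoints live at the `work_dirs` paths from the directory structure and that answers are saved under `res/baseline` and `res/ours`.

```python
import shlex

SCRIPT = "llamavid/serve/run_llamavid_movie_answer.py"

# Assumed checkpoint locations, taken from the directory structure in this README.
MODELS = {
    "baseline": "work_dirs/llama-vid/llama-vid-7b-full-224-long-video-baseline",
    "ours": "work_dirs/llama-vid/llama-vid-7b-full-224-long-video-MovieLLM",
}


def build_answer_command(model_path, video_file, output_path, meta_path):
    """Build the answer-generation command as an argument list."""
    return [
        "python", SCRIPT,
        "--model-path", model_path,
        "--video-file", video_file,
        "--output_path", output_path,
        "--load-4bit",
        "--meta_path", meta_path,
    ]


def build_all(video_file, meta_path, res_root="res"):
    """Return one command per model, each writing to its own answer folder."""
    return {
        name: build_answer_command(path, video_file, f"{res_root}/{name}", meta_path)
        for name, path in MODELS.items()
    }


if __name__ == "__main__":
    # Placeholders: replace with your test-data path and QA path.
    for name, cmd in build_all("<test-data-path>", "<QA-path>").items():
        print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` from inside the LLaMA-VID directory to actually run the generation.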
You should now have three folders, for the ground truth, the predictions from model 1 and the predictions from model 2, laid out as follows:
res
|-- baseline
|-- ground_truth
|-- ours
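A quick sanity check of this layout can catch a missing or incomplete folder before calling the evaluation API. The following is a minimal sketch, not part of the repository; it assumes that the three folders should contain the same number of files, which this README does not state.

```python
from pathlib import Path

EXPECTED_FOLDERS = ["baseline", "ground_truth", "ours"]


def folder_file_counts(res_root):
    """Map each expected folder to its file count, or None if it is missing."""
    counts = {}
    for name in EXPECTED_FOLDERS:
        folder = Path(res_root) / name
        if folder.is_dir():
            counts[name] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            counts[name] = None
    return counts


def is_ready(counts):
    """True if all folders exist, are non-empty, and hold equal file counts.

    Equal counts are an assumption of this sketch, not a documented requirement.
    """
    values = list(counts.values())
    return None not in values and values[0] > 0 and len(set(values)) == 1


if __name__ == "__main__":
    counts = folder_file_counts("res")
    print(counts, "ready" if is_ready(counts) else "check folders")
```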
Then run:
python eval_movie_qa.py --output_dir ./test/compare_res --api_key <your-api-key> --gt_dir ./res/ground_truth --method_dir ./res/ours --base_dir ./res/baseline
Finally, run:
python calculate.py --path ./test/compare_res
If you find our work useful, please consider citing:
@misc{song2024moviellm,
title={MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies},
author={Zhende Song and Chenchen Wang and Jiamu Sheng and Chi Zhang and Gang Yu and Jiayuan Fan and Tao Chen},
year={2024},
eprint={2403.01422},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
We would like to thank the following repos for their great work:
- Our experiments are based on LLaMA-VID.
- We perform short video evaluation based on Video-ChatGPT.
- We build our pipeline based on textual-inversion.