MovieLLM:
Enhancing Long Video Understanding with AI-Generated Movies

Zhende Song · Chenchen Wang · Jiamu Sheng · Chi Zhang† · Jiayuan Fan✦ · Tao Chen
( † Project Leader, ✦ Corresponding Author )
From Fudan University and Tencent PCG

Paper PDF · Project Page

We propose MovieLLM, a novel framework designed to create synthetic, high-quality data for long videos. This framework leverages the power of GPT-4 and text-to-image models to generate detailed scripts and corresponding visuals.

Changelog

  • [2024.03.03]: Release inference code, evaluation code and model weights.
  • [2024.03.13]: Release raw data, check it out here
  • [2024.07.02]: All generation code will be released after the work is accepted.

Summary

This repository hosts the data generation code, training code, and video evaluation code for MovieLLM. It is built on LLaMA-VID. We plan to release the model, inference code, and evaluation code first, followed by the rest.
For a better understanding of our training and evaluation process, we suggest working through the LLaMA-VID code first.

Install

Please follow the instructions below to install the required packages. Our training process is mainly based on LLaMA-VID, and our short video evaluation process is mainly based on quantitative_evaluation from Video-ChatGPT.

  1. Clone this repository
git clone https://github.com/Deaddawn/MovieLLM-code.git
  2. Clone the LLaMA-VID repository and move our scripts into it
cd MovieLLM-code
git clone https://github.com/dvlab-research/LLaMA-VID.git
mv eval_movie_qa.py calculate.py LLaMA-VID
mv run_llamavid_movie_answer.py LLaMA-VID/llamavid/serve
  3. Install the package
conda create -n MovieLLM python=3.10 -y
conda activate MovieLLM
cd LLaMA-VID
pip install -e .
  4. Install additional packages for video training
pip install ninja
pip install flash-attn --no-build-isolation
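
After the steps above, a quick import check can confirm that the environment is usable. The following is a minimal sketch, not part of this repository: the module list is an assumption (`torch` is expected to come in as a dependency of LLaMA-VID, and `llamavid` is the package name we assume `pip install -e .` registers), so adjust it to your setup.

```python
import importlib.util

# Assumed module names: "flash_attn" is the import name of the flash-attn
# package; "torch" and "llamavid" are assumptions about what the editable
# install of LLaMA-VID provides.
REQUIRED = ["torch", "flash_attn", "llamavid"]


def check_modules(names):
    """Return {module_name: bool} telling whether each top-level module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules(REQUIRED).items():
        print(f"{name:12s} {'found' if found else 'MISSING'}")
```

Run it inside the activated MovieLLM conda environment; any module reported as MISSING points to the install step that needs to be repeated.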

Model

We provide our baseline model and a model trained on our generated dataset. Both models are trained with stage 3 of LLaMA-VID. For more detailed information, please refer to LLaMA-VID-model.

| Type | Max Token | Base LLM | Finetuning Data | Finetuning schedule | Download |
|---|---|---|---|---|---|
| Long video | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct + LongVideoQA | full_ft-1e | ckpt |
| Long video | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct + LongVideoQA + MovieLLMQA | full_ft-1e | ckpt |

Preparation

This section shows how to set up the data and model environment for LLaMA-VID. Again, we suggest working through the original LLaMA-VID-preparation first; this section follows it with some alterations.

Dataset

We provide the raw dataset generated by our pipeline, as well as the related training data based on LLaMA-VID.

Our Raw Data

Data generated by our pipeline consists of key frame images with corresponding QAs and dialogues. You can download it here: MovieLLM-Data

Training Data

To run the training process on LLaMA-VID stage 3, processed video data and the corresponding QA pairs are needed:

(1) Processed Video Data

We first preprocess the raw data from MovieNet (used in the original LLaMA-VID paper) and the raw data generated by our pipeline.

To preprocess the MovieNet data, first download the long video data from MovieNet and the shot detection results from here. Place the shot detection results under LLaMA-VID-Finetune/movienet/files before preprocessing. Then follow preprocess-instruct to preprocess your data.

For our processed data, please download it here: MovieLLM-feat (coming soon).

(2) Corresponding QA Pairs

For the corresponding QA pairs, please download them here:

| Data file name | Size |
|---|---|
| long_videoqa_base.json | 240MB |
| long_videoqa_ours.json | 245MB |
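
Once downloaded, it can help to look at a QA file before training. The sketch below is only an illustration: it makes no assumption about the schema of these files and simply reports the top-level JSON type, the number of entries, and (if the file is a list of objects) the keys of the first record. The helper name `summarize_json` is hypothetical.

```python
import json


def summarize_json(path):
    """Load a JSON file and describe its top-level structure without assuming a schema."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    summary = {
        "type": type(data).__name__,
        "length": len(data) if isinstance(data, (list, dict)) else None,
    }
    # Only peek at field names when the file is a non-empty list of objects.
    if isinstance(data, list) and data and isinstance(data[0], dict):
        summary["first_record_keys"] = sorted(data[0].keys())
    return summary
```

For example, `summarize_json("data/LLaMA-VID-Finetune/long_videoqa_ours.json")` would load the whole file into memory, which should be manageable at the sizes listed above.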

Pretrained Weights

Please download the pretrained weights from the following links: EVA-ViT-G, QFormer-7b.

Structure

Please organize the video data, QA pairs and weights in the following structure:

LLaMA-VID
├── llamavid
├── scripts
├── work_dirs
│   ├── llama-vid
│   │   ├── llama-vid-7b-full-224-long-video-MovieLLM
│   │   ├── llama-vid-7b-full-224-long-video-baseline
├── model_zoo
│   ├── LAVIS
│   │   ├── eva_vit_g.pth
│   │   ├── instruct_blip_vicuna7b_trimmed.pth
├── data
│   ├── LLaMA-VID-Finetune
│   │   ├── long_videoqa_base.json
│   │   ├── long_videoqa_ours.json
│   │   ├── movienet
│   │   ├── story_feat
│   ├── LLaMA-VID-Eval
│   │   ├── MSRVTT-QA
│   │   ├── MSVD-QA
│   │   ├── video-chatgpt
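
The layout above can be checked with a few lines of Python before starting a run. This is a minimal sketch rather than a script shipped with the repository; the relative paths are copied from the tree, and the helper name `missing_paths` is hypothetical.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED = [
    "work_dirs/llama-vid/llama-vid-7b-full-224-long-video-MovieLLM",
    "work_dirs/llama-vid/llama-vid-7b-full-224-long-video-baseline",
    "model_zoo/LAVIS/eva_vit_g.pth",
    "model_zoo/LAVIS/instruct_blip_vicuna7b_trimmed.pth",
    "data/LLaMA-VID-Finetune/long_videoqa_base.json",
    "data/LLaMA-VID-Finetune/long_videoqa_ours.json",
    "data/LLaMA-VID-Finetune/movienet",
    "data/LLaMA-VID-Finetune/story_feat",
    "data/LLaMA-VID-Eval/MSRVTT-QA",
    "data/LLaMA-VID-Eval/MSVD-QA",
    "data/LLaMA-VID-Eval/video-chatgpt",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected entries (files or folders) that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_paths("LLaMA-VID")` from the MovieLLM-code directory returns an empty list when everything is in place; otherwise it lists what still needs to be downloaded or moved.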

Pipeline

Coming soon.

Training

Coming soon.

Inference

For long-video inference on LLaMA-VID, please follow LLaMA-VID-Long-video-preprocess to process your video. Then run the following:

cd LLaMA-VID
python llamavid/serve/run_llamavid_movie.py \
    --model-path work_dirs/llama-vid/llama-vid-7b-full-224-long-video \
    --video-file <path to your processed video file> \
    --load-4bit

Evaluation

We evaluate on both short and long videos.

Short video

For short video evaluation, please download the evaluation data following Preparation and organize them as in Structure.

Results for short video

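| Model | MSVD-QA | MSVD-QA Score | MSRVTT-QA | MSRVTT-QA Score | Correctness | Detail | Context | Temporal | Consistency |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 49.3 | 3.169 | 43.5 | 2.865 | 1.94 | 2.431 | 2.701 | 1.585 | 1.699 |
| Ours | 56.7 | 3.46 | 51.3 | 3.141 | 2.154 | 2.549 | 2.88 | 1.832 | 1.976 |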
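
To read the table as improvements over the baseline, the per-column differences can be computed directly from the reported numbers. The snippet below is a small worked calculation using only the values in the table; the units of each column are not stated here, so the differences are in whatever units the corresponding column uses.

```python
# (baseline, ours) pairs copied from the short-video results table.
RESULTS = {
    "MSVD-QA": (49.3, 56.7),
    "MSVD-QA Score": (3.169, 3.46),
    "MSRVTT-QA": (43.5, 51.3),
    "MSRVTT-QA Score": (2.865, 3.141),
    "Correctness": (1.94, 2.154),
    "Detail": (2.431, 2.549),
    "Context": (2.701, 2.88),
    "Temporal": (1.585, 1.832),
    "Consistency": (1.699, 1.976),
}


def deltas(results):
    """Return ours minus baseline for every column, rounded to three decimals."""
    return {name: round(ours - base, 3) for name, (base, ours) in results.items()}


if __name__ == "__main__":
    for name, diff in deltas(RESULTS).items():
        print(f"{name:16s} {diff:+.3f}")
```

Every difference is positive, meaning the model trained with our data scores higher than the baseline in each column.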

For MSVD-QA evaluation:

bash scripts/video/eval/msvd_eval.sh

For MSRVTT-QA evaluation:

bash scripts/video/eval/msrvtt_eval.sh

Long video

To run long video evaluation, please first download the corresponding test data and QAs.

Then run the following to generate answers for the two models (our evaluation method compares the two models' answers against a reference answer):

python llamavid/serve/run_llamavid_movie_answer.py --model-path <your-model-path> --video-file <test-data-path> --output_path <path-for-saving-answers> --load-4bit --meta_path <QA-path>

Note that in the paper we run the above for both the baseline model and the model trained on our data, so you should end up with one folder of answers per model.
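
Running the answer-generation command once per model can be scripted. The following is a sketch under stated assumptions: the flags are copied from the command above, the two model paths are taken from the directory tree in Structure, and the mapping of those checkpoints to the output folders `res/baseline` and `res/ours` is our assumption. The helper names are hypothetical, and the script only builds and prints the commands.

```python
import shlex

# Assumed checkpoint locations, taken from the directory tree in Structure.
MODELS = {
    "baseline": "work_dirs/llama-vid/llama-vid-7b-full-224-long-video-baseline",
    "ours": "work_dirs/llama-vid/llama-vid-7b-full-224-long-video-MovieLLM",
}


def build_answer_cmd(model_path, video_file, output_path, meta_path):
    """Build the argv list for run_llamavid_movie_answer.py using the flags shown above."""
    return [
        "python", "llamavid/serve/run_llamavid_movie_answer.py",
        "--model-path", model_path,
        "--video-file", video_file,
        "--output_path", output_path,
        "--load-4bit",
        "--meta_path", meta_path,
    ]


def build_all(video_file, meta_path, res_root="res"):
    """Return one command per model, writing answers to <res_root>/<model name>."""
    return {
        name: build_answer_cmd(path, video_file, f"{res_root}/{name}", meta_path)
        for name, path in MODELS.items()
    }


if __name__ == "__main__":
    # Placeholders are printed as-is; replace them with real paths before running.
    for name, cmd in build_all("<test-data-path>", "<QA-path>").items():
        print(shlex.join(cmd))
```

To execute a command instead of printing it, pass the list to `subprocess.run(cmd, check=True)` from inside the LLaMA-VID directory.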

Together with the ground truth, you should now have three folders (ground truth, baseline predictions, and our model's predictions) organized as follows:

res
|-- baseline
|-- ground_truth
|-- ours
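
Before running the comparison, it may be worth checking that the three folders line up. The sketch below assumes that each folder holds one file per test item and that matching items share a file name across folders; this is our assumption about the layout, not something the evaluation script is documented to require. The helper name `unmatched_files` is hypothetical.

```python
from pathlib import Path


def unmatched_files(res_root, folders=("ground_truth", "baseline", "ours")):
    """For each folder, list the file names that are not present in all folders.

    Raises FileNotFoundError if one of the folders does not exist.
    """
    root = Path(res_root)
    names = {
        folder: {p.name for p in (root / folder).iterdir() if p.is_file()}
        for folder in folders
    }
    common = set.intersection(*names.values())
    return {folder: sorted(found - common) for folder, found in names.items()}
```

If `unmatched_files("res")` returns an empty list for every folder, the three folders contain the same set of file names.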

Then run

python eval_movie_qa.py --output_dir ./test/compare_res --api_key <your-api-key> --gt_dir ./res/ground_truth --method_dir ./res/ours --base_dir ./res/baseline

Finally

python calculate.py --path ./test/compare_res 

Results for long video

Results

Generation Results

Comparison Results

Citation

If you find our work useful, please consider citing:

@misc{song2024moviellm,
      title={MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies}, 
      author={Zhende Song and Chenchen Wang and Jiamu Sheng and Chi Zhang and Gang Yu and Jiayuan Fan and Tao Chen},
      year={2024},
      eprint={2403.01422},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

We would like to thank the following repos for their great work: