EchoReel is a method that augments existing video diffusion models so that they can:
1️⃣ leverage multiple reference videos to imitate a broader spectrum of actions and generate novel actions without fine-tuning;
2️⃣ distill effective, related visual motion features instead of replicating the content of the reference videos.
| Input text | Original VideoCrafter2 | + EchoReel |
| --- | --- | --- |
| "A man is studying in the library" | | |
| "A man is skiing" | | |
| "A man is running" | | |
| "Couple walking on the beach" | | |
| "A man is carving a stone statue" | | |
- [2024.4.21] Release pretrained weights
- [2024.3.18] Release training and inference code
- Release code of LVDM text-to-video with EchoReel
- Release training code
- Release pretrained weights
- Release image-to-video VideoCrafter code with EchoReel
Please prepare .json data in the following format:
[
{
"input_text": ...,
"gt_video_path": ...,
"reference_text": ...,
"reference_video_path": ...
},
...
]
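For reference, the snippet below is a minimal sketch that loads a dataset file in this format and checks that every sample has the required keys. The path `dataset/train.json` is only an assumed example; the key names follow the template above.

```python
import json

# Assumed example path; point this at your own .json file.
DATA_PATH = "dataset/train.json"

# Keys required by the template above.
REQUIRED_KEYS = {"input_text", "gt_video_path", "reference_text", "reference_video_path"}

with open(DATA_PATH, "r") as f:
    samples = json.load(f)

for i, sample in enumerate(samples):
    missing = REQUIRED_KEYS - sample.keys()
    if missing:
        raise ValueError(f"Sample {i} is missing keys: {missing}")

print(f"Loaded {len(samples)} valid samples.")
```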
Install Environment via Anaconda
conda create -n EchoReel python=3.10.13
conda activate EchoReel
pip install -r requirements.txt
Download the pretrained weights from our Hugging Face repository and place them in the 'checkpoint' folder. We also strongly recommend downloading the WebVid .csv file into the 'dataset' directory, which enables automatic reference video selection.
mkdir checkpoint
cd checkpoint
wget https://huggingface.co/cscrisp/EchoReel/resolve/main/checkpoint/checkpoint.pt
cd ..
mkdir dataset
cd dataset
wget http://www.robots.ox.ac.uk/~maxbain/webvid/results_10M_train.csv
cd ..
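As a quick sanity check, the illustrative sketch below confirms that both downloads can be read. It assumes checkpoint.pt is a standard PyTorch checkpoint and that the WebVid file is a plain CSV; neither assumption is specific to EchoReel's own loading code.

```python
import torch
import pandas as pd

# Assumes checkpoint.pt is a standard PyTorch checkpoint (load on CPU so no GPU is needed).
state = torch.load("checkpoint/checkpoint.pt", map_location="cpu")
print(f"Checkpoint loaded with {len(state)} top-level entries.")

# Peek at the WebVid metadata used for automatic reference video selection.
webvid = pd.read_csv("dataset/results_10M_train.csv", nrows=5)
print(webvid.columns.tolist())
```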
python gr.py
# use the original LVDM pretrained weights to initialize the model
wget -O models/t2v/model.ckpt https://huggingface.co/Yingqing/LVDM/resolve/main/lvdm_short/t2v.ckpt
bash train_EchoReel.sh
bash sample_EchoReel.sh
@article{Liu2024EchoReel,
  title={EchoReel: Enhancing Action Generation of Existing Video Diffusion Models},
  author={Jianzhi Liu and Junchen Zhu and Lianli Gao and Jingkuan Song},
  year={2024},
  eprint={2403.11535},
  archivePrefix={arXiv},
}
Our code is partially built on Latent Video Diffusion Models (LVDM). Thanks for their wonderful work!