Penghui Ruan1,2, Pichao Wang3, Divya Saxena1, Jiannong Cao1, Yuhui Shi2
1 The Hong Kong Polytechnic University, Hong Kong
2 Southern University of Science and Technology, Shenzhen
3 Amazon, United States
Accepted at NeurIPS 2024
Video comparisons: Lavie | VideoCrafter2 | ModelScope | DEMO (see the project page for the full galleries).
To save videos, FFmpeg is required. Install it using the following command:
sudo apt-get update && sudo apt-get install ffmpeg libsm6 libxext6 -y
Clone the repository:
git clone git@github.com:PR-Ryan/DEMO.git
Set up the Python environment:
conda create -n demo python=3.8
conda activate demo
pip install -r requirements.txt
Note: Our requirements.txt specifies torch==2.1.2, built against CUDA 12.1 (nvcc 12.1). You may adjust this according to your setup, but make sure your torch installation is compatible with the nvcc version installed on your system. For more details, refer to the PyTorch installation guide.
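If you are unsure whether your environment matches, a quick sanity check like the one below (a minimal sketch, not part of the repository) prints the versions that need to agree:

```python
# check_env.py -- minimal sanity check (not part of the repository)
import subprocess
import torch

print("torch version:       ", torch.__version__)   # e.g. 2.1.2
print("torch built for CUDA:", torch.version.cuda)  # e.g. 12.1
print("CUDA available:      ", torch.cuda.is_available())

# The nvcc release on PATH should match torch.version.cuda
try:
    print(subprocess.check_output(["nvcc", "--version"], text=True).splitlines()[-1])
except FileNotFoundError:
    print("nvcc not found on PATH")
```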
To download pretrained models, run the following command:
bash models/download.sh
Alternatively, you can download directly from Hugging Face and place the downloaded folder at models/modelscopet2v.
Download the DEMO checkpoints from Hugging Face and place the folder under models.
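If you prefer to script the download, a sketch along these lines using huggingface_hub should work; the repo ids below are placeholders, substitute the repositories linked above:

```python
# download_weights.py -- hedged sketch; the repo ids are placeholders,
# replace them with the repositories linked in this README.
from huggingface_hub import snapshot_download

# ModelScopeT2V base weights -> models/modelscopet2v
snapshot_download(repo_id="<modelscope-t2v-repo>", local_dir="models/modelscopet2v")

# DEMO checkpoints -> models/
snapshot_download(repo_id="<demo-checkpoint-repo>", local_dir="models")
```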
Create an inference prompt file at prompts/test_prompt.csv. Here’s an example format:
id,prompt
1,a fat dog is playing in the yard.
2,a fat car is parked by the road.
3,a fat balloon is floating in the air.
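The file can of course be written by hand; the short sketch below simply generates the same id,prompt CSV programmatically:

```python
# make_prompts.py -- writes prompts/test_prompt.csv in the id,prompt format shown above
import csv
from pathlib import Path

prompts = [
    "a fat dog is playing in the yard.",
    "a fat car is parked by the road.",
    "a fat balloon is floating in the air.",
]

Path("prompts").mkdir(exist_ok=True)
with open("prompts/test_prompt.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "prompt"])
    for i, prompt in enumerate(prompts, start=1):
        writer.writerow([i, prompt])
```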
To start inference, run:
bash scripts/inference_deepspeed.sh
By default, distributed inference is enabled and all available GPUs are used. To manually specify GPUs, add the --include flag to the DeepSpeed command:
--include="localhost:<your gpu ids>"
All configurations for inference are found in configs/t2v_inference_deepspeed.yaml. In this file, you can adjust the following settings:
- infer_dataset: Specify your dataset type and prompt path.
- batch_size: Set the batch size for diffusion sampling.
- decoder_bs: Define the batch size for VAE decoding.
- pretrained: Set checkpoint paths for pretrained models.
The DeepSpeed configurations for inference are located in ds_configs/ds_config_inference.json. You can also use a custom DeepSpeed configuration by modifying the deepspeed_config setting in configs/t2v_inference_deepspeed.yaml.
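If you would rather adjust these settings from a script than by hand, a hedged sketch like the following loads the YAML, tweaks a value, and writes it back; the key names come from the list above, and treating them as top-level keys is an assumption about the file layout:

```python
# tweak_inference_config.py -- sketch only; key names are from the list above,
# their placement in the YAML file is assumed.
import yaml

cfg_path = "configs/t2v_inference_deepspeed.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["batch_size"] = 1                                    # diffusion sampling batch size
cfg["decoder_bs"] = 1                                    # VAE decoding batch size
cfg["deepspeed_config"] = "ds_configs/ds_config_inference.json"

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```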
With our optimized inference code, the model can generate 16-frame videos at 256x256 resolution on an 8 GB GPU with a batch size of 1.
Follow the instructions to download the WebVid-10M dataset. We provide an example training dataset in data/webvid/train_sample.csv. You can manually download these sample videos and place them in data/webvid/videos for sample training.
If you prefer to use your own dataset, refer to tools/datasets/video_datasets.py to define your dataset and preprocessing steps.
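As a rough illustration only (the actual interface expected by tools/datasets/video_datasets.py may differ, and the class and column names here are assumptions), a custom dataset is typically a torch Dataset that yields decoded frames plus the prompt:

```python
# custom_video_dataset.py -- illustrative sketch; the real dataset classes live in
# tools/datasets/video_datasets.py and may expect a different interface.
import csv
from torch.utils.data import Dataset
from torchvision.io import read_video


class MyVideoTextDataset(Dataset):
    """Reads a CSV with columns: video_path, prompt (column names are assumed)."""

    def __init__(self, csv_path, num_frames=16):
        with open(csv_path) as f:
            self.items = list(csv.DictReader(f))
        self.num_frames = num_frames

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        frames, _, _ = read_video(item["video_path"], pts_unit="sec")  # (T, H, W, C), uint8
        frames = frames[: self.num_frames]                 # shorter clips return fewer frames
        frames = frames.permute(0, 3, 1, 2).float() / 127.5 - 1.0  # (T, C, H, W) in [-1, 1]
        return {"video": frames, "prompt": item["prompt"]}
```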
bash models/download.sh
Alternatively, you can download directly from Hugging Face and place the downloaded folder at models/modelscopet2v.
To train the model, run the following command:
bash scripts/train_deepspeed.sh
By default, distributed data parallel training is used, utilizing all available GPUs. If you want to manually specify the GPUs, add the --include flag to the DeepSpeed command:
--include="localhost:<gpu_ids>"
All training configurations are in the configs/t2v_train_deepspeed.yaml file. You can customize the following settings:
- train_dataset: Define your dataset type and provide the prompt path.
- pretrained: Specify the checkpoint paths for pretrained models.
The DeepSpeed configurations for training are located in ds_configs/ds_config_train.json. You can customize these settings or provide your own DeepSpeed configuration by modifying the deepspeed_config parameter in configs/t2v_train_deepspeed.yaml.
In ds_configs/ds_config_train.json, you can specify:
- train_micro_batch_size_per_gpu: The batch size for each GPU.
- gradient_accumulation_steps: Number of steps for gradient accumulation.
- zero_optimization: Configurations for DeepSpeed's ZeRO optimization. By default, we use stage 2 with optimizer offloading to the CPU, which may increase CPU memory usage. Disable this if you have limited CPU memory. If your GPUs have large memory, you can switch to stage 1 for faster convergence.
- optimizer: By default, we use DeepSpeed's highly optimized CPU Adam for faster training, which requires compiling with nvcc during the first run. You may need to set the CUDA_HOME and LD_LIBRARY_PATH environment variables. Alternatively, you can skip this by switching to another optimizer in ds_configs/ds_config_train.json. Refer to the DeepSpeed documentation for more information. A minimal configuration sketch follows this list.
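For orientation, the sketch below writes a DeepSpeed config containing the fields discussed above; the concrete values and the optimizer hyperparameters are assumptions, not the repository defaults:

```python
# write_ds_config.py -- hedged sketch of ds_configs/ds_config_train.json;
# values and optimizer hyperparameters are assumptions, not the shipped defaults.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 1,         # batch size per GPU
    "gradient_accumulation_steps": 4,            # effective batch = micro_bs * accum * n_gpus
    "zero_optimization": {
        "stage": 2,                              # switch to 1 if GPU memory allows
        "offload_optimizer": {"device": "cpu"},  # remove if CPU memory is limited
    },
    "optimizer": {
        "type": "Adam",                          # DeepSpeed uses its CPU Adam when offloading
        "params": {"lr": 5e-5, "betas": [0.9, 0.999], "weight_decay": 0.0},
    },
}

with open("ds_configs/ds_config_train.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```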
Note: Ensure that your nvcc version matches the version used to compile PyTorch. If it does not, you can install nvcc within your Conda environment and set CUDA_HOME and LD_LIBRARY_PATH to point to the Conda-installed nvcc. For more details, refer to the CUDA Installation Guide.
TensorBoard is enabled by default for monitoring the training process. To view the training progress, launch TensorBoard with:
tensorboard --logdir=tensorboard_log/demo
- Release model weights.
- Release inference and training code.
- Hugging Face demo.
- Gradio application.
Distributed under the MIT License. See LICENSE.txt for more information.
Penghui Ruan - penghui.ruan@connect.polyu.hk
Project Link: https://pr-ryan.github.io/DEMO-project/
This repository is largely based on VGen by Alibaba. We sincerely thank them for their contributions to the open-source community.
@misc{ruan2024enhancingmotiontexttovideogeneration,
title={Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning},
author={Penghui Ruan and Pichao Wang and Divya Saxena and Jiannong Cao and Yuhui Shi},
year={2024},
eprint={2410.24219},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.24219},
}