Skip to content

Latest commit

 

History

History
146 lines (111 loc) · 8.45 KB

README.md

File metadata and controls

146 lines (111 loc) · 8.45 KB

CogVideoX Diffusers Fine-tuning Guide

中文阅读

日本語で読む

If you're looking for the fine-tuning instructions for the SAT version, please check here. The dataset format for this version differs from the one used here.

Hardware Requirements

Model Training Type Distribution Strategy Mixed Precision Training Resolution (FxHxW) Hardware Requirements
cogvideox-t2v-2b lora (rank128) DDP fp16 49x480x720 16GB VRAM (NVIDIA 4080)
cogvideox-{t2v, i2v}-5b lora (rank128) DDP bf16 49x480x720 24GB VRAM (NVIDIA 4090)
cogvideox1.5-{t2v, i2v}-5b lora (rank128) DDP bf16 81x768x1360 35GB VRAM (NVIDIA A100)
cogvideox-t2v-2b sft DDP fp16 49x480x720 36GB VRAM (NVIDIA A100)
cogvideox-t2v-2b sft 1-GPU zero-2 + opt offload fp16 49x480x720 17GB VRAM (NVIDIA 4090)
cogvideox-t2v-2b sft 8-GPU zero-2 fp16 49x480x720 17GB VRAM (NVIDIA 4090)
cogvideox-t2v-2b sft 8-GPU zero-3 fp16 49x480x720 19GB VRAM (NVIDIA 4090)
cogvideox-t2v-2b sft 8-GPU zero-3 + opt and param offload bf16 49x480x720 14GB VRAM (NVIDIA 4080)
cogvideox-{t2v, i2v}-5b sft 1-GPU zero-2 + opt offload bf16 49x480x720 42GB VRAM (NVIDIA A100)
cogvideox-{t2v, i2v}-5b sft 8-GPU zero-2 bf16 49x480x720 42GB VRAM (NVIDIA 4090)
cogvideox-{t2v, i2v}-5b sft 8-GPU zero-3 bf16 49x480x720 43GB VRAM (NVIDIA 4090)
cogvideox-{t2v, i2v}-5b sft 8-GPU zero-3 + opt and param offload bf16 49x480x720 28GB VRAM (NVIDIA 5090)
cogvideox1.5-{t2v, i2v}-5b sft 1-GPU zero-2 + opt offload bf16 81x768x1360 56GB VRAM (NVIDIA A100)
cogvideox1.5-{t2v, i2v}-5b sft 8-GPU zero-2 bf16 81x768x1360 55GB VRAM (NVIDIA A100)
cogvideox1.5-{t2v, i2v}-5b sft 8-GPU zero-3 bf16 81x768x1360 55GB VRAM (NVIDIA A100)
cogvideox1.5-{t2v, i2v}-5b sft 8-GPU zero-3 + opt and param offload bf16 81x768x1360 40GB VRAM (NVIDIA A100)

Install Dependencies

Since the relevant code has not yet been merged into the official diffusers release, you need to fine-tune based on the diffusers branch. Follow the steps below to install the dependencies:

git clone https://github.com/huggingface/diffusers.git
cd diffusers # Now on the Main branch
pip install -e .

Prepare the Dataset

First, you need to prepare your dataset. Depending on your task type (T2V or I2V), the dataset format will vary slightly:

.
├── prompts.txt
├── videos
├── videos.txt
├── images     # (Optional) For I2V, if not provided, first frame will be extracted from video as reference
└── images.txt # (Optional) For I2V, if not provided, first frame will be extracted from video as reference

Where:

  • prompts.txt: Contains the prompts
  • videos/: Contains the .mp4 video files
  • videos.txt: Contains the list of video files in the videos/ directory
  • images/: (Optional) Contains the .png reference image files
  • images.txt: (Optional) Contains the list of reference image files

You can download a sample dataset ( T2V) Disney Steamboat Willie.

If you need to use a validation dataset during training, make sure to provide a validation dataset with the same format as the training dataset.

Running Scripts to Start Fine-tuning

Before starting training, please note the following resolution requirements:

  1. The number of frames must be a multiple of 8 plus 1 (i.e., 8N+1), such as 49, 81 ...
  2. Recommended video resolutions for each model:
    • CogVideoX: 480x720 (height x width)
    • CogVideoX1.5: 768x1360 (height x width)
  3. For samples (videos or images) that don't match the training resolution, the code will directly resize them. This may cause aspect ratio distortion and affect training results. It's recommended to preprocess your samples (e.g., using crop + resize to maintain aspect ratio) before training.

Important Note: To improve training efficiency, we automatically encode videos and cache the results on disk before training. If you modify the data after training, please delete the latent directory under the video directory to ensure the latest data is used.

LoRA

# Modify configuration parameters in train_ddp_t2v.sh
# Main parameters to modify:
# --output_dir: Output directory
# --data_root: Dataset root directory
# --caption_column: Path to prompt file
# --image_column: Optional for I2V, path to reference image file list (remove this parameter to use the first frame of video as image condition)
# --video_column: Path to video file list
# --train_resolution: Training resolution (frames x height x width)
# For other important parameters, please refer to the launch script

bash train_ddp_t2v.sh  # Text-to-Video (T2V) fine-tuning
bash train_ddp_i2v.sh  # Image-to-Video (I2V) fine-tuning

SFT

We provide several zero configuration templates in the configs/ directory. Please choose the appropriate training configuration based on your needs (configure the deepspeed_config_file option in accelerate_config.yaml).

# Parameters to configure are the same as LoRA training

bash train_zero_t2v.sh  # Text-to-Video (T2V) fine-tuning
bash train_zero_i2v.sh  # Image-to-Video (I2V) fine-tuning

In addition to setting the bash script parameters, you need to set the relevant training options in the zero configuration file and ensure the zero training configuration matches the parameters in the bash script, such as batch_size, gradient_accumulation_steps, mixed_precision. For details, please refer to the DeepSpeed official documentation

When using SFT training, please note:

  1. For SFT training, model offload is not used during validation, so the peak VRAM usage may exceed 24GB. For GPUs with less than 24GB VRAM, it's recommended to disable validation.

  2. Validation is slow when zero-3 is enabled, so it's recommended to disable validation when using zero-3.

Load the Fine-tuned Model

  • Please refer to cli_demo.py for instructions on how to load the fine-tuned model.

  • For SFT trained models, please first use the zero_to_fp32.py script in the checkpoint-*/ directory to merge the model weights

Best Practices

  • We included 70 training videos with a resolution of 200 x 480 x 720 (frames x height x width). Through frame skipping in the data preprocessing, we created two smaller datasets with 49 and 16 frames to speed up experiments. The maximum frame count recommended by the CogVideoX team is 49 frames. These 70 videos were divided into three groups: 10, 25, and 50 videos, with similar conceptual nature.
  • Videos with 25 or more frames work best for training new concepts and styles.
  • It's recommended to use an identifier token, which can be specified using --id_token, for better training results. This is similar to Dreambooth training, though regular fine-tuning without using this token will still work.
  • The original repository uses lora_alpha set to 1. We found that this value performed poorly in several runs, possibly due to differences in the model backend and training settings. Our recommendation is to set lora_alpha to be equal to the rank or rank // 2.
  • It's advised to use a rank of 64 or higher.