


StableAnimator

StableAnimator: High-Quality Identity-Preserving Human Image Animation
Shuyuan Tu1, Zhen Xing1, Xintong Han3, Zhi-Qi Cheng4, Qi Dai2, Chong Luo2, Zuxuan Wu1
[1Fudan University; 2Microsoft Research Asia; 3Huya Inc; 4Carnegie Mellon University]


Pose-driven human image animations generated by StableAnimator, demonstrating its ability to synthesize high-fidelity, identity-preserving videos. All animations are synthesized directly by StableAnimator without any face-related post-processing tools, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer.


Comparison results between StableAnimator and state-of-the-art (SOTA) human image animation models highlight the superior performance of StableAnimator in delivering high-fidelity, identity-preserving human image animation.

Overview

Model architecture: an overview of the StableAnimator framework.

Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference that strive for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors; the face embeddings are further refined by interacting with the image embeddings through a global content-aware Face Encoder. StableAnimator then introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.

News

  • [2024-12-13]:πŸ”₯ The training code and training tutorial are released! You can train/finetune your own StableAnimator on your own collected datasets! Other codes will be released very soon. Stay tuned!
  • [2024-12-10]:πŸ”₯ The gradio interface is released! Many thanks to @gluttony-10 for his contribution! Other codes will be released very soon. Stay tuned!
  • [2024-12-6]:πŸ”₯ All data preprocessing codes (human skeleton extraction and human face mask extraction) are released! The training code and detailed training tutorial will be released before 2024.12.13. Stay tuned!
  • [2024-12-4]:πŸ”₯ We are thrilled to release an interesting dance demo (πŸ”₯πŸ”₯APT DanceπŸ”₯πŸ”₯)! The generated video can be seen on YouTube and Bilibili.
  • [2024-11-28]:πŸ”₯ The data pre-processing codes (human skeleton extraction) are available! Other codes will be released very soon. Stay tuned!
  • [2024-11-26]:πŸ”₯ The project page, code, technical report and a basic model checkpoint are released. Further training codes, data pre-processing codes, the evaluation dataset and StableAnimator-pro will be released very soon. Stay tuned!

To-Do List

  • StableAnimator-basic
  • Inference Code
  • Evaluation Samples
  • Data Pre-Processing Code (Skeleton Extraction)
  • Data Pre-Processing Code (Human Face Mask Extraction)
  • Training Code
  • Evaluation Dataset
  • StableAnimator-pro
  • Inference Code with HJB-based Face Optimization

Quickstart

The basic version of the model checkpoint supports generating videos at 576x1024 or 512x512 resolution. If you run into out-of-memory issues, you can reduce the number of animated frames accordingly.

Environment setup

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install torch==2.5.1+cu124 xformers --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

Download weights

If you encounter connection issues with Hugging Face, you can utilize the mirror endpoint by setting the environment variable: export HF_ENDPOINT=https://hf-mirror.com. Please download weights manually as follows:

cd StableAnimator
git lfs install
git clone https://huggingface.co/FrancisRing/StableAnimator checkpoints

All the weights should be organized as shown below. The overall file structure of this project is:

StableAnimator/
β”œβ”€β”€ DWPose
β”œβ”€β”€ animation
β”œβ”€β”€ checkpoints
β”‚Β Β  β”œβ”€β”€ DWPose
β”‚Β Β  β”‚Β   β”œβ”€β”€ dw-ll_ucoco_384.onnx
β”‚Β Β  β”‚Β Β  └── yolox_l.onnx
β”‚Β Β  β”œβ”€β”€ Animation
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ pose_net.pth
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ face_encoder.pth
β”‚Β Β  β”‚Β Β  └── unet.pth
β”‚Β Β  β”œβ”€β”€ SVD
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ feature_extractor
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ image_encoder
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ scheduler
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ unet
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ vae
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ model_index.json
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ svd_xt.safetensors
β”‚Β Β  β”‚Β Β  └── svd_xt_image_decoder.safetensors
β”‚Β Β  └── inference.zip
β”œβ”€β”€ models
β”‚   └── antelopev2
β”‚       β”œβ”€β”€ 1k3d68.onnx
β”‚       β”œβ”€β”€ 2d106det.onnx
β”‚       β”œβ”€β”€ genderage.onnx
β”‚       β”œβ”€β”€ glintr100.onnx
β”‚       └── scrfd_10g_bnkps.onnx
β”œβ”€β”€ app.py
β”œβ”€β”€ command_basic_infer.sh
β”œβ”€β”€ inference_basic.py
β”œβ”€β”€ requirement.txt 

Notably, there is a bug in the automatic download process of Antelopev2, with the error details described as follows:

Traceback (most recent call last):
  File "/home/StableAnimator/inference_normal.py", line 243, in <module>
    face_model = FaceModel()
  File "/home/StableAnimator/animation/modules/face_model.py", line 11, in __init__
    self.app = FaceAnalysis(
  File "/opt/conda/lib/python3.10/site-packages/insightface/app/face_analysis.py", line 43, in __init__
    assert 'detection' in self.models
AssertionError

This issue is caused by an incorrect path: Antelopev2 is automatically downloaded into the models/antelopev2/antelopev2 directory, while the correct path is models/antelopev2. You can run the following commands to fix this:

cd StableAnimator
mv ./models/antelopev2/antelopev2 ./models/tmp
rm -rf ./models/antelopev2
mv ./models/tmp ./models/antelopev2
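
After moving the files, you can quickly verify that the five ONNX models sit directly under models/antelopev2:

ls ./models/antelopev2
# Expected output: 1k3d68.onnx  2d106det.onnx  genderage.onnx  glintr100.onnx  scrfd_10g_bnkps.onnx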

Evaluation Samples

The evaluation samples presented in the paper can be downloaded from OneDrive or inference.zip in checkpoints. Please download evaluation samples manually as follows:

cd StableAnimator
mkdir inference
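
If you downloaded inference.zip into checkpoints, one way to unpack it into the inference folder (a sketch, assuming the archive stores the case folders at its top level; adjust the target directory if the layout differs):

unzip checkpoints/inference.zip -d inference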

All the evaluation samples should be organized as follows:

inference/
β”œβ”€β”€ case-1
β”‚Β Β  β”œβ”€β”€ poses
β”‚Β Β  β”œβ”€β”€ faces
β”‚Β Β  └── reference.png
β”œβ”€β”€ case-2
β”‚Β Β  β”œβ”€β”€ poses
β”‚Β Β  β”œβ”€β”€ faces
β”‚Β Β  └── reference.png
β”œβ”€β”€ case-3
β”‚Β Β  β”œβ”€β”€ poses
β”‚Β Β  β”œβ”€β”€ faces
β”‚Β Β  └── reference.png

Human Skeleton Extraction

We leverage the pre-trained DWPose to extract the human skeletons. In the initialization of DWPose, the pretrained weights should be configured in /DWPose/dwpose_utils/wholebody.py:

onnx_det = 'path/checkpoints/DWPose/yolox_l.onnx'
onnx_pose = 'path/checkpoints/DWPose/dw-ll_ucoco_384.onnx'

Given the target image folder containing multiple .png files, you can use the following command to obtain the corresponding human skeleton images:

python DWPose/skeleton_extraction.py --target_image_folder_path="path/test/target_images" --ref_image_path="path/test/reference.png" --poses_folder_path="path/test/poses"

It is worth noting that the .png files in the target image folder are named in the format frame_i.png, such as frame_0.png, frame_1.png, and so on. --ref_image_path refers to the path of the given reference image. The obtained human skeleton images are saved in path/test/poses. It is particularly important that the target skeleton images are aligned with the reference image in terms of body shape.

If you only have the target MP4 file (target.mp4), we recommend using ffmpeg to convert the MP4 file into individual frames (.png files) without any quality loss.

ffmpeg -i target.mp4 -q:v 1 -start_number 0 path/test/target_images/frame_%d.png

The obtained frames are saved in path/test/target_images.

Human Face Mask Extraction

Given the path to an image folder containing multiple RGB .png files, you can run the following command to extract the corresponding human face masks:

python face_mask_extraction.py --image_folder="path/StableAnimator/inference/your_case/target_images"

path/StableAnimator/inference/your_case/target_images contains multiple .png files. The obtained masks are saved in path/StableAnimator/inference/your_case/faces.

Model inference

A sample configuration for testing is provided in command_basic_infer.sh. You can easily modify its settings according to your needs.

bash command_basic_infer.sh

StableAnimator supports human image animation at two different resolution settings: 512x512 and 576x1024. The key options in command_basic_infer.sh are:

  β€’ --width and --height set the resolution of the animation.
  β€’ --output_dir refers to the save path of the generated animation.
  β€’ --validation_control_folder and --validation_image refer to the paths of the given pose sequence and the reference image, respectively.
  β€’ --pretrained_model_name_or_path is the path of the pretrained SVD.
  β€’ posenet_model_name_or_path, face_encoder_model_name_or_path, and unet_model_name_or_path refer to the paths of the pretrained StableAnimator weights.
  β€’ --decode_chunk_size: if you have enough GPU resources, you can increase its value (4=>8=>16) to improve the temporal smoothness of the animation.
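
For reference, here is a minimal sketch of the kind of invocation that command_basic_infer.sh wraps, assembled only from the options listed above (the shipped script is authoritative; exact flag spellings, paths, and additional arguments may differ):

python inference_basic.py \
  --pretrained_model_name_or_path="path/checkpoints/SVD" \
  --posenet_model_name_or_path="path/checkpoints/Animation/pose_net.pth" \
  --face_encoder_model_name_or_path="path/checkpoints/Animation/face_encoder.pth" \
  --unet_model_name_or_path="path/checkpoints/Animation/unet.pth" \
  --validation_image="path/inference/case-1/reference.png" \
  --validation_control_folder="path/inference/case-1/poses" \
  --output_dir="path/animation_results" \
  --width=512 \
  --height=512 \
  --decode_chunk_size=4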

Tips: if your GPU memory is limited, you can reduce the number of animated frames. This command generates two outputs: an animated_images folder of frames and animated_images.gif. If you want to obtain a high-quality MP4 file, we recommend running ffmpeg on animated_images as follows:

cd animated_images
ffmpeg -framerate 20 -i frame_%d.png -c:v libx264 -crf 10 -pix_fmt yuv420p /path/animation.mp4

"-framerate" refers to the fps setting. "-crf" indicates the quality of the generated MP4 file, with smaller values corresponding to higher quality. Additionally, you can also run the following command to launch a Gradio interface:

python app.py

Model Training

πŸ”₯If you are looking to train a conditioned Stable Video Diffusion (SVD) model, this training tutorial will also be helpful.πŸ”₯ The training dataset has to be organized as follows:

animation_data/
β”œβ”€β”€ rec
β”‚   β”œβ”€β”€ 00001
β”‚   β”‚   β”œβ”€β”€ images
β”‚   β”‚   β”‚   β”œβ”€β”€ frame_0.png
β”‚   β”‚   β”‚   β”œβ”€β”€ frame_1.png
β”‚   β”‚   β”‚   β”œβ”€β”€ frame_2.png
β”‚   β”‚   β”‚   └── ...
β”‚   β”‚   β”œβ”€β”€ faces
β”‚   β”‚   β”‚   β”œβ”€β”€ frame_0.png
β”‚   β”‚   β”‚   β”œβ”€β”€ frame_1.png
β”‚   β”‚   β”‚   β”œβ”€β”€ frame_2.png
β”‚   β”‚   β”‚   └── ...
β”‚   β”‚   └── poses
β”‚   β”‚       β”œβ”€β”€ frame_0.png
β”‚   β”‚       β”œβ”€β”€ frame_1.png
β”‚   β”‚       β”œβ”€β”€ frame_2.png
β”‚   β”‚       └── ...
β”‚   β”œβ”€β”€ 00002
β”‚   β”‚   β”œβ”€β”€ images
β”‚   β”‚   β”œβ”€β”€ faces
β”‚   β”‚   └── poses
β”‚   β”œβ”€β”€ 00003
β”‚   β”‚   β”œβ”€β”€ images
β”‚   β”‚   β”œβ”€β”€ faces
β”‚   β”‚   └── poses
β”‚   └── ...
β”œβ”€β”€ vec
β”‚   β”œβ”€β”€ 00001
β”‚   β”‚   β”œβ”€β”€ images
β”‚   β”‚   β”œβ”€β”€ faces
β”‚   β”‚   └── poses
β”‚   β”œβ”€β”€ 00002
β”‚   β”‚   β”œβ”€β”€ images
β”‚   β”‚   β”œβ”€β”€ faces
β”‚   β”‚   └── poses
β”‚   β”œβ”€β”€ 00003
β”‚   β”‚   β”œβ”€β”€ images
β”‚   β”‚   β”œβ”€β”€ faces
β”‚   β”‚   └── poses
β”‚   └── ...
β”œβ”€β”€ video_rec_path.txt
└── video_vec_path.txt

StableAnimator is trained on mixed-resolution videos, with 512x512 videos stored in animation_data/rec and 576x1024 videos stored in animation_data/vec. Each numbered folder in animation_data/rec or animation_data/vec (00001, 00002, 00003, ...) corresponds to one video and contains three subfolders, each holding multiple .png image files. All .png image files are named in the format frame_i.png, such as frame_0.png, frame_1.png, and so on. The three subfolders images, faces, and poses store the RGB frames, the corresponding human face masks, and the corresponding human skeleton poses, respectively. video_rec_path.txt and video_vec_path.txt record the folder paths under animation_data/rec and animation_data/vec, respectively. For example, the content of video_rec_path.txt is shown as follows:

path/StableAnimator/animation_data/rec/00001
path/StableAnimator/animation_data/rec/00002
path/StableAnimator/animation_data/rec/00003
path/StableAnimator/animation_data/rec/00004
path/StableAnimator/animation_data/rec/00005
path/StableAnimator/animation_data/rec/00006
...
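
Since these text files simply list the per-video folder paths, you can generate them with a small shell loop (a minimal sketch, assuming it is run from the StableAnimator root and that every entry under rec/ and vec/ is a video folder):

ls -d "$(pwd)"/animation_data/rec/*/ | sed 's:/*$::' > animation_data/video_rec_path.txt
ls -d "$(pwd)"/animation_data/vec/*/ | sed 's:/*$::' > animation_data/video_vec_path.txt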

If you only have raw videos, you can leverage ffmpeg to extract frames from raw videos and store them in the subfolder images.

ffmpeg -i raw_video_1.mp4 -q:v 1 -start_number 0 path/StableAnimator/animation_data/rec/00001/images/frame_%d.png

The obtained frames are saved in path/StableAnimator/animation_data/rec/00001/images.

For extracting the human skeleton poses, you can run the following command:

python DWPose/training_skeleton_extraction.py --root_path="path/StableAnimator/animation_data" --name="rec" --start=1 --end=500 

--root_path and --name refer to the root path of the training dataset and the name of the dataset subset, respectively. --start and --end specify the starting and ending indices of the selected training dataset. For example, --name="rec" --start=1 --end=500 indicates that the skeleton extraction starts at path/StableAnimator/animation_data/rec/00001 and ends at path/StableAnimator/animation_data/rec/00500.
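
To also extract skeletons for the 576x1024 subset, you can run the same script with --name="vec" (assuming the vec subset uses the same index range; adjust --start and --end to match your data):

python DWPose/training_skeleton_extraction.py --root_path="path/StableAnimator/animation_data" --name="vec" --start=1 --end=500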

For extraction details of corresponding face masks, please refer to the Human Face Mask Extraction section. When your dataset is organized exactly as outlined above, you can easily train your StableAnimator by running the following command:

bash command_train.sh

For the parameter details of command_train.sh:

  β€’ CUDA_VISIBLE_DEVICES specifies the GPU devices. In our setting, we use four NVIDIA A100 80GB GPUs to train StableAnimator (CUDA_VISIBLE_DEVICES=3,2,1,0).
  β€’ --pretrained_model_name_or_path and --output_dir refer to the pretrained SVD path and the path where checkpoints of the trained StableAnimator are saved.
  β€’ --data_root_path, --rec_data_path, and --vec_data_path are the root path of the datasets, the path of video_rec_path.txt, and the path of video_vec_path.txt, respectively.
  β€’ validation_image_folder, validation_control_folder, and validation_image are the paths of the validation ground truths, the validation driving skeleton poses, and the validation reference image, respectively.
  β€’ --sample_n_frames is the number of frames that StableAnimator processes in a single batch.
  β€’ --num_train_epochs is the number of training epochs. Note that the default number of training epochs is set to infinite; you can manually terminate the training process once you observe that your StableAnimator has reached its peak performance.

A hedged sketch of the corresponding launch command is given after the file tree below. The overall file structure of StableAnimator at training is shown as follows:

StableAnimator/
β”œβ”€β”€ DWPose
β”œβ”€β”€ animation
β”œβ”€β”€ animation_data
β”‚Β Β  β”œβ”€β”€ rec
β”‚Β Β  β”œβ”€β”€ vec
β”‚Β Β  β”œβ”€β”€ video_rec_path.txt
β”‚Β Β  └── video_vec_path.txt
β”œβ”€β”€ validation
β”‚Β Β  β”œβ”€β”€ ground_truth
β”‚Β Β  β”‚Β   β”œβ”€β”€ frame_0.png
β”‚Β Β  β”‚Β   β”œβ”€β”€ frame_1.png
β”‚Β Β  β”‚Β   β”œβ”€β”€ frame_2.png
β”‚Β Β  β”‚Β   └── ...
β”‚Β Β  β”œβ”€β”€ poses
β”‚Β Β  β”‚Β   β”œβ”€β”€ frame_0.png
β”‚Β Β  β”‚Β   β”œβ”€β”€ frame_1.png
β”‚Β Β  β”‚Β   β”œβ”€β”€ frame_2.png
β”‚Β Β  β”‚Β   └── ...
β”‚Β Β  └── reference.png
β”œβ”€β”€ checkpoints
β”‚Β Β  β”œβ”€β”€ DWPose
β”‚Β Β  β”‚Β   β”œβ”€β”€ dw-ll_ucoco_384.onnx
β”‚Β Β  β”‚Β Β  └── yolox_l.onnx
β”‚Β Β  β”œβ”€β”€ Animation
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ pose_net.pth
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ face_encoder.pth
β”‚Β Β  β”‚Β Β  └── unet.pth
β”‚Β Β  β”œβ”€β”€ SVD
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ feature_extractor
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ image_encoder
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ scheduler
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ unet
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ vae
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ model_index.json
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ svd_xt.safetensors
β”‚Β Β  β”‚Β Β  └── svd_xt_image_decoder.safetensors
β”‚Β Β  └── inference.zip
β”œβ”€β”€ models
β”‚   └── antelopev2
β”‚       β”œβ”€β”€ 1k3d68.onnx
β”‚       β”œβ”€β”€ 2d106det.onnx
β”‚       β”œβ”€β”€ genderage.onnx
β”‚       β”œβ”€β”€ glintr100.onnx
β”‚       └── scrfd_10g_bnkps.onnx
β”œβ”€β”€ app.py
β”œβ”€β”€ command_basic_infer.sh
β”œβ”€β”€ inference_basic.py
β”œβ”€β”€ train.py
β”œβ”€β”€ command_train.sh
└── requirement.txt 
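
As referenced above, here is a hedged sketch of the kind of training invocation that command_train.sh wraps, assembled only from the parameters discussed in this section (the shipped script is authoritative; it may use a distributed launcher, different flag spellings, and additional arguments, and the output directory name here is a placeholder):

CUDA_VISIBLE_DEVICES=3,2,1,0 python train.py \
  --pretrained_model_name_or_path="path/checkpoints/SVD" \
  --output_dir="path/StableAnimator/output_checkpoints" \
  --data_root_path="path/StableAnimator/animation_data" \
  --rec_data_path="path/StableAnimator/animation_data/video_rec_path.txt" \
  --vec_data_path="path/StableAnimator/animation_data/video_vec_path.txt" \
  --validation_image_folder="path/StableAnimator/validation/ground_truth" \
  --validation_control_folder="path/StableAnimator/validation/poses" \
  --validation_image="path/StableAnimator/validation/reference.png" \
  --sample_n_frames=16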

It is worth noting that training StableAnimator requires approximately 70GB of VRAM due to the mixed-resolution (512x512 and 576x1024) training pipeline. However, if you train StableAnimator exclusively on 512x512 videos, the VRAM requirement is reduced to approximately 40GB. Additionally, the backgrounds of the selected training videos should remain static, as this helps the diffusion model compute an accurate reconstruction loss.

Regarding finetuning StableAnimator, you can run the following command:

bash command_finetune.sh

posenet_model_finetune_path, face_encoder_finetune_path, and unet_model_finetune_path in command_finetune.sh refer to the paths of the pretrained StableAnimator weights used as the starting point for finetuning.

VRAM requirement and Runtime

For the 15s demo video (512x512, fps=30), the 16-frame basic model requires 8GB VRAM and finishes in 5 minutes on a 4090 GPU.

The minimum VRAM requirement for the 16-frame U-Net of the pro model is 10GB (576x1024, fps=30); however, the VAE decoder demands 16GB. You have the option to run the VAE decoder on CPU.

Contact

If you have any suggestions or find our work helpful, feel free to contact me:

Email: francisshuyuan@gmail.com

If you find our work useful, please consider giving this GitHub repository a star and citing it:

@article{tu2024stableanimator,
  title={StableAnimator: High-Quality Identity-Preserving Human Image Animation},
  author={Shuyuan Tu and Zhen Xing and Xintong Han and Zhi-Qi Cheng and Qi Dai and Chong Luo and Zuxuan Wu},
  journal={arXiv preprint arXiv:2411.17697},
  year={2024}
}
