The official PyTorch implementation of the paper "GMD: Controllable Human Motion Synthesis via Guided Diffusion Models".
For more details, visit our project page.
📢
20/Dec/23 - We release DNO: Optimizing Diffusion Noise Can Serve As Universal Motion Priors, a follow-up work that looks at how to effectively use diffusion models and guidance to tackle many motion tasks.
28/July/23 - First release.
If you find this code useful in your research, please cite:
@inproceedings{karunratanakul2023gmd,
title = {Guided Motion Diffusion for Controllable Human Motion Synthesis},
author = {Karunratanakul, Korrawe and Preechakul, Konpat and Suwajanakorn, Supasorn and Tang, Siyu},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages = {2151--2162},
year = {2023}
}
This code was tested on Ubuntu 20.04 LTS and requires:
- Python 3.7
- conda3 or miniconda3
- CUDA-capable GPU (one is enough)
Install ffmpeg (if not already installed):
sudo apt update
sudo apt install ffmpeg
For Windows, use this instead.
GMD shares a large part of its base dependencies with MDM. However, you might find it easier to install our dependencies from scratch due to some key version differences.
Setup conda env:
conda env create -f environment_gmd.yml
conda activate gmd
conda remove --force ffmpeg
python -m spacy download en_core_web_sm
pip install git+https://github.com/openai/CLIP.git
Download dependencies:
Text to Motion
bash prepare/download_smpl_files.sh
bash prepare/download_glove.sh
bash prepare/download_t2m_evaluators.sh
Unconstrained
bash prepare/download_smpl_files.sh
bash prepare/download_recognition_unconstrained_models.sh
There are two paths to get the data:
(a) Generation only, with the pretrained text-to-motion model, without training or evaluating.
(b) Full data, to train and evaluate the model.
HumanML3D - Clone HumanML3D, then copy the data dir to our repository:
cd ..
git clone https://github.com/EricGuo5513/HumanML3D.git
unzip ./HumanML3D/HumanML3D/texts.zip -d ./HumanML3D/HumanML3D/
cp -r HumanML3D/HumanML3D guided-motion-diffusion/dataset/HumanML3D
cd guided-motion-diffusion
cp -a dataset/HumanML3D_abs/. dataset/HumanML3D/
[Important!]
Because we change the representation of the root joint from relative to absolute, you need to replace the original files and run our version of motion_representation.ipynb and cal_mean_variance.ipynb provided in ./HumanML3D_abs/ instead to get the absolute-root data.
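To illustrate the difference between the two root representations, here is a minimal sketch. It assumes, for simplicity, that the relative representation is a per-frame displacement of the root on the ground plane, already expressed in world coordinates. The real conversion in the notebooks also has to account for the root rotation, so treat this only as an illustration. The function name and array layout are hypothetical and do not come from this repository.

```python
import numpy as np


def relative_to_absolute_root(displacements, start=(0.0, 0.0)):
    """Integrate per-frame (dx, dz) root displacements into absolute positions.

    displacements: array-like of shape (T - 1, 2), world-frame steps (assumption).
    start: absolute (x, z) position of the root at frame 0.
    Returns an array of shape (T, 2) with the absolute (x, z) root positions.
    """
    displacements = np.asarray(displacements, dtype=np.float64)
    start = np.asarray(start, dtype=np.float64)
    # A relative representation only stores the change between frames, so the
    # absolute position is the running sum of those changes plus the start.
    offsets = np.cumsum(displacements, axis=0)
    return np.vstack([start, start + offsets])
```

With an absolute root, a spatial condition such as "be at (x, z) on frame t" constrains a single value of the representation directly, rather than a sum over all earlier frames.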
HumanML3D - Follow the instructions in HumanML3D, then copy the resulting dataset to our repository:
cp -r ../HumanML3D/HumanML3D ./dataset/HumanML3D
Download both models, then unzip and place them in ./save/.
Both models are trained on the HumanML3D dataset.
Text to Motion - Without spatial conditioning
This part is standard text-to-motion generation.
Note: We changed the behavior of the --num_repetitions flag from the original MDM repo to facilitate the two-stage pipeline and logging. Only --num_repetitions 1 is supported at the moment.
python -m sample.generate --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --num_samples 10
python -m sample.generate --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --input_text ./assets/example_text_prompts.txt
python -m sample.generate --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --text_prompt "a person is picking up something on the floor"
Text to Motion - With keyframe locations conditioning
The predefined pattern can be found in get_kframes() in sample/keyframe_pattern.py. You can add more patterns there using the same format [(frame_num_1, (x_1, z_1)), (frame_num_2, (x_2, z_2)), ...], where x and z are the location of the root joint on the ground plane in the world coordinate system.
python -m sample.generate --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --text_prompt "a person is walking while raising both hands" --guidance_mode kps
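As a minimal sketch of the keyframe format described above, the snippet below builds a pattern that walks around a square and checks that it is well formed. The helper names are hypothetical and are not part of sample/keyframe_pattern.py. The default limit of 196 frames is an assumption that corresponds to the 9.8-second maximum at 20 frames per second.

```python
from typing import List, Tuple

# One keyframe: (frame_num, (x, z)) with (x, z) the root location on the ground plane.
Keyframe = Tuple[int, Tuple[float, float]]


def make_square_keyframes(side: float = 2.0, frames_per_side: int = 30) -> List[Keyframe]:
    """Return four keyframes placed at the corners of a square path."""
    corners = [(side, 0.0), (side, side), (0.0, side), (0.0, 0.0)]
    return [((i + 1) * frames_per_side, corner) for i, corner in enumerate(corners)]


def validate_keyframes(keyframes: List[Keyframe], max_frames: int = 196) -> bool:
    """Check that frame numbers are strictly increasing and within the motion length."""
    frames = [frame for frame, _ in keyframes]
    in_range = all(0 <= frame < max_frames for frame in frames)
    increasing = all(a < b for a, b in zip(frames, frames[1:]))
    return in_range and increasing
```

A list produced this way has the same shape as the predefined patterns, so it could be pasted into the pattern file as a new entry.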
(In development) Using the --interactive flag will start an interactive window that allows you to choose the keyframes yourself. The interactive pattern will override the predefined pattern.
Text to Motion - With keyframe locations conditioning and obstacle avoidance
Similarly, the pattern is defined in get_obstacles() in sample/keyframe_pattern.py. You can add more patterns using the format ((x, z), radius). Currently we only support circular obstacles because their signed distance function (SDF) is easy to define, but you can add any shape with a valid SDF.
python -m sample.generate --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --text_prompt "a person is walking while raising both hands" --guidance_mode sdf --seed 11
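For reference, the SDF of a circle is the distance to its center minus its radius: negative inside the obstacle, zero on the boundary, and positive outside. The sketch below evaluates it for obstacles given in the ((x, z), radius) format. It uses NumPy for illustration only; the function names are hypothetical and this is not the implementation used by the guidance code.

```python
import numpy as np


def circle_sdf(points, center, radius):
    """Signed distance from each (x, z) point to a circle; negative means inside."""
    points = np.asarray(points, dtype=np.float64)
    center = np.asarray(center, dtype=np.float64)
    return np.linalg.norm(points - center, axis=-1) - radius


def scene_sdf(points, obstacles):
    """Signed distance to the union of obstacles given as ((x, z), radius) tuples.

    The union of several shapes is taken as the minimum over their SDFs.
    """
    distances = [circle_sdf(points, center, radius) for center, radius in obstacles]
    return np.min(np.stack(distances, axis=0), axis=0)
```

Any other shape can be added in the same way, as long as its function returns a signed distance with the same sign convention.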
Text to Motion - With trajectory conditioning
The trajectory-conditioned generation is a special case of keyframe-conditioned generation, where all the frames are keyframes.
The sample trajectory we used can be found in ./save/template_joints.npy. You can also use your own trajectory by providing the list of ground_positions.
python -m sample.generate --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --text_prompt "a person is walking while raising both hands" --guidance_mode trajectory
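The sketch below illustrates why a trajectory is a special case of keyframes: it builds a straight-line list of ground positions, one per frame, and then expresses it in the keyframe format with every frame as a keyframe. The function names are hypothetical, and the exact array layout expected for ground_positions in this repository is not confirmed here.

```python
import numpy as np


def straight_line_trajectory(start, end, num_frames):
    """Return (num_frames, 2) ground positions (x, z) interpolated from start to end."""
    start = np.asarray(start, dtype=np.float64)
    end = np.asarray(end, dtype=np.float64)
    t = np.linspace(0.0, 1.0, num_frames)[:, None]  # column of interpolation weights
    return (1.0 - t) * start + t * end


def trajectory_to_keyframes(ground_positions):
    """Express a dense trajectory as keyframes: one (frame, (x, z)) entry per frame."""
    return [(i, (float(x), float(z))) for i, (x, z) in enumerate(ground_positions)]
```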
(In development) Using the --interactive flag will start an interactive window that allows you to draw a trajectory that will override the predefined pattern.
You may also define:
- --device id.
- --seed to sample different prompts.
- --motion_length (text-to-motion only) in seconds (maximum is 9.8 sec).
- --progress to save the denoising progress.
Running those will get you:
- results.npy file with text prompts and xyz positions of the generated animation.
- sample##_rep##.mp4 - a stick figure animation for each generated motion.
- trajec_##_#### - a plot of the trajectory at each denoising step of the trajectory model. The final trajectory is then used to generate the motion.
- motion_trajec_##_#### - a plot of the trajectory of the generated motion at each denoising step of the motion model.
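To inspect results.npy, a sketch like the following may help. It assumes that, as in MDM, the file holds a pickled Python dictionary, which is why allow_pickle=True is needed. The key names are not confirmed for this repository, so the helper only reports whatever keys it finds. Both function names are hypothetical.

```python
import numpy as np


def load_results(path):
    """Load a .npy file; unwrap it if it stores a single pickled object (e.g. a dict)."""
    data = np.load(path, allow_pickle=True)
    if data.dtype == object and data.shape == ():
        return data.item()
    return data


def summarize(results):
    """Map each key to the array shape, or to the type name for non-array values."""
    summary = {}
    for key, value in results.items():
        shape = getattr(value, "shape", None)
        summary[key] = shape if shape is not None else type(value).__name__
    return summary
```

Calling summarize(load_results("results.npy")) on a generated file lists the stored fields and their shapes. Only load pickled files that you generated yourself or otherwise trust.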
You can stop here, or render the SMPL mesh using the following script.
To create SMPL mesh per frame run:
python -m visualize.render_mesh --input_path /path/to/mp4/stick/figure/file
This script outputs:
- sample##_rep##_smpl_params.npy - SMPL parameters (thetas, root translations, vertices and faces).
- sample##_rep##_obj - Mesh per frame in .obj format.
Notes:
- The .obj can be integrated into Blender/Maya/3DS-MAX and rendered using them.
- This script runs SMPLify and needs a GPU as well (can be specified with the --device flag).
- Important - Do not change the original .mp4 path before running the script.
Notes for 3D makers:
- You have two ways to animate the sequence:
  - Use the SMPL add-on and the theta parameters saved to sample##_rep##_smpl_params.npy (we always use beta=0 and the gender-neutral model).
  - A more straightforward way is using the mesh data itself. All meshes have the same topology (SMPL), so you just need to keyframe vertex locations. Since the OBJs do not preserve vertex order, we also save this data to the sample##_rep##_smpl_params.npy file for your convenience.
GMD is trained on the HumanML3D dataset.
python -m train.train_trajectory
python -m train.train_gmd
Essentially, the same command is used for both the trajectory model and the motion model. You can select which model to train by changing the train_args. The training options can be found in ./configs/card.py.
- Use --device to define GPU id.
- Add --train_platform_type {ClearmlPlatform, TensorboardPlatform} to track results with either ClearML or Tensorboard.
All evaluations are done on the HumanML3D dataset.
- Takes about 20 hours (on a single GPU)
- The output of this script for the pre-trained models (as was reported in the paper) is provided in the checkpoints zip file.
python -m eval.eval_humanml --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt
For each prompt, we use the ground truth trajectory as the condition.
python -m eval.eval_humanml --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt --full_traj_inpaint
For each prompt, 5 keyframes are sampled from the ground truth motion. The ground locations of the root joint in those frames are used as conditions.
python -m eval.eval_humanml_condition --model_path ./save/unet_adazero_xl_x0_abs_proj10_fp16_clipwd_224/model000500000.pt
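The keyframe conditions used in this evaluation can be pictured with the sketch below. It assumes the frames are drawn uniformly at random without replacement and that the ground truth root trajectory is available as an array of (x, z) positions. The text above only states that 5 keyframes are sampled, so the sampling scheme, the function name and the array layout are all assumptions.

```python
import numpy as np


def sample_keyframe_conditions(root_xz, num_keyframes=5, rng=None):
    """Pick keyframes from a ground truth root trajectory.

    root_xz: array-like of shape (T, 2), ground location (x, z) of the root per frame.
    Returns a list of (frame, (x, z)) tuples sorted by frame number.
    """
    root_xz = np.asarray(root_xz, dtype=np.float64)
    rng = np.random.default_rng() if rng is None else rng
    # Uniform sampling without replacement is an assumption made for illustration.
    frames = np.sort(rng.choice(len(root_xz), size=num_keyframes, replace=False))
    return [(int(f), (float(root_xz[f, 0]), float(root_xz[f, 1]))) for f in frames]
```

Passing a seeded generator, for example rng=np.random.default_rng(0), makes the selection reproducible across runs.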
We would like to thank the following contributors for the great foundation that we build upon:
MDM, guided-diffusion, MotionCLIP, text-to-motion, actor, joints2smpl, MoDi.
This code is distributed under an MIT LICENSE.
Note that our code depends on other libraries, including CLIP, SMPL, SMPL-X, PyTorch3D, and uses datasets that each have their own respective licenses that must also be followed.