Official PyTorch implementation for the paper:
ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE. (Accepted at ACM SIGGRAPH MIG 2024)
We propose ProbTalk3D, a VQ-VAE based probabilistic model for emotion controllable speech-driven 3D facial animation synthesis. ProbTalk3D first learns a motion prior using VQ-VAE codebook matching, then trains a speech and emotion conditioned network leveraging this prior. During inference, probabilistic sampling of latent codebook embeddings enables non-deterministic outputs.
- Linux and Windows (tested on Windows 10)
- Python 3.9+
- PyTorch 2.1.1
- CUDA 12.1 (GPU with at least 2.55GB VRAM)
To run our program, first create a virtual environment. We recommend using Miniconda or Miniforge. Once either is installed, open a command prompt (on Windows, run it as Administrator) and run the following commands:
conda create --name probtalk3d python=3.9
conda activate probtalk3d
pip install torch==2.1.1+cu121 torchvision==0.16.1+cu121 torchaudio==2.1.1+cu121 -f https://download.pytorch.org/whl/torch_stable.html
Then, navigate to the project root folder and execute:
pip install -r requirements.txt
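Before moving on, it can help to confirm that the interpreter and PyTorch build match the requirements listed above. The following is a minimal sketch rather than part of the repository; the function name `check_environment` is hypothetical, and it only reports what it finds without enforcing specific versions.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 9)):
    """Return a small report on the Python and PyTorch setup."""
    report = {
        "python_ok": sys.version_info >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": None,
    }
    if report["torch_installed"]:
        # Imported lazily so the check also runs before PyTorch is installed.
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

With the environment above, `torch_version` should start with 2.1.1 and `cuda_available` should be true on a machine with a working CUDA driver.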
Download the 3DMEAD dataset following the instructions of EMOTE. This dataset represents facial animations using FLAME parameters.
- Please refer to the README.md file in the datasets/3DMEAD_preprocess/ folder.
- After processing, the resulting *.npy files will be located in the datasets/mead/param folder, and the .wav files should be in the datasets/mead/wav folder.
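A quick consistency check of the two folders can catch incomplete preprocessing early. The sketch below assumes that each parameter file and its audio file share the same base name, which this README does not state; the helper `find_unpaired` is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_unpaired(param_dir, wav_dir):
    """Return base names that appear in only one of the two folders.

    Assumption: a sequence is stored as <name>.npy and <name>.wav.
    """
    param_stems = {p.stem for p in Path(param_dir).glob("*.npy")}
    wav_stems = {p.stem for p in Path(wav_dir).glob("*.wav")}
    return sorted(param_stems - wav_stems), sorted(wav_stems - param_stems)


if __name__ == "__main__":
    missing_wav, missing_npy = find_unpaired("datasets/mead/param", "datasets/mead/wav")
    print(f"{len(missing_wav)} .npy files without audio")
    print(f"{len(missing_npy)} .wav files without parameters")
```

If the naming assumption does not hold for your copy of the data, adapt the comparison to the actual file naming before relying on the counts.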
Optional Operation
For training the comparison model in vertex space, we provide a script to transfer the FLAME parameters to vertices. Execute the script pre_process/param_to_vert.py. The resulting *.npy files should be located in the datasets/mead/vertex folder.
To train the model from scratch, follow the two-stage training approach outlined below. For the first stage of training, use the following commands:
- On Windows and Linux:
  python train_all.py experiment=vqvae_prior state=new data=mead_prior model=model_vqvae_prior
- If the Linux system has Slurm Workload Manager, use the following command:
  sbatch train_vqvae_prior.sh
Optional Operation
- We use Hydra configuration, which allows us to easily override settings at runtime. For example, to change the GPU ID to 1 on a multi-GPU system, set trainer.devices=[1]. To load a small amount of data for debugging, set data.debug=true.
- To resume training from a checkpoint, set the state to resume and specify the folder and version. Specifically, replace the folder and version in the command below with the folder name where the checkpoint is saved. Our program generates a random name for each run, and the version is assigned automatically by the program, which may vary depending on the operating system.
  python train_all.py experiment=vqvae_prior state=resume data=mead_prior model=model_vqvae_prior folder=outputs/MEAD/vqvae_prior/XXX version=0
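When launching several runs from a script, the key=value overrides can be assembled programmatically. The following is a small sketch under the assumption that the overrides are passed exactly as in the commands above; `build_command` is a hypothetical helper, and the folder value is the same XXX placeholder used in this README.

```python
def build_command(script, overrides):
    """Turn a script name and a dict of overrides into an argument list."""
    args = ["python", script]
    for key, value in overrides.items():
        if isinstance(value, bool):
            value = str(value).lower()  # the README writes booleans as true/false
        elif isinstance(value, (list, tuple)):
            value = "[" + ",".join(str(v) for v in value) + "]"
        args.append(f"{key}={value}")
    return args


resume = build_command(
    "train_all.py",
    {
        "experiment": "vqvae_prior",
        "state": "resume",
        "data": "mead_prior",
        "model": "model_vqvae_prior",
        "folder": "outputs/MEAD/vqvae_prior/XXX",  # placeholder, replace with your run folder
        "version": 0,
        "trainer.devices": [1],
        "data.debug": True,
    },
)
print(" ".join(resume))
```

The returned list can be handed to `subprocess.run` directly, which avoids shell-specific quoting of the square brackets.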
VAE variant
To train the VAE variant for comparison, follow the same instructions as above and change the experiment and model settings as below:
python train_all.py experiment=vae_prior state=new data=mead_prior model=model_vae_prior
After completing stage 1 training, execute the following command to proceed with stage 2 training. Set model.folder_prior and model.version_prior to the location where the motion prior checkpoint is stored:
- On Windows and Linux:
  python train_all.py experiment=vqvae_pred state=new data=mead_pred model=model_vqvae_pred model.folder_prior=outputs/MEAD/vqvae_prior/XXX model.version_prior=0
- If the Linux system has Slurm Workload Manager, use the following command. Remember to revise the model.folder_prior and model.version_prior in the file.
  sbatch train_vqvae_pred.sh
Optional Operation
- Similar to the first stage of training, the GPU ID can be changed by setting trainer.devices=[1], and debug mode can be enabled by setting data.debug=true.
- To resume training from a checkpoint, set the state to resume and specify the folder and version:
  python train_all.py experiment=vqvae_pred state=resume data=mead_pred model=model_vqvae_pred folder=outputs/MEAD/vqvae_pred/XXX version=0 model.folder_prior=outputs/MEAD/vqvae_prior/XXX model.version_prior=0
VAE variant
To train the VAE variant for comparison, follow the same instructions as above and change the experiment, model, and prior folder settings as below:
python train_all.py experiment=vae_pred state=new data=mead_pred model=model_vae_pred model.folder_prior=outputs/MEAD/vae_prior/XXX model.version_prior=0
Download the trained model weights from HERE and unzip them into the project root folder.
We provide code to compute the evaluation metrics mentioned in our paper. To evaluate our trained model, run the following:
- On Windows and Linux:
  python evaluation.py folder=model_weights/ProbTalk3D/stage_2 number_of_samples=10
- If the Linux system has Slurm Workload Manager, use the following command:
  sbatch evaluation.sh
Optional Operation
- Adjust the GPU ID if necessary; for instance, set device=1.
- To evaluate your own trained model, specify the folder and version according to the location where the checkpoint is saved:
  python evaluation.py folder=outputs/MEAD/vqvae_pred/XXX version=0 number_of_samples=10
VAE variant
To evaluate the trained VAE variant, execute the following command:
python evaluation.py folder=model_weights/VAE_variant/stage_2 number_of_samples=10
For qualitative evaluation, refer to the script evaluation_quality.py.
Download the trained model weights from HERE and unzip them into the project root folder.
Our model is trained to generate animations across 32 speaking styles (IDs), 8 emotions, and 3 intensities. Check all available conditions:
ID:
M003, M005, M007, M009, M011, M012, M013, M019,
M022, M023, M024, M025, M026, M027, M028, M029,
M030, M031, W009, W011, W014, W015, W016, W018,
W019, W021, W023, W024, W025, W026, W028, W029
emotion:
neutral, happy, sad, surprised, fear, disgusted, angry, contempt
intensity (stands for low, medium, high intensity in order):
0, 1, 2
We provide several test audios. Run the following command to generate animations (with a random style) using the trained ProbTalk3D. This will produce .npy files that can be rendered into videos.
- On Windows and Linux:
  python generation.py folder=model_weights/ProbTalk3D/stage_2 input_path=results/generation/test_audio
- To specify styles for the provided test audios, use the following command. When setting style conditions for multiple files at once, ensure the setting order follows the filename sorting of Windows.
  python generation.py folder=model_weights/ProbTalk3D/stage_2 input_path=results/generation/test_audio id=[\"M009\",\"M024\",\"W023\",\"W011\",\"W019\",\"M013\",\"M011\",\"W016\"] emotion=[\"angry\",\"contempt\",\"disgusted\",\"fear\",\"happy\",\"neutral\",\"sad\",\"surprised\"] intensity=[1,2,0,2,1,0,1,2]
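Because the three style lists must line up with the sorted audio filenames, it is easy to misalign them by hand. The sketch below builds the id, emotion, and intensity overrides from a per-file mapping and rejects values outside the conditions listed above. It is illustrative only: `style_overrides` is a hypothetical helper, the case-insensitive sort is merely an approximation of Windows filename sorting, and happy.wav is an assumed filename.

```python
VALID_IDS = {
    "M003", "M005", "M007", "M009", "M011", "M012", "M013", "M019",
    "M022", "M023", "M024", "M025", "M026", "M027", "M028", "M029",
    "M030", "M031", "W009", "W011", "W014", "W015", "W016", "W018",
    "W019", "W021", "W023", "W024", "W025", "W026", "W028", "W029",
}
VALID_EMOTIONS = {"neutral", "happy", "sad", "surprised", "fear",
                  "disgusted", "angry", "contempt"}
VALID_INTENSITIES = {0, 1, 2}


def style_overrides(styles):
    """Build the three list overrides from {filename: (id, emotion, intensity)}."""
    # Assumption: a case-insensitive sort approximates Windows filename order.
    ordered = sorted(styles, key=str.lower)
    ids, emotions, intensities = [], [], []
    for name in ordered:
        speaker, emotion, intensity = styles[name]
        if (speaker not in VALID_IDS or emotion not in VALID_EMOTIONS
                or intensity not in VALID_INTENSITIES):
            raise ValueError(f"unsupported style for {name}: {styles[name]}")
        ids.append(speaker)
        emotions.append(emotion)
        intensities.append(intensity)

    def quoted(items):
        # Same escaped-quote form as the command in this README.
        return "[" + ",".join(f'\\"{item}\\"' for item in items) + "]"

    return [
        f"id={quoted(ids)}",
        f"emotion={quoted(emotions)}",
        "intensity=[" + ",".join(str(i) for i in intensities) + "]",
    ]


overrides = style_overrides({
    "happy.wav": ("W019", "happy", 1),  # hypothetical filename
    "angry.wav": ("M009", "angry", 1),
})
print(" ".join(overrides))
```

The printed overrides can be appended to the generation.py command. If your filenames mix digits and letters, check the resulting order against the file listing, since the approximation may differ from the actual Windows ordering.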
Optional Operation
- To generate multiple outputs (for example, 2 outputs) using one test audio, set number_of_samples=2.
- The default generation process uses stochastic sampling. To control diversity, adjust temperature=X. The default X value is 0.2; we recommend choosing a value between 0.1 and 0.5.
- Our model can operate deterministically by setting sample=false, bypassing stochastic sampling.
- To play with your own data, modify the input_path or place them in the folder results/generation/test_audio.
- Adjust the GPU ID if necessary; for instance, set device=1.
- To generate animation with your own trained model, specify the folder and version according to the location where the checkpoint is saved:
  python generation.py folder=outputs/MEAD/vqvae_pred/XXX version=0 input_path=results/generation/test_audio
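To build intuition for the temperature and sample options, the sketch below shows generic temperature-scaled sampling over a set of scores, one per codebook entry. It is not the repository's implementation: how the model derives its scores is not described here, and `sample_code` is a hypothetical name. It only illustrates that a lower temperature concentrates the choice on the highest-scoring entry, and that disabling sampling reduces to an argmax.

```python
import math
import random


def sample_code(logits, temperature=0.2, sample=True, rng=random):
    """Pick one index from a list of unnormalised scores."""
    if not sample:
        # Deterministic mode: always take the highest-scoring entry.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


if __name__ == "__main__":
    scores = [1.0, 3.0, 2.0]  # made-up scores for three codebook entries
    rng = random.Random(0)
    for temperature in (0.1, 0.5, 5.0):
        draws = [sample_code(scores, temperature, rng=rng) for _ in range(1000)]
        counts = [draws.count(i) for i in range(len(scores))]
        print(f"temperature={temperature}: counts per entry = {counts}")
```

Running the script shows the counts spreading across entries as the temperature grows, which mirrors the diversity control described above.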
VAE variant
- To generate animations (with a random style) using the trained VAE variant, run the following command:
  python generation.py folder=model_weights/VAE_variant/stage_2 input_path=results/generation/test_audio
- Similarly, follow the above instructions to specify the style or generate multiple files by setting number_of_samples.
- The default generation process sets the scale factor to 20. To control diversity, adjust fact=X. We recommend setting X between 1 and 40. Setting X as 1 means no scaling.
The generated .npy files contain FLAME parameters and can be rendered into videos following the below instructions.
- We use Blender to render the predicted motion. First, download the dependencies from HERE and extract them into the deps folder. Please note that this command can only be executed on Windows:
  python render_param.py result_folder=results/generation/vqvae_pred/stage_2/0.2 audio_folder=results/generation/test_audio
Optional Operation
- To play with your own data, modify result_folder to where the generated .npy files are stored, and audio_folder to where the .wav files are located.
- We provide post-processing code in the post_process folder. To change face shapes for the predicted motion, refer to the script change_shape_param.py.
- To convert predicted motion to vertex space, refer to the script post_process/transfer_to_vert.py. For rendering animation in vertex space, use the following command on Windows and Linux:
  python render_vert.py result_folder=results/generation/vqvae_pred/stage_2/0.2 audio_folder=results/generation/test_audio
VAE variant
To render the generated animations produced by the trained VAE variant, use the following command on Windows:
python render_param.py result_folder=results/generation/vae_pred/stage_2/20 audio_folder=results/generation/test_audio
To compare with the diffusion model FaceDiffuser (modified version), navigate to the diffusion folder.
To train the model from scratch, execute the following command:
python main.py
To quantitatively evaluate our trained FaceDiffuser model, run the following command:
python evaluation_facediff.py --save_path "../model_weights/FaceDiffuser" --max_epoch 50
To generate animations using our trained model, execute the following command. Modify the path and style settings as needed.
python predict.py --save_path "../model_weights/FaceDiffuser" --epoch 50 --subject "M009" --id "M009" --emotion 6 --intensity 1 --wav_path "../results/generation/test_audio/angry.wav"
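The example above passes --emotion 6 together with angry.wav, which matches the zero-based position of angry in the emotion list given earlier in this README. Assuming that this ordering is what predict.py expects (an assumption worth verifying against the code in the diffusion folder), a small helper can translate emotion names into the integer argument. The function names below are hypothetical; only flags that appear in the command above are used.

```python
# Order copied from the emotion list earlier in this README.
# Assumption: predict.py uses the zero-based position in this list.
EMOTIONS = ["neutral", "happy", "sad", "surprised",
            "fear", "disgusted", "angry", "contempt"]


def emotion_index(name):
    """Return the assumed integer code for an emotion name."""
    return EMOTIONS.index(name)


def predict_args(subject, emotion, intensity, wav_path,
                 save_path="../model_weights/FaceDiffuser", epoch=50):
    """Assemble the argument list for predict.py using the flags shown above."""
    return [
        "python", "predict.py",
        "--save_path", save_path,
        "--epoch", str(epoch),
        "--subject", subject,
        "--id", subject,
        "--emotion", str(emotion_index(emotion)),
        "--intensity", str(intensity),
        "--wav_path", wav_path,
    ]


args = predict_args("M009", "angry", 1, "../results/generation/test_audio/angry.wav")
print(" ".join(args))
```

The example reuses the same subject for --subject and --id, as in the command above; whether the two may differ is not stated in this README.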
Navigate back to the project root folder and run the following command:
python render_vert.py result_folder=diffusion/results/generation audio_folder=results/generation/test_audio
If you find the code useful for your work, please consider starring this repository and citing it:
@inproceedings{Probtalk3D_Wu_MIG24,
author = {Wu, Sichun and Haque, Kazi Injamamul and Yumak, Zerrin},
title = {ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE},
booktitle = {The 17th ACM SIGGRAPH Conference on Motion, Interaction, and Games (MIG '24), November 21--23, 2024, Arlington, VA, USA},
year = {2024},
location = {Arlington, VA, USA},
numpages = {12},
url = {https://doi.org/10.1145/3677388.3696320},
doi = {10.1145/3677388.3696320},
publisher = {ACM},
address = {New York, NY, USA}
}
We borrow and adapt code from Learning to Listen, CodeTalker, TEMOS, FaceXHuBERT, and FaceDiffuser. We thank the authors for making their code available and facilitating future research. Additionally, we are grateful to the creators of the 3DMEAD dataset used in this project.
Any third-party packages are owned by their respective authors and must be used under their respective licenses.
This repository is released under the CC-BY-NC-4.0-International License.