forked from natlamir/DINet

The source code of "DINet: deformation inpainting network for realistic face visually dubbing on high resolution video."

DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video (AAAI2023)

Windows Install Instructions

Create new conda environment

conda create -n dinet python=3.7
conda activate dinet

Clone repository

git clone https://github.com/natlamir/DINet.git
cd DINet

Install Dependencies

pip install -r requirements.txt

Install torch 1.7.1

pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

Install tensorflow 1.15.2

pip install tensorflow==1.15.2
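
After installing, it can help to confirm that the pinned versions above are the ones actually present in the environment. The following is a minimal sketch, not part of the repository: the helper names `base_version` and `find_mismatches` are hypothetical, and the only values taken from this README are the four version pins.

```python
# Sketch: compare installed package versions against the pins used in this README.
try:
    from importlib.metadata import version, PackageNotFoundError  # Python 3.8+
except ImportError:
    # Python 3.7 fallback (the version used in the conda environment above).
    import pkg_resources

    PackageNotFoundError = pkg_resources.DistributionNotFound

    def version(name):
        return pkg_resources.get_distribution(name).version

# Pins copied from the install commands above.
EXPECTED = {
    "torch": "1.7.1",
    "torchvision": "0.8.2",
    "torchaudio": "0.7.2",
    "tensorflow": "1.15.2",
}


def base_version(v):
    """Drop a local build tag such as '+cu110' so '1.7.1+cu110' matches '1.7.1'."""
    return v.split("+", 1)[0]


def find_mismatches(expected=EXPECTED, lookup=version):
    """Return {package: installed version or None} for every package that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = base_version(lookup(name))
        except PackageNotFoundError:
            found = None  # package is not installed
        if found != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    for name, found in find_mismatches().items():
        print(f"{name}: expected {EXPECTED[name]}, found {found or 'not installed'}")
```

An empty result from `find_mismatches()` means all four packages match the pins; it does not check that CUDA itself is usable.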

Inference

Download the resources (asserts.zip) from Google Drive, unzip the archive, and put the resulting directory in ./.
  • Inference with example videos. Run
python inference.py --mouth_region_size=256 --source_video_path=./asserts/examples/testxxx.mp4 --source_openface_landmark_path=./asserts/examples/testxxx.csv --driving_audio_path=./asserts/examples/driving_audio_xxx.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth  

The results are saved in ./asserts/inference_result

  • Inference with custom videos.
    Note: The released pretrained model was trained on the HDTF dataset with 363 training videos (video names are in ./asserts/training_video_name.txt), so its generalization is limited. It is better to test custom videos with normal lighting, a frontal view, etc. (see the limitation section in the paper). We also release the training code, so if a larger high-resolution audio-visual dataset is proposed in the future, you can use it to train a model with better generalization. In addition, we release the coarse-to-fine training strategy, so you can train a model at an arbitrary resolution (larger than 416x320 if GPU memory and training data allow).

Use OpenFace to detect smooth facial landmarks in your custom video. We ran OpenFaceOffline.exe on Windows 10 with the following settings:

    Record: 2D landmark & tracked videos
    Recording settings: Mask aligned image
    OpenFace setting: Use dynamic AU models
    View: Show video
    Face Detector: Openface (MTCNN)
    Landmark Detector: CE-CLM

The detected facial landmarks are saved in "xxxx.csv". Run

python inference.py --mouth_region_size=256 --source_video_path= custom video path --source_openface_landmark_path=  detected landmark path --driving_audio_path= driving audio path --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth  

to perform visual face dubbing on your custom videos.
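
When running inference on many custom videos, it can be convenient to assemble the command above in a script and to check that the input files exist before launching it. The following is a minimal sketch under stated assumptions: `build_inference_command` and `missing_inputs` are hypothetical helpers, the file names in the usage section are placeholders, and only the flags and the checkpoint path come from this README.

```python
import subprocess
import sys
from pathlib import Path


def build_inference_command(video, landmarks, audio,
                            checkpoint="./asserts/clip_training_DINet_256mouth.pth",
                            mouth_region_size=256):
    """Return the argument list for inference.py, using the flags shown above."""
    return [
        sys.executable, "inference.py",
        f"--mouth_region_size={mouth_region_size}",
        f"--source_video_path={video}",
        f"--source_openface_landmark_path={landmarks}",
        f"--driving_audio_path={audio}",
        f"--pretrained_clip_DINet_path={checkpoint}",
    ]


def missing_inputs(*paths):
    """Return the given paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Hypothetical file names; replace them with your own video, CSV, and audio.
    video, landmarks, audio = "my_video.mp4", "my_video.csv", "my_audio.wav"
    missing = missing_inputs(video, landmarks, audio)
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        # Run from the repository root so that inference.py is found.
        subprocess.run(build_inference_command(video, landmarks, audio), check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.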

Training

Data Processing

We release the video processing code used for the HDTF dataset. You can also use it to process custom videos.

  1. Download the videos from the HDTF dataset. Split them according to xx_annotion_time.txt; do not crop or resize them.

  2. Resample all split videos to 25fps and put them in "./asserts/split_video_25fps". Two example videos are provided in "./asserts/split_video_25fps". We use software to resample the videos. The list of training videos used in our experiment is in "./asserts/training_video_name.txt".

  3. Use OpenFace to detect smooth facial landmarks in all videos. Put all ".csv" results in "./asserts/split_video_25fps_landmark_openface". Two example csv files are provided in "./asserts/split_video_25fps_landmark_openface".

  4. Extract frames from all videos and save them in "./asserts/split_video_25fps_frame". Run
python data_processing.py --extract_video_frame

  5. Extract audio from all videos and save it in "./asserts/split_video_25fps_audio". Run
python data_processing.py --extract_audio

  6. Extract deepspeech features from all audio files and save them in "./asserts/split_video_25fps_deepspeech". Run
python data_processing.py --extract_deep_speech

  7. Crop faces from all videos and save the images in "./asserts/split_video_25fps_crop_face". Run
python data_processing.py --crop_face

  8. Generate the training json file "./asserts/training_json.json". Run
python data_processing.py --generate_training_json
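
The five data_processing.py steps above are listed in the order they are meant to run, so they can be chained in a small driver script. The following is a minimal sketch, not part of the repository: `step_commands` and `run_steps` are hypothetical helpers, and the only details taken from this README are the script name and its five flags.

```python
import subprocess
import sys

# Flags of data_processing.py, in the order listed above.
STEPS = [
    "--extract_video_frame",
    "--extract_audio",
    "--extract_deep_speech",
    "--crop_face",
    "--generate_training_json",
]


def step_commands(start_at=None):
    """Return one command per step, optionally resuming from a given flag."""
    flags = STEPS if start_at is None else STEPS[STEPS.index(start_at):]
    return [[sys.executable, "data_processing.py", flag] for flag in flags]


def run_steps(start_at=None, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in step_commands(start_at):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises on a non-zero exit code, so later steps
            # do not run on incomplete output from an earlier one.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_steps(dry_run=True)  # set dry_run=False to actually process the data
```

If a step fails partway, `run_steps(start_at="--crop_face", dry_run=False)` resumes from that step instead of repeating the earlier ones.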

Training models

We split the training process into a frame training stage and a clip training stage. In the frame training stage, we use a coarse-to-fine strategy, so you can train the model at an arbitrary resolution.

Frame training stage.

In frame training stage, we only use perception loss and GAN loss.

  1. First, train the DINet at 104x80 resolution (mouth region is 64x64). Run
python train_DINet_frame.py --augment_num=32 --mouth_region_size=64 --batch_size=24 --result_path=./asserts/training_model_weight/frame_training_64

You can stop the training when the loss converges (we stopped at about 270 epochs).

  2. Load the pretrained model (face:104x80 & mouth:64x64) and train the DINet at a higher resolution (face:208x160 & mouth:128x128). Run
python train_DINet_frame.py --augment_num=100 --mouth_region_size=128 --batch_size=80 --coarse2fine --coarse_model_path=./asserts/training_model_weight/frame_training_64/xxxxxx.pth --result_path=./asserts/training_model_weight/frame_training_128

You can stop the training when the loss converges (we stopped at about 200 epochs).

  3. Load the pretrained model (face:208x160 & mouth:128x128) and train the DINet at a higher resolution (face:416x320 & mouth:256x256). Run
python train_DINet_frame.py --augment_num=20 --mouth_region_size=256 --batch_size=12 --coarse2fine --coarse_model_path=./asserts/training_model_weight/frame_training_128/xxxxxx.pth --result_path=./asserts/training_model_weight/frame_training_256

You can stop the training when the loss converges (we stopped at about 200 epochs).
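
In the three stages above, the face size keeps a fixed ratio to the mouth region size (104/64 = 13/8 and 80/64 = 5/4). The following is a minimal sketch that reproduces the listed sizes and assembles the stage commands. It assumes the same ratio holds at other mouth sizes, which this README does not state; `face_size` and `frame_stage_command` are hypothetical helpers, and the coarse checkpoint name must be supplied by you because the README leaves it as xxxxxx.pth.

```python
import sys

# Settings copied from the three frame training commands above.
STAGES = [
    {"mouth": 64, "augment_num": 32, "batch_size": 24},
    {"mouth": 128, "augment_num": 100, "batch_size": 80},
    {"mouth": 256, "augment_num": 20, "batch_size": 12},
]


def face_size(mouth_region_size):
    """Return the face size pair in the order the README lists it (e.g. 104x80).

    Assumption: the 13/8 and 5/4 ratios seen in the three listed stages
    also apply to other mouth region sizes.
    """
    return mouth_region_size * 13 // 8, mouth_region_size * 5 // 4


def frame_stage_command(stage, coarse_checkpoint=None):
    """Build the train_DINet_frame.py command for one stage."""
    cmd = [
        sys.executable, "train_DINet_frame.py",
        f"--augment_num={stage['augment_num']}",
        f"--mouth_region_size={stage['mouth']}",
        f"--batch_size={stage['batch_size']}",
    ]
    if coarse_checkpoint is not None:
        # Stages after the first start from the previous stage's weights.
        cmd += ["--coarse2fine", f"--coarse_model_path={coarse_checkpoint}"]
    cmd.append(
        f"--result_path=./asserts/training_model_weight/frame_training_{stage['mouth']}"
    )
    return cmd


if __name__ == "__main__":
    for stage in STAGES:
        first, second = face_size(stage["mouth"])
        print(f"mouth {stage['mouth']}x{stage['mouth']} -> face {first}x{second}")
```

The script only prints the size of each stage; the stages still have to be run one after another, choosing a checkpoint from the previous stage once its loss has converged.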

Clip training stage.

In the clip training stage, we use perception loss, frame/clip GAN loss, and sync loss. Load the pretrained frame model (face:416x320 & mouth:256x256) and the pretrained syncnet model (mouth:256x256), then train the DINet in the clip setting. Run

python train_DINet_clip.py --augment_num=3 --mouth_region_size=256 --batch_size=3 --pretrained_syncnet_path=./asserts/syncnet_256mouth.pth --pretrained_frame_DINet_path=./asserts/training_model_weight/frame_training_256/xxxxx.pth --result_path=./asserts/training_model_weight/clip_training_256

You can stop the training when the loss converges and select the best model (our best model is at epoch 160).

Acknowledgements

The AdaAT implementation is borrowed from AdaAT. The deepspeech feature extraction is borrowed from AD-NeRF. The basic modules are borrowed from first-order. Thanks to the authors for releasing their code.
