This is the official pytorch implementation of the TIP paper "Generating Visually Aligned Sound from Videos" and the corresponding Visually Aligned Sound (VAS) dataset.
Demo videos containing sound generation results can be found here.
- We release the pre-computed features for the test set of the Dog category, together with the pre-trained RegNet. You can use them to generate dog sounds yourself. (23/11/2020)
Clone this repository into a directory, which we refer to as `REGNET_ROOT`.
git clone https://github.com/PeihaoChen/regnet
cd regnet
Create a new Conda environment.
conda create -n regnet python=3.7.1
conda activate regnet
Install PyTorch and other dependencies.
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0
conda install ffmpeg -n regnet -c conda-forge
pip install -r requirements.txt
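Before moving on, it can help to confirm the environment resolved correctly. A minimal check, assuming the `regnet` env is active (this inline script is illustrative, not part of the repo):

```shell
# Verify that the pinned packages are importable in the active env.
python - <<'EOF'
import importlib.util

for pkg in ("torch", "torchvision"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'found' if found else 'NOT INSTALLED'}")
EOF
```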
In our paper, we collect 8 sound types (Dog, Fireworks, Drum, and Baby from VEGAS; Gun, Sneeze, Cough, and Hammer from AudioSet) to build our Visually Aligned Sound (VAS) dataset.
Please first download the VAS dataset and unzip it into the `$REGNET_ROOT/data/` folder.
For each sound type from AudioSet, we download all videos from YouTube and clean the data on Amazon Mechanical Turk (AMT) following the same procedure as VEGAS.
unzip ./data/VAS.zip -d ./data
Run `data_preprocess.sh` to preprocess the data and extract RGB and optical flow features.
Note: the script we provide for calculating optical flow is easy to run but resource-consuming and slow. We strongly recommend referring to the TSN repository and its pre-built docker image (our paper also uses this solution) to speed up optical flow extraction and to strictly reproduce the results.
source data_preprocess.sh
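Preprocessing can take a while; a quick way to confirm it finished is to check for the feature folders that the training command below expects (paths taken from the Dog example; adjust for other categories):

```shell
# Check that the expected feature directories exist after preprocessing.
for d in \
    data/features/dog/feature_rgb_bninception_dim1024_21.5fps \
    data/features/dog/feature_flow_bninception_dim1024_21.5fps \
    data/features/dog/melspec_10s_22050hz
do
  if [ -d "$d" ]; then echo "OK       $d"; else echo "MISSING  $d"; fi
done
```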
Train RegNet from scratch. The results will be saved to `ckpt/dog`.
CUDA_VISIBLE_DEVICES=7 python train.py \
save_dir ckpt/dog \
auxiliary_dim 32 \
rgb_feature_dir data/features/dog/feature_rgb_bninception_dim1024_21.5fps \
flow_feature_dir data/features/dog/feature_flow_bninception_dim1024_21.5fps \
mel_dir data/features/dog/melspec_10s_22050hz \
checkpoint_path ''
If training stops unexpectedly, you can resume it from the latest checkpoint.
CUDA_VISIBLE_DEVICES=7 python train.py \
-c ckpt/dog/opts.yml \
checkpoint_path ckpt/dog/checkpoint_018081
During inference, RegNet first generates a visually aligned spectrogram and then uses WaveNet as a vocoder to synthesize the waveform from that spectrogram. You should first download our trained WaveNet models for the different sound categories (Dog, Fireworks, Drum, Baby, Gun, Sneeze, Cough, Hammer).
The generated spectrograms and waveforms will be saved to `ckpt/dog/inference_result`.
CUDA_VISIBLE_DEVICES=7 python test.py \
-c ckpt/dog/opts.yml \
aux_zero True \
checkpoint_path ckpt/dog/checkpoint_041000 \
save_dir ckpt/dog/inference_result \
wavenet_path /path/to/wavenet_dog.pth
If you want to train your own WaveNet model, you can use the WaveNet vocoder repository.
git clone https://github.com/r9y9/wavenet_vocoder && cd wavenet_vocoder
git checkout 2092a64
You can also use our pre-trained RegNet and pre-computed features for generating visually aligned sounds.
First, download and unzip the pre-computed features (Dog) into the `./data/features/dog` folder.
cd ./data/features/dog
tar -xvf features_dog_testset.tar # extract
Second, download and unzip our pre-trained RegNet (Dog) into the `./ckpt/dog` folder.
cd ./ckpt/dog
tar -xvf RegNet_dog_checkpoint_041000.tar # extract (we already cd'ed into ./ckpt/dog)
Third, run the inference code.
CUDA_VISIBLE_DEVICES=0 python test.py \
-c config/dog_opts.yml \
aux_zero True \
checkpoint_path ckpt/dog/checkpoint_041000 \
save_dir ckpt/dog/inference_result \
wavenet_path /path/to/wavenet_dog.pth
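After inference finishes, you can sanity-check the generated audio with ffprobe (installed alongside ffmpeg earlier). The exact file names depend on what `test.py` writes into `save_dir`, so the glob below is an assumption:

```shell
# Print the duration of each generated waveform, if any exist yet.
found=0
for f in ckpt/dog/inference_result/*.wav; do
  [ -e "$f" ] || continue  # glob did not match anything
  found=1
  printf '%s: %ss\n' "$f" \
    "$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$f")"
done
[ "$found" -eq 1 ] || echo "no wav files found in ckpt/dog/inference_result"
```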
Enjoy your experiments!
Please cite the following paper if you find RegNet useful in your research:
@Article{chen2020regnet,
  author  = {Chen, Peihao and Zhang, Yang and Tan, Mingkui and Xiao, Hongdong and Huang, Deng and Gan, Chuang},
  title   = {Generating Visually Aligned Sound from Videos},
  journal = {IEEE Transactions on Image Processing},
  year    = {2020},
}
For any questions, please file an issue or contact:
Peihao Chen: phchencs@gmail.com
Hongdong Xiao: xiaohongdonghd@gmail.com