This repository is an implementation of the waveform synthesizer in DiffWave: A Versatile Diffusion Model for Audio Synthesis (Kong et al. 2021). It also has code to reproduce the SaShiMi+DiffWave experiments from It's Raw! Audio Generation with State-Space Models (Goel et al. 2022).
This is a fork/combination of the implementations philsyn/DiffWave-unconditional and philsyn/DiffWave-Vocoder. This repo uses Git LFS to store model checkpoints, which unfortunately does not work with public forks. For this reason it is not an official GitHub fork.
This repository aims to provide a clean implementation of the DiffWave audio diffusion model.
The `checkpoints` branch of this repository has the original code used to reproduce experiments from the SaShiMi paper (instructions below).
The `master` branch has the latest versions of the S4/SaShiMi model and can be used to train new models from scratch.
Compared to the parent fork for DiffWave, this repository has:
- Both unconditional (SC09) and vocoding (LJSpeech) waveform synthesis, designed so that new datasets are easy to add
- Significantly improved infrastructure and documentation
- Configuration system with Hydra for modular configs and flexible command-line API
- Logging with WandB instead of TensorBoard, including automatically generating and uploading samples during training
- Vocoding does not require a separate pre-processing step to generate spectrograms, making it easier and less error-prone to use
- Option to replace WaveNet with the SaShiMi backbone (based on the S4 layer)
- Pretrained checkpoints and samples for both DiffWave (+WaveNet) and DiffWave+SaShiMi
These are some features that would be nice to add. PRs are very welcome!
- Use the pip S4 package once it's released, instead of manually updating the standalone files
- Mixed-precision training
- Fast inference procedure from later versions of the DiffWave paper
- Option to restore the original TensorBoard logging instead of WandB (the code is still there, just commented out)
- Consolidate the WaveNet and SaShiMi backbones more cleanly, with the diffusion logic factored out
A basic experiment can be run with `python train.py`.
This default config is for SC09 unconditional generation with the SaShiMi backbone.
Configuration is managed by Hydra.
Config files are under `configs/`.
Examples of different configs and running experiments via command line are provided throughout this README.
Hydra has a steeper learning curve than standard `argparse`-based workflows, but offers much more flexibility and better experiment management. Feel free to file issues for help with configs.
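For example, any config entry can be overridden from the command line with Hydra's `key=value` syntax; the following command is only an illustration combining flags that appear elsewhere in this README (the values are placeholders, not recommended settings):
python train.py model=wavenet train.batch_size_per_gpu=2 wandb.mode=disabled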
By default, all available GPUs are used (according to `torch.cuda.device_count()`).
You can specify which GPUs to use by setting the `CUDA_VISIBLE_DEVICES` environment variable before running the training module, e.g. `CUDA_VISIBLE_DEVICES=0,1 python train.py`.
Unconditional generation uses the SC09 dataset by default, while vocoding uses LJSpeech.
The config entry `dataset.data_path` should point to the desired folder, e.g. `data/sc09` or `data/LJSpeech-1.1/wavs`.
For SC09, extract the Speech Commands dataset and move the digit subclasses into a separate folder, e.g. `./data/sc09/{zero,one,two,three,four,five,six,seven,eight,nine}`.
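One possible way to do this from the shell, assuming the Speech Commands archive was extracted to `./speech_commands/` (adjust the source path to wherever you extracted it):
mkdir -p data/sc09
for d in zero one two three four five six seven eight nine; do mv speech_commands/$d data/sc09/; done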
Download the LJSpeech dataset into a folder. For example (as of 2022/06/28):
mkdir data && cd data
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xvf LJSpeech-1.1.tar.bz2
The waveforms should be organized in the folder `./data/LJSpeech-1.1/wavs`.
All models use the DiffWave diffusion model with either a WaveNet or SaShiMi backbone.
The model configuration is controlled by the `model` dictionary inside the config file.
The flags `in_channels` and `out_channels` control the data dimension.
The three flags `diffusion_step_embed_dim_{in,mid,out}` control DiffWave parameters for the diffusion step embedding.
The flag `unconditional` determines whether to do unconditional or conditional generation.
These flags are common to both backbones.
The WaveNet backbone is used by setting `model._name_=wavenet`.
The parameters `res_channels`, `skip_channels`, `num_res_layers`, and `dilation_cycle` control the WaveNet backbone.
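Like any config entry, these can be overridden from the command line; for example (values are illustrative only, see the config files under `configs/` for the actual defaults):
python train.py model=wavenet model.res_channels=128 model.skip_channels=128 model.num_res_layers=30 model.dilation_cycle=10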
The SaShiMi backbone is used by setting `model._name_=sashimi`.
The parameters are:
unet: # If true, use S4 layers in both the downsample and upsample parts of the backbone. If false, use S4 layers in only the upsample part.
d_model: # Starting model dimension of outermost stage
n_layers: # Number of layers per stage
pool: # List of pooling factors per stage (e.g. [4, 4] means three total stages, pooling by a factor of 4 in between each)
expand: # Multiplicative factor to increase model dimension between stages
ff: # Feedforward expansion factor: MLP block has dimensions d_model -> d_model*ff -> d_model
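As with the WaveNet flags, these can be overridden from the command line; a hypothetical example (values chosen for illustration, not recommended settings):
python train.py model=sashimi model.d_model=64 model.n_layers=6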
A basic experiment can be run with `python train.py`, which defaults to `python train.py experiment=sc09` (SC09 unconditional waveform synthesis).
Experiments are saved under `exp/<run>` with an automatically generated run name identifying the experiment (model and setting). Checkpoints are saved to `exp/<run>/checkpoint/` and generated audio samples to `exp/<run>/waveforms/`.
Set `wandb.mode=online` to turn on WandB logging, or `wandb.mode=disabled` to turn it off.
Standard wandb arguments such as entity and project are configurable.
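For example (the `wandb.project` and `wandb.entity` key names are assumptions based on standard WandB arguments; check the `wandb` section of the config for the exact names):
python train.py wandb.mode=online wandb.project=<project> wandb.entity=<entity>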
To resume from the checkpoint `exp/<run>/checkpoint/1000.pkl`, simply re-run the same training command with the additional flag `train.ckpt_iter=1000`.
Passing in `train.ckpt_iter=max` resumes from the last checkpoint, and `train.ckpt_iter=-1` trains from scratch.
Use `wandb.id=<id>` to resume logging to a previous run.
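Putting these together, a resumption command might look like the following (hypothetical example):
python train.py train.ckpt_iter=max wandb.id=<id> wandb.mode=online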
After training with `python train.py <flags>`, `python generate.py <flags>` generates samples according to the `generate` dictionary of the config.
For example,
python generate.py <flags> generate.ckpt_iter=500000 generate.n_samples=256 generate.batch_size=128
generates 256 samples per GPU, at a batch size of 128 per GPU, from the model specified in the config at checkpoint iteration 500000.
Generated samples will be stored in `exp/<run>/waveforms/<generate.ckpt_iter>/`.
- After downloading the data, make the config's `dataset.data_path` point to the `.wav` files
- Toggle `model.unconditional=false`
- Pass in the name of a `.wav` file for generation, e.g. `generate.mel_name=LJ001-0001`. At every checkpoint, vocoded samples for this audio file will be logged to WandB
Currently, vocoding is only set up for the LJSpeech dataset. See the config `configs/experiment/ljspeech.yaml` for details.
The following is an example command for LJSpeech vocoding with a smaller SaShiMi model. A checkpoint for this model at 200k steps is also provided.
python train.py experiment=ljspeech model=sashimi model.d_model=32 wandb.mode=online
Another example with a smaller WaveNet backbone, similar to the results from the DiffWave paper:
python train.py experiment=ljspeech model=wavenet model.res_channels=64 model.skip_channels=64 wandb.mode=online
Generation can be done in the usual way, conditioning on any spectrogram, e.g.
python generate.py experiment=ljspeech model=sashimi model.d_model=32 generate.mel_name=LJ001-0002
Other DiffWave vocoder implementations such as https://github.com/philsyn/DiffWave-Vocoder and https://github.com/lmnt-com/diffwave require first generating spectrograms in a separate pre-processing step. This implementation does not require that step, which we find more convenient. However, pre-processing and saving the spectrograms is still possible.
To generate a folder of spectrograms according to the `dataset` config, run the `mel2samp` script and specify an output directory (in the example below, 256 refers to the hop size):
python -m dataloaders.mel2samp experiment=ljspeech +output_dir=mel256
Then, during training or generation, add the flag `generate.mel_path=mel256` to use the pre-processed spectrograms, e.g.
python generate.py experiment=ljspeech model=sashimi model.d_model=32 generate.mel_name=LJ001-0002 generate.mel_path=mel256
The remainder of this README pertains only to pre-trained models from the SaShiMi paper.
The `checkpoints` branch (`git checkout checkpoints`) provides checkpoints for these models.
This branch is meant only for reproducing generated samples from the SaShiMi paper (ICML 2022); please do not train models from scratch with this code. The older models in this branch have issues that are explained below.
Training from scratch is covered in the previous part of this README and should be done from the `master` branch.
Install Git LFS and run `git lfs pull` to download the checkpoints.
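A typical sequence, assuming the Git LFS binary is already installed, would be:
git checkout checkpoints
git lfs install
git lfs pull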
For each of the provided checkpoints, 16 audio samples are included.
More pre-generated samples for all models from the SaShiMi paper can be downloaded from: https://huggingface.co/krandiash/sashimi-release/tree/main/samples/sc09
The four models below correspond to "sashimi-diffwave", "sashimi-diffwave-small", "diffwave", and "diffwave-small", respectively. Command lines are also provided to reproduce these samples (up to random seed).
The experiments in the SaShiMi paper used an outdated version of S4 from January 2022, which predates V2 (February 2022) of the S4 repository; as of July 2022, S4 is on V3.
- Experiment folder: `exp/unet_d128_n6_pool_2_expand2_ff2_T200_betaT0.02_uncond/`
- Train from scratch: `python train.py model=sashimi train.ckpt_iter=-1`
- Resume training: `python train.py model=sashimi train.ckpt_iter=max train.learning_rate=1e-4` (as described in the paper, this model used a manual learning rate decay after 500k steps)
Generation examples:
- `python generate.py experiment=sc09 model=sashimi` (latest model at 800k steps)
- `python generate.py experiment=sc09 model=sashimi generate.ckpt_iter=500000` (earlier model at 500k steps)
- `python generate.py generate.n_samples=256 generate.batch_size=128` (generates 256 samples per GPU with the largest batch that fits on an A100; the paper uses this command to generate 2048 samples on an 8xA100 machine for evaluation metrics)
Experiment folder: `exp/unet_d64_n6_pool_2_expand2_ff2_T200_betaT0.02_uncond/`
Train (since the model is smaller, you can increase the batch size and logged samples per checkpoint):
python train.py experiment=sc09 model=sashimi model.d_model=64 train.batch_size_per_gpu=4 generate.n_samples=32
Generate:
python generate.py experiment=sc09 model=sashimi model.d_model=64 generate.n_samples=256 generate.batch_size=256
The WaveNet backbone provided in the parent fork had a small bug where it used `x += y` instead of `x = x + y`.
This can cause a difficult-to-trace error in some PyTorch + environment combinations (but sometimes it works; I never figured out when it's ok).
These two lines are fixed in the master branch of this repo.
However, for some reason, when models are trained with the buggy code and loaded with the corrected code, the model runs without error but produces inconsistent outputs even in inference mode (i.e. generation produces static noise). So this branch for reproducing the checkpoints keeps the incorrect version of these two lines; this allows generating from the pre-trained models, but may not train in some environments. If anyone knows why this happens, I would love to know! Shoot me an email or file an issue!
- Experiment folder: `exp/wnet_h256_d36_T200_betaT0.02_uncond/`
- Usage: `python <train|generate>.py model=wavenet`
More notes:
- The fully trained model (1000000 steps) is the original checkpoint from the parent repo philsyn/DiffWave-unconditional
- The checkpoint at 500000 steps is our version trained from scratch
- These should both be compatible with this codebase (e.g. generation works with both), but for some reason the original `checkpoint/1000000.pkl` file is much smaller than our `checkpoint/500000.pkl`
- I don't remember changing anything in the code that would cause this; perhaps it is due to differences in PyTorch versions or environments?
Experiment folder: `exp/wnet_h128_d30_T200_betaT0.02_uncond/`
Train:
python train.py model=wavenet model.res_channels=128 model.num_res_layers=30 model.dilation_cycle=10 train.batch_size_per_gpu=4 generate.n_samples=32
A shorthand model config is also defined:
python train.py model=wavenet_small train.batch_size_per_gpu=4 generate.n_samples=32
The parent fork has a few pretrained LJSpeech vocoder models. Because of the WaveNet bug, we recommend not using these and simply training from scratch from the `master` branch; these vocoder models are small and faster to train than the unconditional SC09 models.
Feel free to file an issue for help with configs.