This repository is a fork of fatchord's alternative WaveRNN implementation. The original model has been significantly simplified to allow real-time synthesis of high-fidelity speech. This repository also contains a C++ library that can be used for real-time speech synthesis on a single CPU core.
WaveRNN-Pytorch is a vocoder: it converts speech features (e.g. mel spectrograms) into speech audio. One can build a complete text-to-speech pipeline by using, for example, Tacotron-2 to turn text into speech features, and then using this vocoder to produce a sound file.
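As an illustration, a complete pipeline could be glued together as sketched below. The WaveRNNVocoder calls match the Python binding shown at the end of this README; text_to_mel is a hypothetical stand-in for a TTS front end such as Tacotron-2.

```python
# Sketch of a two-stage TTS pipeline (text_to_mel is hypothetical glue code).
import numpy as np
import WaveRNNVocoder

def text_to_mel(text):
    # Stand-in for a TTS front end (e.g. Tacotron-2) that predicts a
    # mel spectrogram of shape (num_mels, frames) from input text.
    raise NotImplementedError

vocoder = WaveRNNVocoder.Vocoder()
vocoder.loadWeights('model_outputs/model.bin')

mel = text_to_mel('Hello, world.')
wav = vocoder.melToWav(mel)  # waveform samples, ready to be written to a file
```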
- 10-bit quantized wav modeling for higher quality
- Weight pruning for reducing model complexity
- Fast, CPU-only C++ inference library that runs faster than real time on a modern CPU
- Compressed pruned-weight format to keep weight files small
- Python bindings for the C++ library
- Can be used with a Tacotron-2 implementation for TTS
- Real-time inference on modern ARM processors (e.g. inference on a smartphone for high-quality TTS)
- See Wiki
- See "model_outputs" directory
- Python 3
- CUDA >=8.0
- PyTorch >= v1.0
- Python requirements:
pip install -r requirements.txt
- sudo aptitude install libsoundtouch-dev
- cmake, gcc, etc.
- Eigen3 development files:
apt-get install libeigen3-dev
Ensure the above requirements are met.
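A quick sanity check for the Python-side requirements (a minimal sketch):

```python
# Verify the PyTorch and CUDA requirements listed above.
import torch
print('PyTorch:', torch.__version__)               # should be >= 1.0
print('CUDA available:', torch.cuda.is_available())
```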
git clone https://github.com/geneing/WaveRNN-Pytorch.git
cd WaveRNN-Pytorch
pip3 install -r requirements.txt
cd library
mkdir build
cd build
cmake ../src
make
cp WaveRNNVocoder*.so python_install_directory
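Here python_install_directory stands for a directory on your Python path. You can then verify that the binding loads (a minimal smoke test using the same Vocoder class as the usage example at the end of this README):

```python
# Smoke test: the extension module should import and construct without errors.
import WaveRNNVocoder
vocoder = WaveRNNVocoder.Vocoder()
print('WaveRNNVocoder binding loaded OK')
```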
Before running the scripts, you can adjust hyperparameters in hparams.py.
Some hyperparameters that you might want to adjust (see the sketch after this list):
- input_type (the best performing ones are currently bits and raw; see hparams.py for more details)
- batch_size - depends on your GPU memory; for 8 GB of memory you should use batch_size=64
- save_every_step (checkpoint saving frequency)
- evaluate_every_step (evaluation frequency)
- seq_len_factor (sequence length of the training audio; the longer it is, the more GPU memory it takes)
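For illustration, the corresponding entries in hparams.py might look like the sketch below. The values are examples only, not recommended settings; check hparams.py for the actual defaults.

```python
# Illustrative hyperparameter values (check hparams.py for the real defaults).
input_type = 'bits'           # 'bits' and 'raw' currently perform best
batch_size = 64               # fits in 8 GB of GPU memory
save_every_step = 10000       # checkpoint saving frequency (example value)
evaluate_every_step = 5000    # evaluation frequency (example value)
seq_len_factor = 5            # longer sequences need more GPU memory (example value)
```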
If you are planning to use this vocoder together with a TTS network (e.g. Tacotron-2), you should train on exactly the same data. Each implementation of a TTS network uses a slightly different definition of "mel spectrogram". I recommend using the TTS network's preprocessing.
This code has been tested with Tacotron-2 and the M-AILABS dataset. Example:
cd Tacotron-2
python3 preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=True --output training_data
If you are using the vocoder as a standalone library, you can use the native preprocessing. It processes raw wav files into corresponding mel-spectrogram and wav files according to the audio processing hyperparameters.
Example usage:
python3 preprocess.py --output_dir training_data /path/to/my/wav/files
This will process all the .wav files in the folder /path/to/my/wav/files and save the results in the directory given by --output_dir (by default, a local directory called data_dir).
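Conceptually, the native preprocessing does something like the sketch below. This is a simplified illustration using librosa, not the actual script, and the parameter values are stand-ins for the settings in hparams.py.

```python
# Simplified sketch of the wav -> (mel, wav) preprocessing step.
import glob, os
import numpy as np
import librosa

sample_rate, num_mels, hop_length = 22050, 80, 275  # stand-ins for hparams.py values
out_dir = 'data_dir'
os.makedirs(out_dir, exist_ok=True)

for path in glob.glob('/path/to/my/wav/files/*.wav'):
    wav, _ = librosa.load(path, sr=sample_rate)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sample_rate, n_mels=num_mels, hop_length=hop_length)
    name = os.path.splitext(os.path.basename(path))[0]
    np.save(os.path.join(out_dir, name + '_mel.npy'), mel.astype(np.float32))
    np.save(os.path.join(out_dir, name + '_wav.npy'), wav.astype(np.float32))
```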
Start the training process. Checkpoints are by default stored in the local directory checkpoints. The script will automatically save a checkpoint when terminated by ctrl + c.
Example 1: starting a new model for training from Tacotron-2 data
python3 train.py --dataset Tacotron training_data
training_data is the directory containing the processed files.
Example 2: starting a new model for training on natively preprocessed data
python3 train.py --dataset Audiobooks training_data
Example 3: restoring training from a checkpoint
python3 train.py training_data --checkpoint=checkpoints/checkpoint0010000.pth
Evaluation .wav files and plots are saved in checkpoints/eval.
First, you need to train the model for at least (hparams.start_prune + hparams.prune_steps) steps to ensure that the model is fully pruned.
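For reference, gradual magnitude pruning usually ramps sparsity from zero up to a target level over prune_steps. The cubic schedule below (after Zhu & Gupta, 2017) is a common choice and is shown only as a sketch; the exact schedule used here lives in the training code.

```python
# Cubic sparsity ramp for gradual magnitude pruning (sketch, after Zhu & Gupta 2017).
# start_prune and prune_steps mirror the hparams.py names; target_sparsity is assumed.
def sparsity_at(step, start_prune, prune_steps, target_sparsity=0.9):
    if step < start_prune:
        return 0.0
    t = min((step - start_prune) / prune_steps, 1.0)
    return target_sparsity * (1.0 - (1.0 - t) ** 3)
```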
In order to use the C++ library, you need to convert the trained network to the compressed model format.
python3 convert_model.py --output-dir model_outputs checkpoints/checkpoint_step000400000.pth
Example 1: using the python3 interface to the C++ library
import WaveRNNVocoder
import numpy as np

vocoder = WaveRNNVocoder.Vocoder()
vocoder.loadWeights('model_outputs/model.bin')

mel = np.load(fname)  # fname: path to a saved mel spectrogram (.npy);
                      # make sure that mel.shape[0] == hparams.num_mels
wav = vocoder.melToWav(mel)
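To write the synthesized waveform to disk, you can use scipy, for example (a usage sketch; the sample rate must match hparams.sample_rate, and 22050 here is only a placeholder):

```python
# Save the synthesized float waveform as a 16-bit wav file.
import numpy as np
from scipy.io import wavfile

sample_rate = 22050  # placeholder; must match hparams.sample_rate
wavfile.write('output.wav', sample_rate, (np.asarray(wav) * 32767).astype(np.int16))
```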