Flow-based TTS with Robust Alignment Learning, Diverse Synthesis, and Generative Modeling and Fine-Grained Control over Low Dimensional (F0 and Energy) Speech Attributes.
This repository contains the source code and several checkpoints for our work based on RADTTS. RADTTS is a normalizing-flow-based TTS framework with state-of-the-art acoustic fidelity and a highly robust audio-transcription alignment module. Our project page and some samples can be found here, with relevant works listed here.
This repository can be used to train the following models:
- A normalizing-flow bipartite architecture for mapping text to mel spectrograms
- A variant of the above, conditioned on F0 and Energy
- Normalizing flow models for explicitly modeling text-conditional phoneme duration, fundamental frequency (F0), and energy
- A standalone alignment module for learning the unsupervised text-audio alignments necessary for TTS training
We provide a checkpoint and config for a HiFi-GAN vocoder trained on LibriTTS 100 and 360.
For a HiFi-GAN vocoder trained on LJS, please download the v1 model provided by the HiFi-GAN authors here.
Model name | Description | Dataset |
---|---|---|
RADTTS++DAP-LJS | RADTTS model conditioned on F0 and Energy with deterministic attribute predictors | LJSpeech Dataset |
We will soon provide more pre-trained RADTTS models with generative attribute predictors trained on LJS and LibriTTS. Stay tuned!
- Clone this repo:
git clone https://github.com/NVIDIA/RADTTS.git
- Install python requirements or build docker image
- Install python requirements:
pip install -r requirements.txt
- Update the filelists inside the filelists folder and json configs to point to your data
  - basedir: the folder containing the filelists and the audiodir
  - audiodir: name of the audiodir
  - filelist: | (pipe) separated text file with relative audiopath, text, speaker, and optionally categorical label and audio duration in seconds
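As an illustration of the filelist layout described above, the following is a minimal sketch of reading one filelist line in Python. The `FilelistEntry` type, the `parse_filelist_line` helper, and the sample line are hypothetical and are not part of this repository; the field order (audiopath, text, speaker, optional label, optional duration) is assumed from the description, and the sketch assumes the text itself contains no pipe character.

```python
from typing import NamedTuple, Optional


class FilelistEntry(NamedTuple):
    """One row of a pipe-separated filelist (hypothetical helper type)."""
    audiopath: str                    # path relative to basedir/audiodir
    text: str                         # transcription
    speaker: str                      # speaker name
    label: Optional[str] = None       # optional categorical label
    duration: Optional[float] = None  # optional audio duration in seconds


def parse_filelist_line(line: str) -> FilelistEntry:
    """Split a single filelist line on '|' and convert the optional fields."""
    fields = line.rstrip("\n").split("|")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}: {line!r}")
    audiopath, text, speaker = fields[:3]
    # The last two columns are optional; treat missing or empty values as None.
    label = fields[3] if len(fields) > 3 and fields[3] else None
    duration = float(fields[4]) if len(fields) > 4 and fields[4] else None
    return FilelistEntry(audiopath, text, speaker, label, duration)


# Made-up sample line, for illustration only.
entry = parse_filelist_line("wavs/sample_0001.wav|Hello world.|ljs|other|1.9")
print(entry)
```

A check like this can be run over every line of a new filelist before training to catch rows with missing columns early.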
- Train the decoder (not conditioned on F0 and Energy)
python train.py -c config_ljs_radtts.json -p train_config.output_directory=outdir
- Further train with the duration predictor
python train.py -c config_ljs_radtts.json -p train_config.output_directory=outdir_dir train_config.warmstart_checkpoint_path=model_path.pt model_config.include_modules="decatndur"
- Train the decoder (conditioned on F0 and Energy)
python train.py -c config_ljs_decoder.json -p train_config.output_directory=outdir
- Train the attribute predictor: autoregressive flow (agap), bi-partite flow (bgap) or deterministic (dap)
python train.py -c config_ljs_{agap,bgap,dap}.json -p train_config.output_directory=outdir_wattr train_config.warmstart_checkpoint_path=model_path.pt
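The training commands above pass `-p` arguments of the form `section.key=value` to override entries of the JSON config. The sketch below shows one way such dotted overrides could be applied to a nested config dictionary. It is an assumption-laden illustration: `apply_overrides` is a hypothetical helper, and the repository's own parsing in `train.py` may differ in how it interprets values.

```python
import json
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply 'a.b.c=value' style overrides to a nested dict, in place."""
    for item in overrides:
        dotted_key, sep, raw_value = item.partition("=")
        if not sep:
            raise ValueError(f"override must look like key=value: {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Walk down the nested config, creating sections that do not exist yet.
            node = node.setdefault(name, {})
        try:
            # Interpret numbers, booleans, lists, and quoted strings as JSON.
            value = json.loads(raw_value)
        except json.JSONDecodeError:
            # Anything else (for example a bare path) stays a plain string.
            value = raw_value
        node[leaf] = value
    return config


# Hypothetical starting config with the two keys used in the command above.
config = {"train_config": {"output_directory": None, "warmstart_checkpoint_path": ""}}
apply_overrides(config, [
    "train_config.output_directory=outdir_wattr",
    "train_config.warmstart_checkpoint_path=model_path.pt",
])
print(json.dumps(config, indent=2))
```

Keeping overrides on the command line leaves the JSON config files unchanged between the decoder and attribute-predictor training stages.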
- Download our pre-trained model, then start training from it while ignoring the speaker embedding table
python train.py -c config.json -p train_config.ignore_layers_warmstart=["speaker_embedding.weight"] train_config.warmstart_checkpoint_path=model_path.pt
python -m torch.distributed.launch --use_env --nproc_per_node=NUM_GPUS_YOU_HAVE train.py -c config.json -p train_config.output_directory=outdir
python inference.py -c CONFIG_PATH -r RADTTS_PATH -v HG_PATH -k HG_CONFIG_PATH -t TEXT_PATH -s ljs --speaker_attributes ljs --speaker_text ljs -o results/
python inference_voice_conversion.py --radtts_path RADTTS_PATH --radtts_config_path RADTTS_CONFIG_PATH --vocoder_path HG_PATH --vocoder_config_path HG_CONFIG_PATH --f0_mean=211.413 --f0_std=46.6595 --energy_mean=0.724884 --energy_std=0.0564605 --output_dir=results/ -p data_config.validation_files="{'Dummy': {'basedir': 'data/', 'audiodir':'22khz', 'filelist': 'vc_audiopath_txt_speaker_emotion_duration_filelist.txt'}}"
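The nested quoting in the `data_config.validation_files` override is easy to get wrong when typed by hand. The sketch below assembles the same voice conversion command as an argument list in Python, using only the flags and values shown above. `RADTTS_PATH`, `RADTTS_CONFIG_PATH`, `HG_PATH`, and `HG_CONFIG_PATH` remain placeholders, and the check with `ast.literal_eval` only confirms that the override value is a well-formed Python literal; how the script itself parses it is not assumed here.

```python
import ast
import shlex

# Dataset description passed through the -p override, as in the command above.
validation_files = {
    "Dummy": {
        "basedir": "data/",
        "audiodir": "22khz",
        "filelist": "vc_audiopath_txt_speaker_emotion_duration_filelist.txt",
    }
}

argv = [
    "python", "inference_voice_conversion.py",
    "--radtts_path", "RADTTS_PATH",                 # placeholder
    "--radtts_config_path", "RADTTS_CONFIG_PATH",   # placeholder
    "--vocoder_path", "HG_PATH",                    # placeholder
    "--vocoder_config_path", "HG_CONFIG_PATH",      # placeholder
    "--f0_mean=211.413",
    "--f0_std=46.6595",
    "--energy_mean=0.724884",
    "--energy_std=0.0564605",
    "--output_dir=results/",
    # repr() of the dict yields the single-quoted literal used in the command.
    "-p", f"data_config.validation_files={validation_files!r}",
]

# Sanity check: the override value round-trips as a Python literal.
_, _, literal = argv[-1].partition("=")
assert ast.literal_eval(literal) == validation_files

# Shell-safe rendering of the command, suitable for copy and paste.
command_line = shlex.join(argv)
print(command_line)
```

The list in `argv` can also be handed directly to `subprocess.run`, which avoids shell quoting altogether.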
Filename | Description | Nota bene |
---|---|---|
config_ljs_decoder.json | Config for the decoder conditioned on F0 and Energy | |
config_ljs_radtts.json | Config for the decoder not conditioned on F0 and Energy | |
config_ljs_agap.json | Config for the Autoregressive Flow Attribute Predictors | Requires at least pre-trained alignment module |
config_ljs_bgap.json | Config for the Bi-Partite Flow Attribute Predictors | Requires at least pre-trained alignment module |
config_ljs_dap.json | Config for the Deterministic Attribute Predictors | Requires at least pre-trained alignment module |
Unless otherwise specified, the source code within this repository is provided under the MIT License.
The code in this repository is heavily inspired by or makes use of source code from the following works:
- Tacotron implementation from Keith Ito
- STFT code from Prem Seetharaman
- Masked Autoregressive Flows
- Flowtron
- Source for neural spline functions used in this work: https://github.com/ndeutschmann/zunis
- Original Source for neural spline functions: https://github.com/bayesiains/nsf
- Bipartite Architecture based on code from WaveGlow
- HiFi-GAN
- Glow-TTS
Rohan Badlani, Adrian Łańcucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro.
One TTS Alignment to Rule Them All. ICASSP 2022
Kevin J. Shih, Rafael Valle, Rohan Badlani, Adrian Łańcucki, Wei Ping, Bryan Catanzaro.
RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis.
ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models 2021
Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro.
Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows. Technical Report