This repository is tested on Ubuntu 18.04 with:
- Python 3.6.9+
- Cuda 10.1
- CuDNN 7.6.5
- Tensorflow 2.2/2.3
- Tensorflow Addons >= 0.10.0 (If necessary, use 0.10.0 with tensorflow-gpu==2.2)
$ pip install TensorFlowTTS
Examples are included in the repository but are not shipped with the framework. Therefore, to run the latest version of the examples, you need to install from source:
$ git clone https://github.com/sabby993/TensorFlowTTS.git
$ cd TensorFlowTTS
$ pip install .
If you want to upgrade the repository and its dependencies:
$ git pull
$ pip install --upgrade .
- MelGAN released with the paper MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis by Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, Aaron Courville.
- Tacotron-2 released with the paper Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions by Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu.
Here is an audio sample from the validation set in New Zealand English. You can find the audio sample at the end of the Colab notebook; the input text is displayed above the alignment and spectrogram plots. tacotron-2
Prepare a dataset in the following format:
|- [NAME_DATASET]/
|   |- metadata.csv
|   |- wav/
|       |- file1.wav
|       |- ...
Where metadata.csv has the following format: id|transcription. This is an LJSpeech-like format; the nz_cw dataset was modeled on the LJ Speech format so that the LJ Speech processor can be used to pre-process and train it. Note that NAME_DATASET should be one of [ljspeech/kss/baker/libritts/nz_cw].
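For reference, below is a minimal Python sketch to sanity-check this layout before preprocessing (check_dataset is a hypothetical helper, not part of TensorFlowTTS):

import csv
from pathlib import Path

# Hypothetical helper: verify every metadata entry has a matching wav file.
def check_dataset(root):
    root = Path(root)
    wav_dir = root / "wav"
    with open(root / "metadata.csv", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|"):
            utt_id = row[0]  # columns: id|transcription
            assert (wav_dir / (utt_id + ".wav")).exists(), "missing audio for " + utt_id

check_dataset("./nz_cw")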
The preprocessing has two steps:
1. Preprocess audio features
   - Convert characters to IDs
   - Compute mel spectrograms
   - Normalize mel spectrograms to [-1, 1] range
   - Split the dataset into train and validation sets
   - Compute the mean and standard deviation of multiple features from the training split
2. Standardize the mel spectrograms based on the computed statistics
To reproduce the steps above:
tensorflow-tts-preprocess --rootdir ./[ljspeech/nz_cw] --outdir ./dump_[ljspeech/nz_cw] --config preprocess/[ljspeech]_preprocess.yaml --dataset [ljspeech]
tensorflow-tts-normalize --rootdir ./dump_[ljspeech/nz_cw] --outdir ./dump_[ljspeech/nz_cw] --config preprocess/[ljspeech]_preprocess.yaml --dataset [ljspeech]
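Once these commands finish, the generated numpy files can be spot-checked; here is a minimal sketch (the example file names follow the folder layout shown next, and the shapes are assumptions based on the usual id-sequence and mel-spectrogram conventions):

import numpy as np

# Spot-check one preprocessed utterance (example paths; see layout below).
ids = np.load("./dump_ljspeech/train/ids/LJ001-0001-ids.npy")
mel = np.load("./dump_ljspeech/train/raw-feats/LJ001-0001-raw-feats.npy")
print(ids.shape)  # (num_characters,) -- character IDs
print(mel.shape)  # (num_frames, num_mels) -- raw mel spectrogram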
After preprocessing, the structure of the project folder should be:
|- [NAME_DATASET]/
|   |- metadata.csv
|   |- wav/
|       |- file1.wav
|       |- ...
|- dump_[ljspeech/nz_cw]/
|   |- train/
|       |- ids/
|           |- LJ001-0001-ids.npy
|           |- ...
|       |- raw-feats/
|           |- LJ001-0001-raw-feats.npy
|           |- ...
|       |- raw-f0/
|           |- LJ001-0001-raw-f0.npy
|           |- ...
|       |- raw-energies/
|           |- LJ001-0001-raw-energy.npy
|           |- ...
|       |- norm-feats/
|           |- LJ001-0001-norm-feats.npy
|           |- ...
|       |- wavs/
|           |- LJ001-0001-wave.npy
|           |- ...
|   |- valid/
|       |- ids/
|           |- LJ001-0009-ids.npy
|           |- ...
|       |- raw-feats/
|           |- LJ001-0009-raw-feats.npy
|           |- ...
|       |- raw-f0/
|           |- LJ001-0009-raw-f0.npy
|           |- ...
|       |- raw-energies/
|           |- LJ001-0009-raw-energy.npy
|           |- ...
|       |- norm-feats/
|           |- LJ001-0009-norm-feats.npy
|           |- ...
|       |- wavs/
|           |- LJ001-0009-wave.npy
|           |- ...
|   |- stats.npy
|   |- stats_f0.npy
|   |- stats_energy.npy
|   |- train_utt_ids.npy
|   |- valid_utt_ids.npy
|- examples/
|   |- melgan/
|   |- fastspeech/
|   |- tacotron2/
|   ...
- stats.npy contains the mean and std of the mel spectrograms from the training split
- stats_energy.npy contains the mean and std of the energy values from the training split
- stats_f0.npy contains the mean and std of the F0 values from the training split
- train_utt_ids.npy / valid_utt_ids.npy contain the training and validation utterance IDs, respectively

We use a suffix (ids, raw-feats, raw-energy, raw-f0, norm-feats, and wave) for each input type.
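For example, the saved statistics can be loaded and applied by hand; a minimal sketch, assuming stats.npy stores the mean and standard deviation stacked along the first axis:

import numpy as np

# Assumed layout: row 0 is the per-mel-bin mean, row 1 the std.
mean, std = np.load("./dump_ljspeech/stats.npy")
raw = np.load("./dump_ljspeech/train/raw-feats/LJ001-0001-raw-feats.npy")
norm = (raw - mean) / std  # the standardized features stored in norm-feats/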
After the above steps, a folder named 'dump_ljspeech' or 'dump_nz_cw' will be created inside the repository, containing the numpy files of the training and validation datasets.
Now we can start the training process, which requires a GPU, by running the following command:
!CUDA_VISIBLE_DEVICES=0 python examples/tacotron2/train_tacotron2.py \
--train-dir ./dump_ljspeech/train/ \
--dev-dir ./dump_ljspeech/valid/ \
--outdir ./examples/tacotron2/exp/train.tacotron2.v1/ \
--config ./examples/tacotron2/conf/tacotron2.v1.yaml \
--use-norm 1 \
--mixed_precision 0 \
--pretrained ./examples/tacotron2/exp/train.tacotron2.v1/checkpoints/model-120000.h5
The --pretrained argument takes the path of the pretrained model, i.e. the base model on top of which the current dataset is fine-tuned. Training runs for quite some time; checkpoints are saved every 2000 steps, and a checkpoint file is created inside the examples folder (examples/tacotron2/exp/train.tacotron2.v1/checkpoints/ckpt-2000). Training can be resumed from a checkpoint by running the training command with the checkpoint path added after --resume, e.g.
!CUDA_VISIBLE_DEVICES=0 python examples/tacotron2/train_tacotron2.py \
--train-dir ./dump_ljspeech/train/ \
--dev-dir ./dump_ljspeech/valid/ \
--outdir ./examples/tacotron2/exp/train.tacotron2.v1/ \
--config ./examples/tacotron2/conf/tacotron2.v1.yaml \
--use-norm 1 \
--mixed_precision 0 \
--resume ./examples/tacotron2/exp/train.tacotron2.v1/checkpoints/ckpt-2000
TensorBoard graphs for the model trained on LJ Speech -->
TensorBoard graphs for the New Zealand English model fine-tuned on the LJ Speech model -->
The Colab notebook used for training the nz_cw dataset on top of the pretrained LJ Speech model can be found here: Training and Tensorboard Graphs
The complete inference steps are in the Colab notebook here: Inference
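In outline, inference follows the standard TensorFlowTTS pattern below; this is a minimal sketch, and the pretrained model names are illustrative placeholders rather than the fine-tuned New Zealand English checkpoints:

import tensorflow as tf
from tensorflow_tts.inference import AutoProcessor, TFAutoModel

# Illustrative pretrained names; swap in the fine-tuned checkpoints.
processor = AutoProcessor.from_pretrained("tensorspeech/tts-tacotron2-ljspeech-en")
tacotron2 = TFAutoModel.from_pretrained("tensorspeech/tts-tacotron2-ljspeech-en")
melgan = TFAutoModel.from_pretrained("tensorspeech/tts-melgan-ljspeech-en")

input_ids = processor.text_to_sequence("Hello, this is a test.")

# Tacotron-2: text to mel spectrogram (plus stop tokens and alignments).
decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    input_lengths=tf.convert_to_tensor([len(input_ids)], tf.int32),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
)

# MelGAN vocoder: mel spectrogram to waveform.
audio = melgan.inference(mel_outputs)[0, :, 0]

The returned audio tensor can then be written to a wav file (e.g. with the soundfile package).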
To show the difference in accent properties between General American English and New Zealand English, we used RStudio to produce vowel space plots of both synthesized speech models speaking hVd words; the plots are below:
For New Zealand English -->
For General American English -->
For New Zealand Real Voice -->
As the vowel plots show, the Tacotron-2 synthesized voice for New Zealand English is quite close to the real New Zealand English voice in terms of vowel positions in the F1/F2 space, while the plot for the General American English model is quite different. This suggests that training with the New Zealand English dataset shifted the accent properties of the original American English model towards the real New Zealand English voice. The full plots with all the points can be found in the following directory: tensorflow_tts/vowel_plots.
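For reference, this kind of F1/F2 vowel space plot can also be drawn in Python rather than RStudio; a minimal matplotlib sketch, assuming a hypothetical formants.csv with vowel, F1, and F2 columns:

import csv
import matplotlib.pyplot as plt

# Read measured formants (hypothetical file; columns: vowel, F1, F2).
vowels, f1, f2 = [], [], []
with open("formants.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        vowels.append(row["vowel"])
        f1.append(float(row["F1"]))
        f2.append(float(row["F2"]))

fig, ax = plt.subplots()
ax.scatter(f2, f1)
for v, x, y in zip(vowels, f2, f1):
    ax.annotate(v, (x, y))
# Phonetic convention: F2 decreases left-to-right, F1 top-to-bottom.
ax.invert_xaxis()
ax.invert_yaxis()
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
plt.show()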
We trained the Tacotron-2 model for up to 28k steps with the New Zealand English dataset on top of a pre-trained General American English Tacotron-2 model. Both the New Zealand English and General American English models are available at the following link: Models and Notebooks