Self-attention Tacotron

An implementation of "Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language" https://arxiv.org/abs/1810.11960

Notice: Our work in the paper uses a proprietary Japanese speech corpus with manually annotated labels. Since we cannot provide an exact reproduction in public, this repository replaces the dataset-related code with examples for publicly available corpora.

Requirements

Python 3.6 or above is required.

This project uses Bazel as a build tool. It depends on a separate Tacotron2 implementation, and Bazel automatically resolves the dependency at the proper version.

  • Python >= 3.6
  • Bazel >= 0.18.0

If you are not familiar with Bazel, you can use the python command directly by setting up the external dependencies yourself. See this document for details.
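As a hypothetical illustration, a direct invocation might look like the following, assuming the Tacotron2 dependency has been cloned next to this repository and that train.py is the entry point behind bazel run train. All paths are placeholders; the linked document is authoritative.

PYTHONPATH=/path/to/tacotron2:$PYTHONPATH python train.py --source-data-root=/path/to/source/output/dir --target-data-root=/path/to/target/output/dir --checkpoint-dir=/path/to/save/checkpoints --selected-list-dir=examples/vctk --hparam-json-file=examples/vctk/self-attention-tacotron.json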

The following Python packages should be installed.

For training and prediction:

  • tensorflow >= 1.11
  • librosa >= 0.6.1
  • scipy >= 1.1.1
  • matplotlib >= 2.2.2
  • docopt >= 0.6.2

For testing:

  • hypothesis >= 3.59.1

For pre-processing:

  • tensorflow >= 1.11
  • docopt >= 0.6.2
  • pyspark >= 2.3.0
  • unidecode >= 1.0.22
  • inflect >= 1.0.1

Preparing data

The pre-processing phase generates source and target files in TFRecord format, a list file containing the keys that identify each sample, and hyper-parameters. The source and target files have .source.tfrecord and .target.tfrecord extensions, respectively. The list file is named list.csv; you have to split it into train.csv, validation.csv, and test.csv. Hyper-parameters are generated in hparams.json. The important parameters are average_mel_level_db and stddev_mel_level_db, which can be used to normalize the spectrogram at training time, as in the sketch below.
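The following is a minimal sketch of that normalization, assuming average_mel_level_db and stddev_mel_level_db are stored as numbers or per-channel lists in hparams.json and that the mel spectrogram is a (time, num_mels) array in dB; the helper names are our own, not part of this repository.

import json
import numpy as np

# Load the statistics produced by pre-processing.
with open("hparams.json") as f:
    hparams = json.load(f)

avg = np.asarray(hparams["average_mel_level_db"])
stddev = np.asarray(hparams["stddev_mel_level_db"])

def normalize_mel(mel):
    # mel: (time, num_mels) spectrogram in dB; broadcasting handles
    # both scalar and per-channel statistics.
    return (mel - avg) / stddev

def denormalize_mel(mel_normalized):
    # Inverse transform, e.g. before waveform inversion.
    return mel_normalized * stddev + avg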

Example configurations for VCTK and LJSpeech can be found in examples/vctk and examples/ljspeech.

For VCTK, after downloading the corpus, run the following commands. We recommend storing the source and target files separately; you can use the --source-only and --target-only options to do that.

bazel run preprocess_vctk -- --source-only --hparam-json-file=self-attention-tacotron/examples/vctk/self-attention-tacotron.json /path/to/VCTK-Corpus  /path/to/source/output/dir
bazel run preprocess_vctk -- --target-only --hparam-json-file=self-attention-tacotron/examples/vctk/self-attention-tacotron.json /path/to/VCTK-Corpus  /path/to/target/output/dir

For LJSpeech, run the following commands.

bazel run preprocess_ljspeech -- --source-only --hparam-json-file=self-attention-tacotron/examples/ljspeech/self-attention-tacotron.json /path/to/LJSpeech-1.1  /path/to/source/output/dir
bazel run preprocess_ljspeech -- --target-only --hparam-json-file=self-attention-tacotron/examples/ljspeech/self-attention-tacotron.json /path/to/LJSpeech-1.1  /path/to/target/output/dir

Training

The training script conducts both training and validation. Validation starts after a certain number of steps have passed; you can control when validation runs by setting save_checkpoints_steps. We do not support TensorFlow versions below 1.11, because their training and validation behavior differs.

The examples directory contains configurations for two models: Self-attention Tacotron and the baseline Tacotron. You can find the configuration files for each model in self-attention-tacotron.json and tacotron.json.

You can run training with the following command, shown here for Self-attention Tacotron on the VCTK dataset.

bazel run train -- --source-data-root=/path/to/source/output/dir --target-data-root=/path/to/target/output/dir --checkpoint-dir=/path/to/save/checkpoints --selected-list-dir=self-attention-tacotron/examples/vctk --hparam-json-file=self-attention-tacotron/examples/vctk/self-attention-tacotron.json
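To change how often validation runs, save_checkpoints_steps can be overridden on the command line, assuming train accepts the same --hparams flag used by the prediction commands below; the value 5000 here is an arbitrary example.

bazel run train -- --hparams=save_checkpoints_steps=5000 --source-data-root=/path/to/source/output/dir --target-data-root=/path/to/target/output/dir --checkpoint-dir=/path/to/save/checkpoints --selected-list-dir=self-attention-tacotron/examples/vctk --hparam-json-file=self-attention-tacotron/examples/vctk/self-attention-tacotron.json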

During the validation phase, predicted alignments and spectrograms are generated in the checkpoint directory.

You can monitor summaries such as loss values with TensorBoard. Check loss_with_teacher and mel_loss_with_teacher for validation metrics; the suffix _with_teacher means the value is calculated with teacher forcing. Since the alignment of the ground-truth and predicted spectrograms does not normally match, the metrics computed with teacher forcing are the reliable ones.
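For example, pointing TensorBoard at the checkpoint directory shows these summaries (the path is a placeholder):

tensorboard --logdir=/path/to/save/checkpoints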

Prediction

You can predict spectrograms with a trained model using the following command, shown here for the LJSpeech dataset.

bazel run predict_mel -- --source-data-root=/path/to/source/output/dir --target-data-root=/path/to/target/output/dir --checkpoint-dir=/path/to/save/checkpoints --output-dir=/path/to/output/results --selected-list-dir=self-attention-tacotron/examples/ljspeech --hparam-json-file=self-attention-tacotron/examples/ljspeech/self-attention-tacotron.json

Among the generated files you will find files with the .mfbsp extension. These files are compatible with @TonyWangX's WaveNet implementation. You can find instructions for waveform inversion with the WaveNet here.
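As a hedged sketch, such a file can be loaded as follows, assuming .mfbsp is a raw little-endian float32 array of shape (time, num_mels), matching the flat binary format used by the linked WaveNet tools; num_mels must agree with the hparams used at training time.

import numpy as np

num_mels = 80  # assumption: take this from your hparams.json
data = np.fromfile("sample.mfbsp", dtype="<f4")  # raw float32, little-endian
mel = data.reshape(-1, num_mels)  # (time, num_mels)
print(mel.shape)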

Forced alignment mode

Forced alignment mode calculates the alignment from the ground-truth spectrogram and uses it to predict the spectrogram.

You can enable forced alignment mode by specifying use_forced_alignment_mode=True in the hparams. The following example does so by adding --hparams=use_forced_alignment_mode=True to the prediction command.

bazel run predict_mel -- --source-data-root=/path/to/source/output/dir --target-data-root=/path/to/target/output/dir --checkpoint-dir=/path/to/save/checkpoints --output-dir=/path/to/output/results --selected-list-dir=self-attention-tacotron/examples/ljspeech --hparams=use_forced_alignment_mode=True --hparam-json-file=self-attention-tacotron/examples/ljspeech/self-attention-tacotron.json

Running tests

bazel test //:all --force_python=py3 

ToDo

  • Japanese example with accentual type labels
  • Vocoder parameter examples
  • WaveNet instruction

License

BSD 3-Clause License

Copyright (c) 2018, Yamagishi Laboratory, National Institute of Informatics. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  • Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
