
What are the requirements of a dataset for a good TTS model?

How can I train my own model?

  1. Check your dataset with the notebooks under dataset_analysis. Use these notebooks to find the right audio processing parameters; the best parameters are the ones that yield the best Griffin-Lim (GL) synthesis.
  2. Write your own dataset formatter in datasets/preprocess.py, or format your dataset like one of the supported datasets, such as LJSpeech (see the formatter sketch after this list).
    • The preprocessor parses the metadata file and converts it into a list of training samples.
  3. If your dataset uses an alphabet other than English Latin, you need to add your alphabet in utils.text.symbols (see the symbols sketch after this list).
    • If you use phonemes for training and your language is supported here, you don't need to do that.
  4. Write your own text cleaner in utils.text.cleaners (see the cleaner sketch after this list). It is not necessary unless you have a different alphabet or language-specific requirements.
    • This step expands numbers and abbreviations and normalizes the text.
  5. Set up config.json for your dataset. Go over each parameter one by one and set it according to its commented explanation (see the config sketch after this list).
    • 'sample_rate', 'phoneme_language' (if phonemes are enabled), 'output_path', 'datasets', and 'text_cleaner' are the fields you need to edit in most cases.
  6. Write your test sentences in a txt file, one sentence per line, and set its path in the test_sentences_file field of config.json.
  7. Train your model.
    • Single-GPU training: python train.py --config_path config.json
    • Multi-GPU training: CUDA_VISIBLE_DEVICES="0,1,2" python distribute.py --config_path config.json
      • This command uses all the GPUs listed in CUDA_VISIBLE_DEVICES. If you don't set it, all available GPUs are used.
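
A minimal formatter sketch for step 2, modeled on an LJSpeech-style pipe-separated metadata.csv. The function name my_dataset and the column indices are illustrative assumptions; match the return format against the existing formatters in datasets/preprocess.py:

    import os

    def my_dataset(root_path, meta_file):
        """Parse a pipe-separated metadata file into [text, wav_path, speaker] items."""
        items = []
        speaker_name = "my_dataset"
        with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
            for line in f:
                cols = line.strip().split("|")
                wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")
                items.append([cols[1], wav_file, speaker_name])
        return items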
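
For step 3, extending the alphabet usually means adding your language's characters to the character set in utils.text.symbols. A sketch, using Turkish letters as a purely illustrative example; the _characters variable follows the structure used in the module, so verify it against your checkout:

    # utils/text/symbols.py -- the symbol list is built from this character set;
    # append your language's letters to it.
    _characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzçğıöşüÇĞİÖŞÜ!'(),-.:;? "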
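
For step 4, a cleaner is a plain string-to-string function in utils.text.cleaners. A minimal sketch, assuming the same call convention as the cleaners already in that module; the abbreviation list is hypothetical:

    import re

    # (abbreviation, expansion) pairs -- extend for your language.
    _abbreviations = [(re.compile(r"\b%s\." % abbr, re.IGNORECASE), expansion)
                      for abbr, expansion in [("dr", "doctor"), ("mr", "mister")]]

    def my_language_cleaners(text):
        """Lowercase, expand abbreviations, and collapse whitespace."""
        text = text.lower()
        for pattern, expansion in _abbreviations:
            text = pattern.sub(expansion, text)
        return re.sub(r"\s+", " ", text).strip()

Set "text_cleaner": "my_language_cleaners" in config.json to use it.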
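
For step 5, the fields called out above typically look like this in config.json. All values are illustrative, and the field names should be checked against the sample config shipped with your version of the repo:

    {
      "sample_rate": 22050,
      "phoneme_language": "en-us",
      "output_path": "/path/to/experiment/outputs/",
      "text_cleaner": "phoneme_cleaners",
      "datasets": [
        {
          "name": "ljspeech",
          "path": "/path/to/LJSpeech-1.1/",
          "meta_file_train": "metadata.csv",
          "meta_file_val": null
        }
      ]
    }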

How can I train on a different language?

  • Check steps 2, 3, 4, 5 above.

How can I train on multiple GPUs?

  • Check step 7 above.

How can I check model performance?

  • You can inspect model training and performance using tensorboard (see the command after this list). It shows loss values, attention alignments, and model outputs. Go through the checks below, in order, to verify the model.
  1. Check the ground-truth spectrograms. If they do not look as they are supposed to, check the audio processing parameters set in config.json.
  2. Check the train and eval loss values and make sure they both decrease smoothly over time.
  3. Check the model spectrograms. Training outputs in particular should converge to the ground truth after 10K iterations.
  4. Your model will not work at test time until the attention has a near-diagonal alignment. This is the sublime art of TTS training.
    • Attention should converge diagonally after 50K iterations.
    • If attention does not converge, the likely causes are:
      • Your dataset is too noisy or small.
      • Samples are too long.
      • Batch size is too small (with batch_size < 32, the model tends to have a hard time converging).
    • You can also try other attention algorithms, such as 'graves', 'bidirectional_decoder', or 'forward_attn'.
      • 'bidirectional_decoder' is your ultimate savior, but it trains 2x slower and demands 1.5x more GPU memory.
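
To open the dashboard mentioned above, point tensorboard at the training output folder, i.e. the directory set as output_path in config.json (the path below is a placeholder):

    tensorboard --logdir /path/to/output_path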

My model does not learn. How can I debug?

  • Go over the steps under "How can I check model performance?"

Attention does not align. How can I make it work?

  • Check the 4th step under "How can I check model performance?"

How should I choose which model to use?

  • Train Tacotron first. It is smaller and faster to train. If it performs poorly, try Tacotron2.

How can I test a trained model?

  • The best way is to use Benchmark notebooks.
  • You can try synthesize.py.
  • You can try our demo server, but it is intended for demo purposes only.