Eren Gölge edited this page Jan 15, 2020 · 9 revisions

What are the requirements of a dataset for a good TTS model?

How can I train my own model?

  1. Check your dataset with the notebooks under dataset_analysis. Use them to find the right audio processing parameters. The best parameters are the ones that give the best Griffin-Lim (GL) synthesis.
  2. Write your own dataset formatter in datasets/preprocess.py or format your dataset as one of the supported datasets like LJSpeech.
    • The preprocessor parses the metadata file and returns a list of training samples.
  3. If you have a dataset with a different alphabet than English Latin, you need to add your alphabet in utils.text.symbols.
    • If you use phonemes for training and your language is supported here, you don't need to do that.
  4. Write your own text cleaner in utils.text.cleaners. This is not always necessary, unless your dataset has a different alphabet or language-specific requirements.
    • This step expands numbers and abbreviations and normalizes the text.
  5. Set up config.json for your dataset. Go over each parameter one by one and adjust it according to its commented explanation.
    • 'sample_rate', 'phoneme_language' (if phoneme enabled), 'output_path', 'datasets', 'text_cleaner' are the fields you need to edit in most of the cases.
  6. Write your test sentences in a txt file, one sentence per line, and set the file in the config.json field test_sentences_file.
  7. Train your model.
    • SingleGPU training: python train.py --config_path config.json
    • MultiGPU training: CUDA_VISIBLE_DEVICES="0,1,2" python distribute.py --config_path config.json
      • This command uses all the GPUs listed in CUDA_VISIBLE_DEVICES. If you don't set it, all available GPUs are used.
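Step 2 above can be sketched as a minimal custom formatter. The function name, metadata layout ("wav_id|transcript" per line) and the returned sample shape are assumptions for illustration; match them to your dataset and to the signature expected in datasets/preprocess.py:

```python
import os

def my_dataset(root_path, meta_file):
    """Hypothetical formatter: parse a 'wav_id|transcript' metadata file
    and return a list of [text, wav_path] training samples."""
    samples = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            wav_id, text = line.split("|", 1)
            wav_path = os.path.join(root_path, "wavs", wav_id + ".wav")
            samples.append([text, wav_path])
    return samples
```

Alternatively, rename your files and metadata to the LJSpeech layout and reuse the existing LJSpeech formatter unchanged.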

How can I train on a different language?

  • Check steps 2, 3, 4, 5 above.
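For step 3, extending the symbol set for a non-English alphabet might look like the sketch below. The variable names and the exact contents of utils.text.symbols are assumptions; adapt them to what the module actually defines:

```python
# Hypothetical sketch of extending the symbol set in utils.text.symbols.
_pad = "_"
_punctuation = "!'(),-.:;? "
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
_extra = "äöüßÄÖÜ"  # example: German umlauts; replace with your alphabet

symbols = [_pad] + list(_punctuation) + list(_letters) + list(_extra)

# Every character produced by your text cleaner must appear in `symbols`,
# otherwise it cannot be mapped to a model input id.
```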

How can I train multi-GPUs?

  • Check step 7 above.

How can I check model performance?

  • You can inspect model training and performance using Tensorboard. It shows loss values, attention alignments and model outputs. Follow the order below to verify the model.
  1. Check the ground truth spectrograms. If they do not look as they are supposed to, check the audio processing parameters set in config.json.
  2. Check train and eval loss values and make sure that they all decrease smoothly in time.
  3. Check the model spectrograms. Training outputs especially should converge to the ground truth after 10K iterations.
  4. Your model will not work at test time until the attention has a near-diagonal alignment. This is the sublime art of TTS training.
    • Attention should converge diagonally after 50K iterations.
    • If attention does not converge, the likely causes are:
      • Your dataset is too noisy or small.
      • Samples are too long.
      • Batch size is too small (with batch_size < 32 the model has a hard time converging).
    • You can also try other attention algorithms like 'graves', 'bidirectional_decoder', 'forward_attn'.
      • 'bidirectional_decoder' is your ultimate savior but it trains 2x slower and demands 1.5x more GPU memory.
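As an illustration of what "near diagonal alignment" means, an attention matrix can be scored by how far its per-step peak deviates from the diagonal. This is a rough standalone heuristic, not a function from the repo:

```python
import numpy as np

def alignment_diagonality(attn):
    """Score an alignment matrix of shape (decoder_steps, encoder_steps).
    Returns the mean absolute deviation of each decoder step's attention
    peak from the ideal diagonal, normalized by the encoder length.
    Lower is better; a well-aligned model scores close to 0."""
    dec_len, enc_len = attn.shape
    peaks = attn.argmax(axis=1)                   # attended encoder index per step
    ideal = np.linspace(0, enc_len - 1, dec_len)  # perfect diagonal
    return float(np.abs(peaks - ideal).mean() / enc_len)

# A perfectly diagonal alignment scores 0.
print(alignment_diagonality(np.eye(5)))  # → 0.0
```

Visually inspecting the alignment plots in Tensorboard serves the same purpose; the score is only useful if you want to track alignment quality programmatically.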

My model does not learn. How can I debug?

  • Go over the steps under "How can I check model performance?"

Attention does not align. How can I make it work?

  • Check the 4th step under "How can I check model performance?"

How should I choose which model to use?

  • Train Tacotron first. It is smaller and faster to train. If it performs poorly, try Tacotron2.

How can I test a trained model?

  • The best way is to use Benchmark notebooks.
  • You can try synthesize.py.
  • You can try our demo server, but it is intended for demo purposes only.

I downloaded a pre-trained model and it does not work due to a bunch of errors. What should I do?

  • Make sure you use the right commit of TTS. Each pre-trained model has a corresponding version that needs to be used; it is listed in the model table.
  • If it is still problematic, post your problem on https://discourse.mozilla.org/c/tts/285 . Give as many details as possible (error message, your TTS version, your TTS model and config.json, etc.).
  • If you think it is a bug to be fixed, open a GitHub issue with the same level of detail.

How does healthy training look on Tensorboard?

  • Check this issue to see an example of TB output.

My model does not stop - I see "Decoder stopped with max_decoder_steps" - the stopnet does not work.

  • In general, all of the above relates to the stopnet, the part of the network that tells it when to stop inference.
  • A poor stopnet usually points to something else broken in your model, especially the attention module. So it is better to debug your model using the notebooks and the Tensorboard outputs.
  • You can also play with the stopnet loss weight set in train.py, which tries to balance out the imbalance of stopnet labels. But the default value should generally work fine.
  • Another option is to use the attention weights to decide where to stop. If your attention weights look good, you can check whether they have reached the end of the sentence and stop the inference there.
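The last bullet could be sketched as a hypothetical helper, assuming the decoder exposes its per-step attention distribution over encoder (text) positions; the function name and the margin parameter are illustrative:

```python
import numpy as np

def should_stop(attn_weights, margin=2):
    """Hypothetical stopping criterion: stop decoding once the attention
    peak is within `margin` positions of the last encoder (text) index.
    `attn_weights` is one decoder step's distribution over encoder steps."""
    peak = int(np.argmax(attn_weights))
    last_index = len(attn_weights) - 1
    return peak >= last_index - margin
```

In practice you would require the condition to hold for a few consecutive decoder steps before ending inference, to avoid stopping on a transient attention spike.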