FAQ
- Check your dataset with the notebooks under dataset_analysis. Use these notebooks to find the right audio processing parameters. The best parameters are the ones that give the best Griffin-Lim (GL) synthesis.
- Write your own dataset formatter in datasets/preprocess.py or format your dataset like one of the supported datasets, e.g. LJSpeech (see the formatter sketch after this list).
  - The preprocessor parses the metadata file and returns a list of training samples.
- If your dataset uses an alphabet other than the English Latin alphabet, you need to add your alphabet to `utils.text.symbols`.
  - If you use phonemes for training and your language is supported, you don't need to do that.
- Write your own text cleaner in `utils.text.cleaners`. It is not always necessary; you mainly need it if you have a different alphabet or language-specific requirements (see the cleaner sketch after this list).
  - This step is used to expand numbers and abbreviations and to normalize the text.
- Set up config.json for your dataset. Go over each parameter one by one and set it according to its commented explanation.
  - `sample_rate`, `phoneme_language` (if phonemes are enabled), `output_path`, `datasets` and `text_cleaner` are the fields you need to edit in most cases (a quick config-checking snippet follows this list).
- Write your test sentences in a txt file, one sentence per line, and point `test_sentences_file` in config.json to it.
- Train your model.
  - Single-GPU training: `python train.py --config_path config.json`
  - Multi-GPU training: `CUDA_VISIBLE_DEVICES="0,1,2" python distribute.py --config_path config.json`
    - This command uses all the GPUs given in `CUDA_VISIBLE_DEVICES`. If you don't specify it, all available GPUs are used.
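As a rough illustration of the formatter step above, the sketch below shows what a custom entry in datasets/preprocess.py might look like. The metadata file layout, the `|` separator and the `[text, wav_path]` item format are assumptions of this sketch, so copy the exact return convention from an existing formatter such as the LJSpeech one.

```python
import os


def my_dataset(root_path, meta_file):
    """Hypothetical formatter: parse a metadata file and return training samples.

    Assumed layout:
        <root_path>/metadata.csv   with lines like   audio_001|Hello world.
        <root_path>/wavs/audio_001.wav
    """
    items = []
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        for line in f:
            cols = line.strip().split("|")
            if len(cols) < 2:
                continue  # skip malformed lines
            wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")
            text = cols[1]
            items.append([text, wav_file])  # one training sample
    return items
```

The formatter is then referred to from the dataset entry in config.json; check how the supported datasets are wired up for the exact field names your version expects.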
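For the text cleaner step, a cleaner in `utils.text.cleaners` is essentially a function that takes a string and returns the cleaned string. The helpers below are written from scratch for illustration and are not necessarily the ones the module already provides, so reuse the existing English helpers where they fit.

```python
import re

# Hypothetical abbreviation table, purely for illustration.
_abbreviations = [
    (re.compile(r"\bdr\.", re.IGNORECASE), "doctor"),
    (re.compile(r"\bmr\.", re.IGNORECASE), "mister"),
    (re.compile(r"\bmrs\.", re.IGNORECASE), "misess"),
]

_whitespace_re = re.compile(r"\s+")


def my_cleaner(text):
    """Lowercase the text, expand a few abbreviations and collapse whitespace."""
    text = text.lower()
    for pattern, replacement in _abbreviations:
        text = pattern.sub(replacement, text)
    return _whitespace_re.sub(" ", text).strip()
```

Set `text_cleaner` in config.json to the name of your cleaner function.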
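Before launching a run, it can help to eyeball the config.json fields mentioned in the config step. The snippet below is only a convenience check written for this FAQ, not part of the repository, and it assumes the field names listed above; in some config versions a field such as sample_rate is nested under the audio block.

```python
import json

with open("config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Fields that usually need editing for a new dataset (names taken from the step above).
for key in ("output_path", "datasets", "text_cleaner",
            "phoneme_language", "test_sentences_file"):
    print(key, "=", config.get(key, "<missing>"))

# sample_rate may live at the top level or under the "audio" block, depending on the version.
audio = config.get("audio", {})
print("sample_rate =", audio.get("sample_rate", config.get("sample_rate", "<missing>")))
```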
- Check steps 2, 3, 4, 5 above.
- Check step 5 above.
How can I check model performance?

- You can inspect model training and performance using TensorBoard. It shows loss values, attention alignments and model outputs. Go through the checks below, in order, to verify the model.
  - Check the ground truth spectrograms. If they do not look as they are supposed to, check the audio processing parameters set in config.json.
  - Check the train and eval loss values and make sure that they all decrease smoothly over time.
  - Check the model spectrograms. In particular, the training outputs should converge to the ground truth after 10K iterations.
  - Your model will not work at test time until the attention has a near-diagonal alignment. This is the sublime art of TTS training.
    - Attention should converge to a diagonal alignment after 50K iterations (see the rough diagonality check after this list).
    - If attention does not converge, the likely causes are:
      - Your dataset is too noisy or small.
      - Samples are too long.
      - Batch size is too small (with batch_size < 32, the model has a hard time converging).
    - You can also try other attention algorithms like 'graves', 'bidirectional_decoder' or 'forward_attn'.
      - 'bidirectional_decoder' is your ultimate savior, but it trains 2x slower and demands 1.5x more GPU memory.
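TensorBoard's alignment plots are the primary way to judge the attention, but as a rough, repo-independent sanity check you can score how closely an alignment matrix follows the diagonal. The function below is a sketch written for this FAQ, not part of the repository; the shape convention and any threshold you choose are assumptions.

```python
import numpy as np


def diagonality_score(alignment):
    """Rough score in [0, 1] for how closely attention follows the diagonal.

    alignment: array of shape (decoder_steps, encoder_steps), each row summing to ~1.
    """
    decoder_steps, encoder_steps = alignment.shape
    peaks = alignment.argmax(axis=1)                           # attended encoder index per decoder step
    ideal = np.linspace(0, encoder_steps - 1, decoder_steps)   # a perfect diagonal path
    deviation = np.abs(peaks - ideal) / encoder_steps          # normalized distance from that path
    return float(1.0 - deviation.mean())


# A perfectly diagonal 4x4 alignment scores 1.0; scattered attention scores much lower.
print(diagonality_score(np.eye(4)))
```

A score that climbs toward 1.0 over training is the behaviour you want to see alongside the alignment images.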
- Go over the steps under "How can I check model performance?"
- Check the 4th step under "How can I check model performance?"
- Train Tacotron first. It is smaller and faster to train. If it performs poorly, try Tacotron2.
- The best way is to use the Benchmark notebooks.
- You can try `synthesize.py`.
- You can try our demo server. It is limited to demo purposes only.