Cannot train vocoder for single-speaker fine-tuning; vocoder_preprocess is not generating any mel GTAs #653
I wrote #437 and do not recommend finetuning the vocoder. The quality doesn't seem to improve. Instead, try using my 1159k vocoder model for slightly better quality: #126 (comment). But if you insist, here's an easy way to work around your problem. Copy …
Thank you so much for your help! I now know I have an option when deciding how to go about the vocoder. The 1159k model you've shared is better. I have one last question: will hardcoding all utterances to a single embedding reduce noise and artifacts? If so, how do I do that?
Probably not. The vocoder is responsible for most of the noise and artifacts. (Switch to Griffin-Lim and you'll get clean output, though distorted.)
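For reference, a rough sketch of how one could reuse a single embedding for every utterance and compare the vocoder output against Griffin-Lim, using the repo's inference modules the way demo_cli.py does. The model paths and the reference wav below are placeholders, not the repo's defaults:

```python
# Sketch only: compute one embedding from a clean reference clip and reuse it for
# every utterance, then render each spectrogram with both the neural vocoder and
# Griffin-Lim for comparison. All paths below are placeholders.
from pathlib import Path

import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/logs-singlespeaker"))  # your fine-tuned synthesizer
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))       # or the 1159k checkpoint

# One embedding, computed once from a clean reference clip, shared by every utterance.
ref_wav = encoder.preprocess_wav(Path("reference.wav"))
embed = encoder.embed_utterance(ref_wav)

texts = ["First test sentence.", "Second test sentence."]
specs = synthesizer.synthesize_spectrograms(texts, [embed] * len(texts))

for i, spec in enumerate(specs):
    wav_vocoder = vocoder.infer_waveform(spec)   # neural vocoder output
    wav_gl = Synthesizer.griffin_lim(spec)       # Griffin-Lim: cleaner but distorted
    sf.write(f"out_{i}_vocoder.wav", wav_vocoder, Synthesizer.sample_rate)
    sf.write(f"out_{i}_griffinlim.wav", wav_gl, Synthesizer.sample_rate)
```

Listening to the two outputs side by side makes it easy to hear how much of the noise comes from the vocoder rather than the embedding.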
I see, thank you. I notice in Audacity that several of my utterances contain an inhalation after some words or phrases. Some also have long periods of silence between phrases or words. To avoid awkward silences and other potential sources of noise and artifacts, should I clean these things up in the audio files before creating and training the datasets? Thank you!
The quality will improve somewhat if you clean your dataset for finetuning. But recognize that the base model is trained on imperfect data, and some of that will transfer. #364 (comment)
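A rough sketch of that kind of cleanup, assuming librosa and soundfile are installed. The top_db threshold and pause length are arbitrary starting points to tune by ear, and breath sounds may still need manual editing:

```python
# Sketch: trim surrounding silence and cap long internal pauses before preprocessing.
# This does not specifically remove breaths; those may need manual cleanup in Audacity.
import librosa
import numpy as np
import soundfile as sf

def clean_utterance(in_path, out_path, sr=16000, top_db=30, pause_s=0.3):
    wav, _ = librosa.load(in_path, sr=sr)
    intervals = librosa.effects.split(wav, top_db=top_db)  # non-silent regions
    if len(intervals) == 0:
        return
    pause = np.zeros(int(pause_s * sr), dtype=wav.dtype)
    pieces = []
    for start, end in intervals:
        pieces.extend([wav[start:end], pause])
    sf.write(out_path, np.concatenate(pieces[:-1]), sr)

clean_utterance("raw/utterance_001.wav", "cleaned/utterance_001.wav")
```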
Thank you. At the risk of straying slightly off topic, I have another question: the dataset I am currently using has about 450 utterances. I have more, but I am running into pesky out-of-memory (OOM) issues. A successful workaround was changing the training batch size from 36 to 32. I have far, far more utterances from the same speaker that I want to train the model on for single-speaker fine-tuning. Should I create another dataset, swap it out with the first one, and train the model on that? What other suggestions might you have, in the context of the OOM issue I have mentioned?
More data is better. There is no harm in leaving the batch size at 32. More utterances just means it takes more steps to complete an epoch. If your speech data is high quality and you have at least 8 hours, consider training your synthesizer and vocoder models from scratch for best results. Finetuning is a shortcut for those who don't have enough data, time, or audio quality to train a better model.
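To put rough numbers on "more steps per epoch" (batch size 32 as in this thread; the utterance counts are just examples):

```python
# With a fixed batch size, adding utterances lengthens each epoch but does not
# increase the memory used by a single training step.
import math

batch_size = 32  # reduced from 36 to avoid the OOM error
for num_utterances in (450, 2000, 10000):
    steps_per_epoch = math.ceil(num_utterances / batch_size)
    print(f"{num_utterances:>6} utterances -> {steps_per_epoch:>4} steps per epoch")
```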
I have gotten the avg_loss to 0.435 with a few thousand steps, but it is starting to just hover at that point. Have I hit the limit for refining this tool, or will adding more datasets from the same speaker improve results in other ways?
The model cannot perfectly predict spectrograms without overfitting, so the loss will converge to a nonzero number. In my experience it is around 0.4 for LibriSpeech. That 0.4 is the sum of the L2 loss on the decoder and post-net mels and the L1 loss on the decoder mels. You can see it in the code here: Real-Time-Voice-Cloning/synthesizer/train.py, lines 179 to 184 at commit 10ca8f7.
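As an illustration of that composition, here is a PyTorch-style sketch of the description above; it is not the repo's exact code, and the names are illustrative:

```python
# Sum of: L2 (MSE) on the decoder mels, L2 on the post-net mels, and L1 on the
# decoder mels. This is what the reported avg_loss converges towards.
import torch.nn.functional as F

def synthesizer_loss(mel_decoder, mel_postnet, mel_target):
    l2_decoder = F.mse_loss(mel_decoder, mel_target)
    l2_postnet = F.mse_loss(mel_postnet, mel_target)
    l1_decoder = F.l1_loss(mel_decoder, mel_target)
    return l2_decoder + l2_postnet + l1_decoder
```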
Results will get better with more data, less noise, consistent prosody, consistent volume, and consistent recording conditions. You can also experiment with adjusting the network size, but that will require training a model from scratch.
I know virtually nothing about programming and am trying to use the tool. I got the voice cloning to work, but I would like to go further and fine-tune a model based on #437. I'm only using a small number of steps just to familiarise myself with the process before committing to a larger set of utterances and training steps. I made a dataset of 50 utterances, trained the synthesizer for 50 steps as a test run, and tried it out in the toolbox. The positive difference compared to the base pretrained model is staggering and very noticeable!
I want to train the vocoder, but I am stuck on this step:
"Stop training once satisfied with the resulting model. At this point you can fine-tune your vocoder model. First generate the training data for the vocoder" So, I use the following commands:
python vocoder_preprocess.py synthesizer/saved_models/logs-singlespeaker/datasets_root/SV2TTS/synthesizer (This command cannot seem to find train.txt)
python vocoder_preprocess.py synthesizer/saved_models/logs-singlespeaker/datasets_root/ (This one works, but it does not seem to generate any mel gta files)
The command runs, but seems to generate no audio files or mel GTAs in the resulting vocoder folder. PowerShell says this:
Starting Synthesis
0it [00:00, ?it/s]
Synthesized mel spectrograms at synthesizer/saved_models/logs-singlespeaker/datasets_root/SV2TTS\vocoder\mels_gta
Unsurprisingly, I can't train the vocoder with vocoder_train.py because there are no audio or mel GTA files, or entries in synthesizer.txt.
What am I doing wrong? What should I do to make sure that the mel GTAs are generated?
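For reference, a minimal sanity check of the layout I assume vocoder_preprocess.py expects: the root argument should be the folder that contains SV2TTS/, with train.txt, mels/ and audio/ already written by the synthesizer preprocessing steps. This is a sketch under those assumptions, using the path from the commands above:

```python
# Quick check of the dataset layout before running vocoder_preprocess.py.
# If train.txt is missing or empty, the synthesis loop has nothing to do ("0it").
from pathlib import Path

datasets_root = Path("synthesizer/saved_models/logs-singlespeaker/datasets_root")
syn_dir = datasets_root / "SV2TTS" / "synthesizer"

for name in ("train.txt", "mels", "audio"):
    p = syn_dir / name
    print(p, "->", "found" if p.exists() else "MISSING")

train_txt = syn_dir / "train.txt"
if train_txt.exists():
    with train_txt.open(encoding="utf-8") as f:
        print("entries in train.txt:", sum(1 for _ in f))
```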
Thanks for this awesome project.