
Cannot train vocoder for single-speaker fine tuning; vocoder_preprocess is not generating any mel gtas #653

Closed
StElysse opened this issue Feb 10, 2021 · 9 comments



StElysse commented Feb 10, 2021

I know virtually nothing about programming and am trying to use the tool. I got the voice cloning to work, but I would like to go further and fine-tune a model based on #437. I'm only using a small number of steps just to familiarise myself with the process before committing to a larger set of utterances and training steps. I made a dataset of 50 utterances, trained the synthesizer to 50 steps as a test run, and tried it out in the toolbox. The positive difference compared to the base pretrained model is staggering and very noticeable!

I want to train the vocoder, but I am stuck on this step:

"Stop training once satisfied with the resulting model. At this point you can fine-tune your vocoder model. First generate the training data for the vocoder" So, I use the following commands:

python vocoder_preprocess.py synthesizer/saved_models/logs-singlespeaker/datasets_root/SV2TTS/synthesizer (This command cannot seem to find train.txt)
python vocoder_preprocess.py synthesizer/saved_models/logs-singlespeaker/datasets_root/ (This one works, but it does not seem to generate any mel gta files)

The command runs, but seems to generate no audios or GTA mels in the resulting vocoder folder. PowerShell says this:

Starting Synthesis
0it [00:00, ?it/s]
Synthesized mel spectrograms at synthesizer/saved_models/logs-singlespeaker/datasets_root/SV2TTS\vocoder\mels_gta

Unsurprisingly, I can't train the vocoder with vocoder_train.py, because there are no audios, no GTA mel files, and no entries in synthesized.txt.

What am I doing wrong? What should I do to make sure that the GTA mels are generated?

Thanks for this awesome project.


ghost commented Feb 10, 2021

I wrote #437 and do not recommend finetuning the vocoder. The quality doesn't seem to improve. Instead, try using my 1159k vocoder model for slightly better quality. #126 (comment)

But if you insist, here's an easy way to work around your problem. Copy SV2TTS/synthesizer/train.txt to SV2TTS/vocoder/synthesized.txt. Also copy the SV2TTS/synthesizer/mels folder to SV2TTS/vocoder/mels_gta. This copies your ground truth (GT) mels to the area where the vocoder expects ground truth-aligned (GTA) mels.
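
In script form, that workaround looks like this (a minimal sketch; it assumes your SV2TTS folder sits under datasets_root as in the standard layout, so adjust the path to your setup):

import shutil
from pathlib import Path

# Adjust to wherever your SV2TTS folder actually lives.
sv2tts = Path("datasets_root/SV2TTS")

# The vocoder training script looks for synthesized.txt and a mels_gta folder.
(sv2tts / "vocoder").mkdir(parents=True, exist_ok=True)
shutil.copy(sv2tts / "synthesizer" / "train.txt",
            sv2tts / "vocoder" / "synthesized.txt")
shutil.copytree(sv2tts / "synthesizer" / "mels",
                sv2tts / "vocoder" / "mels_gta")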


StElysse commented Feb 10, 2021

Thank you so much for your help! I now know what my options are for the vocoder. The 1159k model you've shared is better.

I have one last question: Will hardcoding all utterances to a single embedding reduce noise and artifacts? If so, how do I do that?


ghost commented Feb 11, 2021

Probably not. The vocoder is responsible for most of the noise and artifacts. (Switch to Griffin-Lim and you'll get clean output, though distorted.)
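
If you want to hear what Griffin-Lim sounds like outside the toolbox, here is a standalone round-trip sketch using librosa (the sample rate, n_fft, and hop_length below are illustrative assumptions, not the project's hparams):

import librosa
import soundfile as sf

# Round trip: waveform -> mel spectrogram -> Griffin-Lim inversion -> waveform.
wav, sr = librosa.load(librosa.example("trumpet"), sr=16000)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=800, hop_length=200)
recon = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=800, hop_length=200)
sf.write("griffin_lim.wav", recon, sr)  # clean but audibly distorted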


StElysse commented Feb 12, 2021

I see, thank you.

I notice in Audacity that several of my utterances contain audible inhalation after some words or phrases. Some also have long periods of silence between phrases or words.

To avoid awkward silences and other potential sources of noise and artifacts, should I clean these up in the audio files before creating the dataset and training on it? Thank you!


ghost commented Feb 13, 2021

The quality will improve somewhat if you clean your dataset for finetuning. But recognize that the base model is trained on imperfect data, and some of that will transfer. #364 (comment)
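
If you do clean the files programmatically, energy-based trimming gets you most of the way (a sketch, not part of the toolbox; the file names, threshold, and gap length are assumptions to tune by ear):

import librosa
import numpy as np
import soundfile as sf

# Shorten long pauses: keep voiced regions, rejoin them with a short fixed gap.
# Quiet breaths may still survive an energy threshold and need manual editing.
wav, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
intervals = librosa.effects.split(wav, top_db=30)  # non-silent (start, end) pairs
gap = np.zeros(int(0.15 * sr))                     # 150 ms between regions
pieces = []
for start, end in intervals:
    pieces.extend([wav[start:end], gap])
sf.write("utterance_trimmed.wav", np.concatenate(pieces[:-1]), sr)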

StElysse commented

Thank you.

At the risk of straying slightly off topic, I have another question. The dataset I am currently using has about 450 utterances. I have more, but I was running into pesky out-of-memory issues; a successful workaround was reducing the training batch size from 36 to 32.

I have far, far more utterances from the same speaker that I want to train the model on for single-speaker fine-tuning. Should I create another dataset, swap it out with the first one, and train the model on that? What other suggestions might you have, given the OOM issue I mentioned?


ghost commented Feb 13, 2021

More data is better, and there is no harm in leaving the batch size at 32. More utterances just means it takes more steps to complete an epoch.

If your speech data is high quality and you have at least 8 hours, consider training your synthesizer and vocoder models from scratch for best results. Finetuning is a shortcut for those who don't have enough data, time, or audio quality to train a better model.
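
For a sense of scale, the epoch arithmetic is simple (the counts below are just the numbers from this thread):

import math

# One epoch = one pass over every utterance; batch size fixes the step count.
num_utterances = 450   # the dataset size mentioned above
batch_size = 32
print(math.ceil(num_utterances / batch_size))  # 15 steps per epoch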

StElysse commented

I have gotten the avg_loss to 0.435 with a few thousand steps, but it is starting to just hover at that point. Have I hit the limit for refining this tool, or will adding more datasets from the same speaker improve results in other ways?


ghost commented Feb 17, 2021

The model cannot perfectly predict spectrograms without overfitting, so the loss will converge to a nonzero number. In my experience it is around 0.4 for LibriSpeech.

That 0.4 is the sum of the L2 losses on the decoder and postnet mels and the L1 loss on the decoder mels. You can see it in the code here:

# Loss terms in the synthesizer training loop
m1_loss = F.mse_loss(m1_hat, mels) + F.l1_loss(m1_hat, mels)  # decoder mels: L2 + L1
m2_loss = F.mse_loss(m2_hat, mels)                            # postnet mels: L2
stop_loss = F.binary_cross_entropy(stop_pred, stop)           # stop-token prediction
loss = m1_loss + m2_loss + stop_loss

Results will get better with: more data + less noise + consistent prosody + consistent volume + consistent recording conditions

You can also experiment with adjusting the network size but that will require training a model from scratch.

ghost closed this as completed Feb 17, 2021