Cannot train vocoder for single-speaker fine-tuning; vocoder_preprocess is not generating any mel GTAs #653
I wrote #437 and do not recommend finetuning the vocoder. The quality doesn't seem to improve. Instead, try using my 1159k vocoder model for slightly better quality: #126 (comment). But if you insist, here's an easy way to work around your problem. Copy …
Thank you so much for your help! I now know I have an option when deciding how to go about the vocoder. The 1159k model you've shared is better. I have one last question: will hardcoding all utterances to a single embedding reduce noise and artifacts? If so, how do I do that?
Probably not. The vocoder is responsible for most of the noise and artifacts. (Switch to Griffin-Lim and you'll get clean output, though distorted.)
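For reference, a rough sketch of how one could reuse a single embedding for every utterance and compare the vocoder output against Griffin-Lim, using the repo's inference modules the way demo_cli.py does. The model paths and the reference wav below are placeholders, not the repo's defaults:

```python
# Sketch only: compute one embedding from a clean reference clip and reuse it for
# every utterance, then render each spectrogram with both the neural vocoder and
# Griffin-Lim for comparison. All paths below are placeholders.
from pathlib import Path

import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/logs-singlespeaker"))  # your fine-tuned synthesizer
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))       # or the 1159k checkpoint

# One embedding, computed once from a clean reference clip, shared by every utterance.
ref_wav = encoder.preprocess_wav(Path("reference.wav"))
embed = encoder.embed_utterance(ref_wav)

texts = ["First test sentence.", "Second test sentence."]
specs = synthesizer.synthesize_spectrograms(texts, [embed] * len(texts))

for i, spec in enumerate(specs):
    wav_vocoder = vocoder.infer_waveform(spec)   # neural vocoder output
    wav_gl = Synthesizer.griffin_lim(spec)       # Griffin-Lim: cleaner but distorted
    sf.write(f"out_{i}_vocoder.wav", wav_vocoder, Synthesizer.sample_rate)
    sf.write(f"out_{i}_griffinlim.wav", wav_gl, Synthesizer.sample_rate)
```

Listening to the two outputs side by side makes it easy to hear how much of the noise comes from the vocoder rather than the embedding.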
I see, thank you. I notice in Audacity that several of my utterances contain an inhalation after some words or phrases. Some also have long periods of silence between phrases or words. To avoid awkward silences and other potential sources of noise and artifacts, should I clean these things up in the audio files before creating and training the datasets? Thank you!
The quality will improve somewhat if you clean your dataset for finetuning. But recognize that the base model is trained on imperfect data, and some of that will transfer. #364 (comment)
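A rough sketch of that kind of cleanup, assuming librosa and soundfile are installed. The top_db threshold and pause length are arbitrary starting points to tune by ear, and breath sounds may still need manual editing:

```python
# Sketch: trim surrounding silence and cap long internal pauses before preprocessing.
# This does not specifically remove breaths; those may need manual cleanup in Audacity.
import librosa
import numpy as np
import soundfile as sf

def clean_utterance(in_path, out_path, sr=16000, top_db=30, pause_s=0.3):
    wav, _ = librosa.load(in_path, sr=sr)
    intervals = librosa.effects.split(wav, top_db=top_db)  # non-silent regions
    if len(intervals) == 0:
        return
    pause = np.zeros(int(pause_s * sr), dtype=wav.dtype)
    pieces = []
    for start, end in intervals:
        pieces.extend([wav[start:end], pause])
    sf.write(out_path, np.concatenate(pieces[:-1]), sr)

clean_utterance("raw/utterance_001.wav", "cleaned/utterance_001.wav")
```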
Thank you. At the risk of straying slightly off topic, I have another question: the dataset I am currently using has about 450 utterances. I have more, but I am running into pesky out-of-memory (OOM) issues. A successful workaround was changing the training batch size from 36 to 32. I have far, far more utterances from the same speaker that I want to train the model on for single-speaker fine-tuning. Should I create another dataset, swap it out with the first one, and train the model on that? What other suggestions might you have, in the context of the OOM issue I have mentioned?
More data is better. There is no harm in leaving the batch size at 32. More utterances just means it takes more steps to complete an epoch. If your speech data is high quality and you have at least 8 hours, consider training your synthesizer and vocoder models from scratch for best results. Finetuning is a shortcut for those who don't have enough data, time, or audio quality to train a better model.
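To put rough numbers on "more steps per epoch" (batch size 32 as in this thread; the utterance counts are just examples):

```python
# With a fixed batch size, adding utterances lengthens each epoch but does not
# increase the memory used by a single training step.
import math

batch_size = 32  # reduced from 36 to avoid the OOM error
for num_utterances in (450, 2000, 10000):
    steps_per_epoch = math.ceil(num_utterances / batch_size)
    print(f"{num_utterances:>6} utterances -> {steps_per_epoch:>4} steps per epoch")
```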
I have gotten the avg_loss to 0.435 with a few thousand steps, but it is starting to just hover at that point. Have I hit the limit for refining this tool, or will adding more datasets from the same speaker improve results in other ways?
The model cannot perfectly predict spectrograms without overfitting, so the loss will converge to a nonzero number. In my experience it is around 0.4 for LibriSpeech. That 0.4 is the sum of the L2 loss on the decoder and post-net mels and the L1 loss on the decoder mels. You can see it in the code here: Real-Time-Voice-Cloning/synthesizer/train.py, lines 179 to 184 at commit 10ca8f7.
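As an illustration of that composition, here is a PyTorch-style sketch of the description above; it is not the repo's exact code, and the names are illustrative:

```python
# Sum of: L2 (MSE) on the decoder mels, L2 on the post-net mels, and L1 on the
# decoder mels. This is what the reported avg_loss converges towards.
import torch.nn.functional as F

def synthesizer_loss(mel_decoder, mel_postnet, mel_target):
    l2_decoder = F.mse_loss(mel_decoder, mel_target)
    l2_postnet = F.mse_loss(mel_postnet, mel_target)
    l1_decoder = F.l1_loss(mel_decoder, mel_target)
    return l2_decoder + l2_postnet + l1_decoder
```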
Results will get better with more data, less noise, consistent prosody, consistent volume, and consistent recording conditions. You can also experiment with adjusting the network size, but that will require training a model from scratch.
I know virtually nothing about programming and am trying to use the tool. I got the voice cloning to work, but I would like to go further and fine-tune a model based on #437. I'm only using a small number of steps just to familiarise myself with the process before committing to a larger set of utterances and training steps. I made a dataset of 50 utterances, trained the synthesizer for 50 steps as a test run, and tried it out in the toolbox. The positive difference compared to the base pretrained model is staggering and very noticeable!
I want to train the vocoder, but I am stuck on this step:
"Stop training once satisfied with the resulting model. At this point you can fine-tune your vocoder model. First generate the training data for the vocoder" So, I use the following commands:
python vocoder_preprocess.py synthesizer/saved_models/logs-singlespeaker/datasets_root/SV2TTS/synthesizer (This command cannot seem to find train.txt)
python vocoder_preprocess.py synthesizer/saved_models/logs-singlespeaker/datasets_root/ (This one works, but it does not seem to generate any mel gta files)
The command runs, but seems to generate no audio files or mel GTAs in the resulting vocoder folder. PowerShell says this:
Starting Synthesis
0it [00:00, ?it/s]
Synthesized mel spectrograms at synthesizer/saved_models/logs-singlespeaker/datasets_root/SV2TTS\vocoder\mels_gta
Unsurprisingly, I can't train the vocoder with vocoder_train.py because there are no audio or mel GTA files, or entries in synthesizer.txt.
What am I doing wrong? What should I do to make sure that the mel GTAs are generated?
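For reference, a minimal sanity check of the layout I assume vocoder_preprocess.py expects: the root argument should be the folder that contains SV2TTS/, with train.txt, mels/ and audio/ already written by the synthesizer preprocessing steps. This is a sketch under those assumptions, using the path from the commands above:

```python
# Quick check of the dataset layout before running vocoder_preprocess.py.
# If train.txt is missing or empty, the synthesis loop has nothing to do ("0it").
from pathlib import Path

datasets_root = Path("synthesizer/saved_models/logs-singlespeaker/datasets_root")
syn_dir = datasets_root / "SV2TTS" / "synthesizer"

for name in ("train.txt", "mels", "audio"):
    p = syn_dir / name
    print(p, "->", "found" if p.exists() else "MISSING")

train_txt = syn_dir / "train.txt"
if train_txt.exists():
    with train_txt.open(encoding="utf-8") as f:
        print("entries in train.txt:", sum(1 for _ in f))
```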
Thanks for this awesome project.