
Training 2-3 models, suggestions? #1157

Open
prakharpbuf opened this issue Jan 18, 2023 · 1 comment

Comments


prakharpbuf commented Jan 18, 2023

Hi,
Great work with Real Time Voice Cloning!

I already have some experience training the models. I fine-tuned the model on one of the speakers from LibriSpeech dev-clean and got a noticeable improvement in output quality.
Now I'm going to train two or three models:
1. Everything from scratch using the LibriTTS dataset.
I know blue-fish (now @ghost) and @mbdash tried to train a model on LibriTTS in #449, but the output did not improve. They were still trying when they moved the discussion to a Slack channel, so I don't know what the end result was. If anyone knows what happened after they trained the new encoder and everything, and can share the result (or even better, the model), it would be much appreciated!
After this model is trained, I might also fine-tune it for my voice (same as point 2 in this post).
2. Fine-tune the pretrained model on 1 hour of my own voice.
blue-fish noted in #437 that fine-tuning on your own voice with 0.2 hr of data for a few thousand steps improves the output quality for your voice. But I wonder what would happen if, instead of 0.2 hr, I used a whole hour (maybe more) and trained for more than just a few thousand steps, maybe on the order of tens of thousands. (See the sketch after this list for what I have in mind.)
3. Maybe also take the pretrained model and train it for an additional 100-200k steps using more data from Mozilla Common Voice or something else.
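
To make point 2 concrete, here is a minimal, self-contained PyTorch sketch of the fine-tuning idea: resume from a checkpoint, drop the learning rate, and run for tens of thousands of steps. `ToyModel`, the `"model_state"` checkpoint key, and the random data are stand-ins, not this repo's actual classes or file format; a real run would load the Tacotron synthesizer and your preprocessed mel/text pairs instead.

```python
# Minimal fine-tuning sketch. ToyModel and the random data are stand-ins
# for the repo's synthesizer and a preprocessed single-speaker dataset.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class ToyModel(nn.Module):  # stand-in for the synthesizer
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(80, 80)

    def forward(self, x):
        return self.net(x)

model = ToyModel()
# Resuming from a checkpoint; the "model_state" key is an assumption,
# check what the repo's own save/load code actually uses:
# ckpt = torch.load("pretrained.pt", map_location="cpu")
# model.load_state_dict(ckpt["model_state"])

# A lower learning rate than the from-scratch schedule is the usual
# precaution against catastrophic forgetting on ~1 hour of data.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.MSELoss()

# Fake single-speaker "dataset": 256 frames of 80-dim mels.
data = TensorDataset(torch.randn(256, 80), torch.randn(256, 80))
loader = DataLoader(data, batch_size=16, shuffle=True)

model.train()
step, max_steps = 0, 20_000  # "tens of thousands" of steps
while step < max_steps:
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        step += 1
        if step >= max_steps:
            break
```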

Do you think this will be useful?

In #126, @sberryman trained all three models from scratch, but he was not happy with the synthesizer and vocoder he trained. I'm not sure what it means because I don't have any experience with AI, but he says the synthesizer's attention did not align well? He did say the encoder was pretty good, though. He uploaded his models and the link still works, so maybe I can try something with the models he trained, since he has a better encoder.
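
For context on "align": the synthesizer's attention maps each decoder (mel) frame to encoder (text) positions, and during healthy training this forms a clean diagonal when plotted; a blurry or flat plot is the failure being described. A quick way to eyeball it (the alignment matrix below is synthetic for illustration; in practice you'd pull it from the model's attention weights):

```python
# Plotting an attention alignment matrix. A clean diagonal means the
# synthesizer is reading the text left-to-right as it generates audio;
# a blurry or flat plot is the "did not align" failure mode.
import numpy as np
import matplotlib.pyplot as plt

# Synthetic alignment for illustration, shape (decoder_steps, encoder_steps).
T_dec, T_enc = 200, 60
attn = np.exp(-((np.arange(T_dec)[:, None] / T_dec
                 - np.arange(T_enc)[None, :] / T_enc) ** 2) / 0.001)
attn /= attn.sum(axis=1, keepdims=True)  # normalize per decoder step

plt.imshow(attn.T, aspect="auto", origin="lower")
plt.xlabel("Decoder timestep (mel frames)")
plt.ylabel("Encoder timestep (input characters)")
plt.title("Attention alignment (diagonal = good)")
plt.colorbar()
plt.savefig("alignment.png")
```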

I don't have very fast storage (I'll be using external hard drives).
I have NVIDIA Quadro P2000 GPUs.
But to train each model, I'll use separate PCs with the same specs so they all train in parallel.

Any suggestions on playing around with hparams, different ideas for training, or anything else?
All suggestions are welcome and appreciated.

Also, if you have any ideas on training, let me know: what to train (fine-tune one of the pretrained models or train from scratch), what dataset to use, what hparams, and how much to train, and I'll do it. I have plenty of time.

Thanks!


oops408 commented Mar 15, 2023

Try treating the model architecture (e.g. location-based vs. content-based attention) and the loss function as hparams and see if those help the fine-tuning. I'm trying out SGD optimization to see if that improves the results. Oh yeah, maybe pitch shifting would be interesting as well...
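
A minimal sketch of the pitch-shifting idea using librosa (the wav path and shift amounts are placeholders, not anything from this repo):

```python
# Pitch-shift augmentation sketch with librosa; the input path is a placeholder.
import librosa
import soundfile as sf

wav, sr = librosa.load("speaker_utt.wav", sr=16000)
for n_steps in (-2, -1, 1, 2):  # shift by one or two semitones each way
    shifted = librosa.effects.pitch_shift(wav, sr=sr, n_steps=n_steps)
    sf.write(f"speaker_utt_shift{n_steps:+d}.wav", shifted, sr)

# The SGD experiment is a one-line optimizer swap in PyTorch, e.g.:
#   optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```

One caveat: pitch is part of speaker identity, so shifted copies may pull the encoder embedding away from the actual voice.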
