Pretrained Models Using Datasets Other Than LibriSpeech? #877
To date, no one has shared an alternative pretrained model that is compatible with the current (PyTorch) synthesizer. If you're willing to switch back to TensorFlow 1.x, there are a few in #400, including one model on LibriTTS. However, you can consider training from scratch on LibriTTS with the current repo, since you have experience with single-voice finetuning.
@blue-fish, thank you for the reply. If I were to try to train all three models (speaker encoder, synthesizer and vocoder) from scratch using LibriTTS, would you recommend train-clean-100 or train-clean-360? My understanding from reading the doctoral papers and Corentin's remarks is that for the speaker encoder you need lots of voices and quality is less important than quantity, while for the synthesizer and vocoder quality matters more than quantity.

If I were to do this, training the synthesizer alone would take a week, but I may give it a go. Any hints, tips or hparams settings you could share for such a project would be greatly appreciated. If I decide to try this, I would shoot for 300k steps to get down to a 1e-5 learning rate. Also, I have not tried any speaker encoder training yet using this repo. Any helpful information or hparams for that evolution you could share?

I need to do a bit of research first on LibriTTS to see what it can and cannot do regarding punctuation. If it will be no better than the current LibriSpeech-trained model, it may not be worth the time or effort. Your thoughts would be appreciated.
You can reuse the existing encoder and vocoder models. When training the synthesizer, make sure not to change the audio settings in the synthesizer hparams. Our observations on LibriTTS are in #449. Since this is your first time training a model from scratch, I suggest decreasing the model dimensions and using a larger reduction factor. This will help the model train faster, at the expense of quality. When you are confident things are working, revert to the defaults.
You can also decrease …
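For illustration, those tweaks might look something like the sketch below. The parameter names follow this repo's synthesizer/hparams.py, but the specific values are placeholders for a quick first run, not tested recommendations.

```python
# Sketch of synthesizer/hparams.py tweaks for a faster first training run.
# Values are illustrative only; revert to the defaults for a real run.

# Smaller model dimensions -> faster steps, lower output quality
tts_embed_dims = 256           # default: 512
tts_encoder_dims = 128         # default: 256
tts_decoder_dims = 64          # default: 128
tts_postnet_dims = 256         # default: 512

# A larger reduction factor r (first tuple element) decodes more mel frames
# per decoder step. Schedule tuples are (r, lr, step, batch_size).
tts_schedule = [(7, 1e-3,  20_000, 12),
                (4, 5e-4,  40_000, 12),
                (2, 1e-4, 160_000, 12),
                (2, 1e-5, 320_000, 12)]

# Drop overly long utterances during preprocessing (see #449)
max_mel_frames = 500           # default: 900

# Do NOT change the audio settings (sample_rate, n_fft, hop_size, win_size,
# num_mels, ...), or the mels will no longer match the pretrained encoder
# and vocoder.
```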
@blue-fish, thanks for the reply. If I decide to go forward on this effort, I would plan to use train-clean-360; it is easier to download and smaller in size. After reading #449, I agree that limiting max_mel_frames to 500 is a good idea. Thanks also for the accelerated training hparams info.
I suggest using both train-clean-100 and 360 to more closely match the training of the pretrained models. If you decide to pursue this, good luck and please consider sharing your models. |
@blue-fish said:

> I suggest using both train-clean-100 and 360 to more closely match the training of the pretrained models.

How can I use both? Do I run training for 100k steps on train-clean-100 and then train another 200k steps using train-clean-360 on top of that? Or can I simply combine them both in my datasets_root and train the combination once to 300k steps?

> If you decide to pursue this, good luck and please consider sharing your models.
Exactly. |
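In other words, extracting both subsets side by side under datasets_root and training on the combination in a single run is the way to go. As a sanity check, here is a minimal sketch for verifying the combined layout; the paths and folder names are assumptions based on the standard LibriTTS archives, not repo code:

```python
from pathlib import Path

# Illustrative only: paths assume the standard LibriTTS extraction layout
# (LibriTTS/<subset>/<speaker>/<chapter>/<utterance>) under datasets_root.
datasets_root = Path("datasets_root")

for subset in ["train-clean-100", "train-clean-360"]:
    subset_dir = datasets_root / "LibriTTS" / subset
    if not subset_dir.is_dir():
        print(f"{subset}: not found at {subset_dir}")
        continue
    speakers = [d for d in subset_dir.iterdir() if d.is_dir()]
    print(f"{subset}: {len(speakers)} speaker folders")
```

The preprocessing step then sees both subsets as one dataset, so a single training run to the target step count covers them together.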
@Tomcattwo Did you end up pursuing this? If yes, how is the training coming along? |
@blue-fish, I have not started on this project yet. I have a few other (semi-related) projects in the works now. I read the LibriTTS corpus paper and it sounds interesting. Frankly, I have gotten very good results from the single-voice trained models I am using for my current project, but there's always room for improvement. I would love to be able to "help" the synthesizer using punctuation, to tell it where to place the emphasis on a syllable or syllables in a multi-syllabic word...
Pleased to know that you are satisfied with the single voice models. Please reopen this issue if you start training a model from scratch. |
Hello all,
@blue-fish, I had very good success on my project to clone 14 voices from a computer simulation (samples available here) using single-voice training (5,000 additional steps) on the LibriSpeech pretrained synthesizer (295k) and vocoder.
However, I would like to see if another model (in English) might provide better output reproducibility, punctuation recognition, and perhaps some degree of emotion (with LibriTTS or some newer corpus that I am not aware of yet). Are you aware of any pretrained speaker encoder/synthesizer/vocoder models built on another dataset that might be available for download? I tried single-voice training of the synthesizer and vocoder on top of the LibriTTS synthesizer model, following your single-voice training instructions, but only got garbled output in the demo_toolbox, probably because the speaker encoder was trained on LibriSpeech and not on LibriTTS. Any info you or anyone else might have on a potential model-set download would be greatly appreciated.
Thanks in advance,
Tomcattwo