
Pretrained Models Using Datasets Other Than LibriSpeech? #877

Closed
Tomcattwo opened this issue Oct 24, 2021 · 11 comments

Comments

@Tomcattwo
Contributor

Hello all,
@blue-fish, I had very good success with my project to clone 14 voices from a computer simulation (samples available here) using single-voice training (5,000 additional steps) on the LibriSpeech-pretrained synthesizer (295k) and vocoder.

However, I would like to see whether another model (in English) might provide better output reproducibility, and perhaps punctuation recognition and a better degree of emotion (perhaps with LibriTTS or some newer corpus that I am not aware of yet). Are you aware of any pretrained speech encoder/synthesizer/vocoder models built on another dataset that might be available for download? I tried single-voice training of the synthesizer and vocoder on top of the LibriTTS synthesizer model, following your single-voice training instructions, but only got garbled output in the demo_toolbox, probably because the speech encoder was built on LibriSpeech and not on LibriTTS. Any info you or anyone else might have on a potential model-set download would be greatly appreciated.
Thanks in advance,
Tomcattwo

@ghost

ghost commented Oct 29, 2021

To date, no one has shared an alternative pretrained model that is compatible with the current (PyTorch) synthesizer. If you're willing to switch back to TensorFlow 1.x, there are a few in #400, including one model trained on LibriTTS. Alternatively, you could consider training from scratch on LibriTTS with the current repo, since you already have experience with single-voice finetuning.

@Tomcattwo
Contributor Author

@blue-fish, thank you for the reply. If I were to try to train all three models (voice encoder, synthesizer and vocoder) from scratch using LibriTTS, would you recommend using train-clean-100 or train-other-500? My understanding from reading the doctoral papers and Corentin's remarks is that the voice encoder needs lots of voices (quantity matters more than quality), while for the synthesizer and vocoder quality matters more than quantity. If I were to do this, training the synthesizer alone would take a week, but I may give it a go.

Any hints, tips or hparams settings you could share for such a project would be greatly appreciated. If I decide to try this, I would shoot for 300k steps to get down to a 1e-05 learning rate. Also, I have not tried any voice encoder training yet using this repo. Any helpful information or hparams you could share for that evolution?

I need to do a bit of research first on LibriTTS to see what it can and cannot do with respect to punctuation. If it will be no better than the current LibriSpeech-trained model, it may not be worth the time or effort. Your thoughts would be appreciated.
Regards,
TC2

@ghost

ghost commented Oct 29, 2021

You can reuse the existing encoder and vocoder models. When training the synthesizer, make sure not to change the audio settings in the synthesizer hparams.
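For reference, these are roughly the audio settings I mean. The names and values below are what I believe the repo defaults to be, so verify them against your local synthesizer/hparams.py rather than taking this list as authoritative:

# Audio settings that must stay aligned with the pretrained encoder/vocoder.
sample_rate = 16000,              # must match the encoder and vocoder models
n_fft = 800,
num_mels = 80,
hop_size = 200,                   # 12.5 ms frame shift at 16 kHz
win_size = 800,                   # 50 ms window at 16 kHz
fmin = 55,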

Our observations on LibriTTS are in #449.

Since this is your first time training a model from scratch, I suggest decreasing the model dimensions and using a larger reduction factor. This will help the model train faster, at the expense of quality. When you are confident things are working, revert to the defaults.

tts_embed_dims = 256,
tts_postnet_dims = 256,
tts_lstm_dims = 512,

tts_schedule = [(5,  1e-3,  20_000,  26),   # Progressive training schedule
                (5,  5e-4,  40_000,  26),   # (r, lr, step, batch_size)
                (5,  2e-4,  80_000,  26),   #
                (5,  1e-4, 160_000,  26),   # r = reduction factor (# of mel frames
                (5,  3e-5, 320_000,  26),   #     synthesized for each decoder iteration)
                (5,  1e-5, 640_000,  26)],  # lr = learning rate
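In case it helps while experimenting with the schedule, here is a minimal sketch (illustrative only, not the repo's exact code) of how a tts_schedule like the one above is consumed: each (r, lr, max_step, batch_size) tuple stays in effect until training reaches its step threshold, then the next tuple takes over.

def current_session(tts_schedule, step):
    # Walk the schedule until we find the first phase whose step
    # threshold has not been reached yet.
    for r, lr, max_step, batch_size in tts_schedule:
        if step < max_step:
            return r, lr, batch_size
    # Past the final threshold: keep the last phase's settings.
    r, lr, _, batch_size = tts_schedule[-1]
    return r, lr, batch_size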

@ghost

ghost commented Oct 29, 2021

You can also decrease max_mel_frames to a lower value (such as 500) to discard longer utterances. This will also speed up training.
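Roughly, the idea is that an utterance gets dropped during preprocessing when its mel spectrogram would exceed max_mel_frames; with a 200-sample hop at 16 kHz, 500 frames is about 6.25 seconds of audio. A hypothetical helper (not the repo's code) to illustrate:

def keep_utterance(wav, hop_size=200, max_mel_frames=500):
    # Number of mel frames this waveform would produce at the given hop size.
    n_frames = len(wav) // hop_size
    return n_frames <= max_mel_frames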

@Tomcattwo
Contributor Author

@blue-fish, thanks for the reply. If I decide to go forward with this effort, I would plan to use train-clean-360: it is easier to download and smaller in size. After reading #449, I agree that limiting max_mel_frames to 500 is a good idea. Thanks also for the accelerated-training hparams info.
R/
TC2

@ghost

ghost commented Oct 29, 2021

I suggest using both train-clean-100 and train-clean-360 to more closely match the training of the pretrained models. If you decide to pursue this, good luck, and please consider sharing your models.

@Tomcattwo
Contributor Author

@blue-fish said: "I suggest using both train-clean-100 and train-clean-360 to more closely match the training of the pretrained models."

How can I use both? Do I run training for 100k steps on train-clean-100, then train another 200k steps using train-clean-360 on top of that? Or can I simply combine them both in my datasets_root and train the combination once to 300k steps?

"If you decide to pursue this, good luck and please consider sharing your models."
Absolutely, assuming that the models come out sounding good. Happy to share plots, mid-training .wavs etc. upon request.
Regards,
TC2

@ghost

ghost commented Oct 31, 2021

"How can I use both? Combine them both together in my datasets_root and train the combination once to 300k steps?"

Exactly.
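For what it's worth, a quick sanity check you could run before preprocessing. The directory layout below is an assumption about where the LibriTTS subsets sit under datasets_root, so adjust the paths to your setup:

from pathlib import Path

datasets_root = Path("D:/datasets")            # hypothetical location
libritts = datasets_root / "LibriTTS"
for subset in ("train-clean-100", "train-clean-360"):
    path = libritts / subset
    print(subset, "found" if path.is_dir() else "MISSING", "at", path)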

@ghost

ghost commented Nov 7, 2021

@Tomcattwo Did you end up pursuing this? If yes, how is the training coming along?

@Tomcattwo
Contributor Author

@blue-fish, I have not started on this project yet. I have a few other (semi-related) projects in the works now. I read the LibriTTS corpus paper and it sounds interesting. Frankly, I have gotten very good results from the single-voice-trained models I am using for my current project, but there's always room for improvement. I would love to be able to "help" the synthesizer using punctuation, to tell it where to place the emphasis on a syllable or syllables in a multi-syllabic word...
I would like to give a from-scratch LibriTTS synthesizer base a try once I get some of these other projects behind me. I will let you know when I start, and I will keep you apprised of progress. No doubt I will hit some snags and will solicit your always-helpful advice.
Regards,
Tomcattwo

@ghost

ghost commented Nov 8, 2021

Pleased to know that you are satisfied with the single-voice models. Please reopen this issue if you start training a model from scratch.

ghost closed this as completed Nov 8, 2021