Hey,
I am interested in training a single-language model with less accent variation. I would like to train a good-quality model on about 1000 speakers, then fine-tune it on a single speaker (as in #437) with anywhere from 5 minutes to several hours of audio, to end up with a good single-speaker model. My question: does the model benefit from an embedding size (or only the hidden size in the encoder) of 768, as sberryman used in #126, even though training time and VRAM usage increase heavily? Or is that only worthwhile for multi-language/multi-accent models, so I would definitely be wasting my time with it, or even get worse results?
I also use 48000 as sample_rate, since most of my samples (from Common Voice) are in 48 kHz; maybe this has an impact?
Thanks in advance :)
I didn't notice any improvement in synthesizer quality when training with sberryman's encoder. It seems 768 is much too big for the number of speakers in the dataset.
You can use the table in arXiv:1806.04558 to inform further experimentation in this area.
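For concreteness, the dimensions under discussion live in the encoder's hyperparameter module (`encoder/params_model.py` in CorentinJ/Real-Time-Voice-Cloning). A minimal sketch of the default values, which are assumed from the repo and should be verified against your checkout:

```python
# Sketch of the encoder model hyperparameters (values assumed from
# encoder/params_model.py in CorentinJ/Real-Time-Voice-Cloning;
# verify against your own checkout before relying on them).
model_hidden_size = 256     # hidden size of each LSTM layer
model_embedding_size = 256  # size of the final speaker embedding
model_num_layers = 3        # number of stacked LSTM layers

# sberryman's fork (#126) raised hidden and embedding size to 768 for a
# much larger, multi-accent speaker set. For a dataset on the order of
# 1k-10k speakers, the 256 defaults are the suggested starting point.
print(model_embedding_size)
```

Bumping these two values (and retraining from scratch) is all that the 768 experiment changes on the model side; the synthesizer then has to be trained against embeddings of the new size as well.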
My experience suggests that recording quality is much more important than sample rate.
Okay, thank you. My encoder training set will have around 10,000 speakers at most, so I will definitely use 256 for the embedding size. Regarding quality, I totally agree, but since I already have 48 kHz audio, I thought quality might even be slightly better if I also use this rate in training (I will train a 48 kHz vocoder from scratch as well).
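If the 48 kHz Common Voice clips ever need to be brought down to the pipeline's native 16 kHz instead, the integer 3:1 ratio makes polyphase resampling straightforward. A minimal sketch using `scipy.signal.resample_poly` (the function name and helper below are illustrative, not part of the repo):

```python
import numpy as np
from scipy.signal import resample_poly


def downsample_48k_to_16k(wav_48k: np.ndarray) -> np.ndarray:
    """Polyphase resampling from 48 kHz to 16 kHz (exact 3:1 ratio)."""
    return resample_poly(wav_48k, up=1, down=3)


# One second of a 440 Hz tone at 48 kHz becomes 16000 samples at 16 kHz.
t = np.arange(48000) / 48000
tone = np.sin(2 * np.pi * 440 * t)
wav_16k = downsample_48k_to_16k(tone)
print(len(wav_16k))  # 16000
```

Keeping everything at 48 kHz instead means adjusting the mel/STFT hyperparameters consistently across encoder, synthesizer, and vocoder, which is a larger change than the resample.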