
General question about embedding size #840

Closed
Bebaam opened this issue Sep 6, 2021 · 2 comments

Bebaam commented Sep 6, 2021

Hey,
I am interested in training a single-language model that also has less of an accent. I would like to train a good-quality model on about 1000 speakers, and then fine-tune on a single speaker (like #437), with 5 minutes or even hours of audio, to finally get a good single-speaker model. My question is: does the model benefit from an embedding size of 768 (or only a larger hidden size in the encoder), as sberryman used in #126, even though training time and VRAM usage increase heavily? Or is that only interesting for multi-language/multi-accent models, so I would be wasting my time with it, or even get worse results?
I also use 48000 as sample_rate, since most of my samples (from Common Voice) are 48 kHz; maybe this has an impact?
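
For reference, these are the parameters I am talking about; a sketch assuming the default values in encoder/params_model.py and encoder/params_data.py (names could differ in other forks; #126 raised the two model sizes to 768):

```python
# encoder/params_model.py (assumed defaults in this repo)
model_hidden_size = 256     # LSTM hidden size of the speaker encoder
model_embedding_size = 256  # size of the speaker embedding produced per utterance
model_num_layers = 3        # number of LSTM layers

# encoder/params_data.py (assumed default)
sampling_rate = 16000       # rate the encoder preprocessing expects; my data is 48 kHz
```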

Thanks in advance :)

ghost commented Sep 8, 2021

I didn't notice any improvement in synthesizer quality when training with sberryman's encoder. It seems 768 is much too big for the number of speakers in the dataset.

You can use the table in 1806.04558 to inform further experimentation in this area.

[Screenshot: the referenced table from arXiv:1806.04558]

My experience suggests that recording quality is much more important than sample rate.
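
If you do want to keep your 48 kHz Common Voice clips but feed an encoder that expects a lower rate, resampling at load time is enough. A minimal sketch using librosa (file names are placeholders; 16 kHz is assumed, use whatever your params actually specify):

```python
import librosa
import soundfile as sf

# Resample a 48 kHz Common Voice clip to the rate the encoder expects.
target_sr = 16000
wav, _ = librosa.load("common_voice_clip_48k.wav", sr=target_sr)  # resamples on load

# Optionally cache the resampled copy to disk during preprocessing.
sf.write("common_voice_clip_16k.wav", wav, target_sr)
```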

Bebaam commented Sep 8, 2021

Okay, thank you. My encoder training set will have around 10,000 speakers at most, so I will definitely use 256 for the embedding size. Regarding recording quality I totally agree, but since I already have 48 kHz audio, I thought quality might even be a bit better if I also use this rate in training (I will train a vocoder at 48 kHz from scratch as well).
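
In case it is useful to anyone doing the same, the STFT/mel parameters need to be scaled with the sample rate so the frame durations stay the same; a quick sketch of that arithmetic (parameter names are illustrative, not the repo's):

```python
# Keep the same frame timing when moving from 16 kHz to 48 kHz. 12.5 ms hop
# and 50 ms window are common Tacotron-style defaults; adjust to your config.
def frame_params(sample_rate, hop_ms=12.5, win_ms=50.0):
    hop_length = int(sample_rate * hop_ms / 1000)
    win_length = int(sample_rate * win_ms / 1000)
    return hop_length, win_length

print(frame_params(16000))  # (200, 800)
print(frame_params(48000))  # (600, 2400)
```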
