
num_steps of training for those demo samples? #5

Open
bayesrule opened this issue Jul 16, 2019 · 5 comments
Labels: question (Further information is requested)

Comments

@bayesrule commented Jul 16, 2019

Hi,

This repo is really great. May I ask how many training steps (with batch_size 32) were required for your demo samples? Given the amount of training data used here (around 26 hours of recordings), I guess the 100k num_steps provided in config.json isn't enough, right?
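For reference, the parameters I mean are the ones in config.json, which I'm reading with something like this (the key names here are my guess at the layout, please correct me if the file is structured differently):

```python
import json

# Inspect the training hyperparameters shipped with the repo.
# NOTE: the key names below are assumptions about config.json's layout.
with open("config.json") as f:
    config = json.load(f)

print(config.get("num_steps"))   # 100000 in the released config?
print(config.get("batch_size"))  # 32?
```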

Many thanks!

@bshall (Owner) commented Jul 17, 2019

Hi @bayesrule,

Thanks! The audio on the demo page was generated with the pretrained model I uploaded, which was only trained for 100k steps. I was also surprised by how quickly it trains: you get intelligible samples by 20k steps and decent results by 60k-80k steps.
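If you want to track that progression yourself, the easiest way is to save a checkpoint every 20k steps or so and listen to generated samples at each one. A rough sketch of what I mean (dummy model and loss so it runs standalone; the names are illustrative, not the exact code in this repo):

```python
import torch
from torch import nn, optim

# Dummy stand-ins so the sketch runs end to end; swap in the real
# vocoder model, dataloader, and loss from this repo.
model = nn.Linear(80, 1)
optimizer = optim.Adam(model.parameters(), lr=4e-4)

num_steps = 100_000
checkpoint_interval = 20_000  # listen at 20k, 40k, 60k, 80k, 100k

for step in range(1, num_steps + 1):
    mel = torch.randn(32, 80)        # fake batch (batch_size 32)
    loss = model(mel).pow(2).mean()  # fake loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % checkpoint_interval == 0:
        torch.save({"model": model.state_dict(), "step": step},
                   f"model-{step}.pt")
```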

I've noticed that the generated audio for the out-of-domain speakers is a bit noisy. I'm not sure whether longer training would help with that or whether it's a limitation of the ZeroSpeech dataset (which is pretty noisy).

bshall added the question label on Aug 22, 2019
@te0006 commented Sep 5, 2019

Hi @bshall,

> I was also surprised by how quickly it trains.

Could you share some data points on absolute training time vs. corpus size and the hardware used?
I'm building a TTS prototype based on Tacotron and am looking for a vocoder with better quality than Griffin-Lim (GL) but less training effort than required by e.g. WaveNet.
Thanks!

@tarepan commented Sep 5, 2019

Hi @te0006,
Here are my results; I'd be glad if they're useful to you.

https://tarepan.github.io/UniversalVocoding/

Dataset: about 10 hours of utterances in total
Machine: Google Colab T4
Other details: on the GitHub Pages site above

My impression is that RNN_MS trains surprisingly fast and is robust.
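If you want a quick estimate for your own hardware, timing a handful of optimizer steps and extrapolating works well. A minimal sketch (dummy model so it runs anywhere; replace it with the real training step):

```python
import time
import torch
from torch import nn, optim

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(80, 1).to(device)   # dummy model, illustrative only
optimizer = optim.Adam(model.parameters())
batch = torch.randn(32, 80, device=device)

n = 100
if device == "cuda":
    torch.cuda.synchronize()  # make GPU timing accurate
start = time.perf_counter()
for _ in range(n):
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
if device == "cuda":
    torch.cuda.synchronize()
per_step = (time.perf_counter() - start) / n
print(f"{per_step:.4f} s/step -> "
      f"{per_step * 100_000 / 3600:.1f} h for 100k steps")
```

For example, at 0.3 s/step, 60k steps take about 5 hours.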

@te0006 commented Sep 6, 2019

Hello, thanks for replying so quickly.

For such a short training run (5 hrs / 60k steps) your results certainly sound impressive.

I think training time is often neglected in publications even though it is critically important for people looking to integrate or adapt a method, where you want to be able to experiment with parameters without prohibitive computational cost.

BTW, your last (English) sound example seems to exhibit considerably more noise and distortion than the Japanese ones (but perhaps, not speaking the language and thus not being used to hearing it, I simply can't hear the artifacts in the Japanese examples).

Do you already have experience w.r.t how far (and how fast) the speech quality improves with more training time?

@tarepan commented Sep 7, 2019

Many reproducible implementations (including this repository) kindly report their training time. I agree with you and hope the papers themselves would report it too.

Your hearing is correct: the out-of-domain English utterance is noisier. In my opinion, this is because of the language difference; English contains phonemes that do not occur in Japanese.

As for how quality improves with more training time: I don't have that data yet, but I will experiment with it.
