
num_steps of training for those demo samples? #5

Open
bayesrule opened this issue Jul 16, 2019 · 5 comments
Labels: question (Further information is requested)

Comments

@bayesrule commented Jul 16, 2019

Hi,

This repo is really great. May I ask how many training steps (with batch_size 32) were required for your demo samples? Given the amount of training data used here (around 26 hours of recordings), I guess the 100k num_steps provided in config.json isn't enough, right?
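For reference, the parameters I mean are the ones in config.json, which I'm reading with something like this (the key names here are my guess at the layout, please correct me if the file is structured differently):

```python
import json

# Inspect the training hyperparameters shipped with the repo.
# NOTE: the key names below are assumptions about config.json's layout.
with open("config.json") as f:
    config = json.load(f)

print(config.get("num_steps"))   # 100000 in the released config?
print(config.get("batch_size"))  # 32?
```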

Many thanks!

@bshall (Owner) commented Jul 17, 2019

Hi @bayesrule,

Thanks! The audio on the demo page was generated with the pretrained model I uploaded, which was only trained for 100k steps. I was also surprised by how quickly it trains: you get intelligible samples by 20k steps and decent results by 60k-80k steps.
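If you want to track that progression yourself, the easiest way is to save a checkpoint every 20k steps or so and listen to generated samples at each one. A rough sketch of what I mean (dummy model and loss so it runs standalone; the names are illustrative, not the exact code in this repo):

```python
import torch
from torch import nn, optim

# Dummy stand-ins so the sketch runs end to end; swap in the real
# vocoder model, dataloader, and loss from this repo.
model = nn.Linear(80, 1)
optimizer = optim.Adam(model.parameters(), lr=4e-4)

num_steps = 100_000
checkpoint_interval = 20_000  # listen at 20k, 40k, 60k, 80k, 100k

for step in range(1, num_steps + 1):
    mel = torch.randn(32, 80)        # fake batch (batch_size 32)
    loss = model(mel).pow(2).mean()  # fake loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % checkpoint_interval == 0:
        torch.save({"model": model.state_dict(), "step": step},
                   f"model-{step}.pt")
```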

I've noticed that the generated audio for the out-of-domain speakers is a bit noisy. I'm not sure whether longer training would help with that or whether it's a limitation of the ZeroSpeech dataset (which is pretty noisy).

bshall added the question label on Aug 22, 2019
@te0006 commented Sep 5, 2019

Hi @bshall,

> I was also surprised by how quickly it trains.

Could you share some data points on absolute training time vs. corpus size and the hardware used?
I'm building a TTS prototype based on Tacotron and am looking for a vocoder with better quality than Griffin-Lim (GL) but less training effort than required by e.g. WaveNet.
Thanks!

@tarepan commented Sep 5, 2019

Hi @te0006,
Here are my results; I'd be glad if they're useful to you.

https://tarepan.github.io/UniversalVocoding/

Dataset: about 10 hours of utterances in total
Machine: Google Colab T4
Other details: on the GitHub Pages site above

My impression is that RNN_MS trains surprisingly fast and is robust.
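If you want a quick estimate for your own hardware, timing a handful of optimizer steps and extrapolating works well. A minimal sketch (dummy model so it runs anywhere; replace it with the real training step):

```python
import time
import torch
from torch import nn, optim

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(80, 1).to(device)   # dummy model, illustrative only
optimizer = optim.Adam(model.parameters())
batch = torch.randn(32, 80, device=device)

n = 100
if device == "cuda":
    torch.cuda.synchronize()  # make GPU timing accurate
start = time.perf_counter()
for _ in range(n):
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
if device == "cuda":
    torch.cuda.synchronize()
per_step = (time.perf_counter() - start) / n
print(f"{per_step:.4f} s/step -> "
      f"{per_step * 100_000 / 3600:.1f} h for 100k steps")
```

For example, at 0.3 s/step, 60k steps take about 5 hours.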

@te0006 commented Sep 6, 2019

Hello, thanks for replying so quickly.

For such a short training run (5 hrs / 60k steps) your results certainly sound impressive.

I think training time is often neglected in publications even though it is critically important for people looking to integrate or adapt a method, where you want to be able to experiment with parameters without prohibitive computational cost.

BTW, your last (English) sound example seems to exhibit considerably more noise and distortion than the Japanese ones (but perhaps, not speaking the language and thus not being used to hearing it, I simply can't hear the artifacts in the Japanese examples).

Do you already have experience w.r.t how far (and how fast) the speech quality improves with more training time?

@tarepan commented Sep 7, 2019

Many reproducible implementations (including this repository) kindly report their training time. I agree with you and hope the papers themselves would report it too.

Your hearing is correct: the out-of-domain English utterance is noisier. In my opinion, this is because of the language difference; English contains phonemes that do not occur in Japanese.

As for how quality improves with more training time: I don't have that data yet, but I will experiment with it.
