Make a high-quality public domain training set using Mozilla DeepSpeech and LibriVox (idea/enhancement) #34

Closed
Motherboard opened this issue May 23, 2018 · 6 comments


@Motherboard

As I understand it, the difference between Google's model and the pretrained model available here is the quality and size of the training set.

Would it be possible to take a high-quality, long LibriVox recording and use Mozilla's STT model to pinpoint the timing of each spoken word? We already have the ground-truth text from LibriVox, so it's only a matter of timing it.

We could get some tens of hours of single-speaker recordings this way.

Does this make sense? How easy would it be to accomplish? I could have a go if it's not hard. I haven't messed with DeepSpeech yet, and I haven't looked at how the dataset is encoded, so I don't know how hard it is or how much work is involved.
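
A rough sketch of the timing step could look like the following, assuming the deepspeech Python package (the 0.7+ metadata API), a pretrained model and scorer, and a 16 kHz mono 16-bit WAV of one LibriVox chapter. The file names are placeholders, and matching the decoded tokens back to the known LibriVox text is left out:

```python
# Rough sketch, not a tested pipeline. Assumptions: deepspeech Python package
# (0.7+ metadata API), a pretrained model and scorer, and a 16 kHz mono
# 16-bit PCM WAV of one LibriVox chapter. File names are placeholders.
import wave

import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.7.4-models.pbmm")             # pretrained acoustic model
ds.enableExternalScorer("deepspeech-0.7.4-models.scorer")

with wave.open("librivox_chapter.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

# sttWithMetadata exposes per-character tokens with start times in seconds,
# which is enough to recover approximate word boundaries.
metadata = ds.sttWithMetadata(audio, 1)

words, current, start = [], "", None
for token in metadata.transcripts[0].tokens:
    if token.text == " ":
        if current:
            words.append((current, start))
        current, start = "", None
    else:
        if start is None:
            start = token.start_time
        current += token.text
if current:
    words.append((current, start))

print(words[:10])  # list of (predicted_word, start_time_in_seconds) pairs
```

The predicted words and start times could then be matched against the known LibriVox transcript (for example with a simple edit-distance alignment) and the recording cut into utterance-sized clips with ground-truth text.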

@cuuupid
Contributor

cuuupid commented May 24, 2018

I'm experimenting with a similar idea: we could also feed the raw text into another TTS model and generate lots of training data that way. This could boost the model's accuracy and coherence, and then we could further condition using WaveNet and retrain on human data.

Google's dataset is apparently around 25 hours, so we would need about that amount of training data (roughly 4x what exists right now).
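
As a minimal sketch of the text-to-synthetic-audio part of that idea, the loop below uses gTTS purely as a stand-in for "another TTS model" (an assumption, not a recommendation from this thread); the input file, output paths, and LJSpeech-style `id|text` metadata.csv layout are also assumptions:

```python
# Minimal sketch: synthesize one clip per line of raw text and write an
# LJSpeech-style metadata.csv ("id|text"). gTTS is only a placeholder for
# whatever TTS model is actually used; paths are hypothetical.
import csv
from pathlib import Path

from gtts import gTTS

out_dir = Path("synthetic_clips")
out_dir.mkdir(exist_ok=True)

with open("raw_text.txt") as f, open("metadata.csv", "w", newline="") as meta:
    writer = csv.writer(meta, delimiter="|")
    for i, line in enumerate(s.strip() for s in f if s.strip()):
        clip = out_dir / f"synth_{i:05d}.mp3"
        gTTS(line, lang="en").save(str(clip))   # synthesize one utterance
        writer.writerow([clip.stem, line])
```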

@erogol
Contributor

erogol commented May 24, 2018

@Motherboard the real difference between Google's model and TTS is WaveNet. It gives a huge boost in fidelity. It is kind of the holy grail of TTS systems right now; nobody except Google has made it work for real-time systems. And I believe they use > 25 h of data for their deployed system, contrary to what they suggest in the paper. Otherwise, trained TTS models are really weak at generalizing to unseen words, especially if they are trained with characters rather than phonemes.

I think it is quite a smart way to segment the data, if it works as you described. If you try this, please let me know the result. At first sight, though, it looks like a viable way to curate a dataset.

@pshah123 Using another TTS system is a sassy way to augment data, and it might lead to licensing issues if you use the result professionally.

@erogol
Contributor

erogol commented May 24, 2018

@Motherboard regarding the Mozilla Common Voice data, I need to do some more work here to make TTS more stable before delving into data curation. However, it is definitely in the queue.

@Motherboard
Author

Motherboard commented May 24, 2018

Thanks for the input.

TTS is based on Tacotron, right? Google's Tacotron model (which, to the best of my knowledge, uses the Griffin-Lim vocoder, not WaveNet) sounds far superior to any public model I've heard so far. It also sounds superior to public Tacotron 2 models that utilize WaveNet, and r9y9's WaveNet model sounds quite good, so I'm not sure the vocoder is what's holding the system back.
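
For context, Griffin-Lim iteratively re-estimates the phase that is discarded when a model predicts only a magnitude spectrogram; that phase-recovery step is what a neural vocoder such as WaveNet replaces. A quick way to hear what Griffin-Lim alone costs is to round-trip a clip through a magnitude-only STFT. A small sketch with librosa (assuming librosa >= 0.7; the file name is a placeholder):

```python
import numpy as np
import librosa
import soundfile as sf

# Magnitude-only STFT of a reference clip (phase thrown away on purpose).
y, sr = librosa.load("sample.wav", sr=22050)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Griffin-Lim iteratively re-estimates the missing phase; this is the step
# a neural vocoder (e.g. WaveNet) replaces.
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=256, win_length=1024)
sf.write("sample_griffinlim.wav", y_hat, sr)
```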

By the way, what are the downsides of using Tacotron 2 over Tacotron? There's a BSD-licensed PyTorch implementation of Tacotron 2 in NVIDIA's git repository, and there's also r9y9's implementation; why not move to one of these instead of stabilizing a Tacotron-based application?

Also, Facebook had a paper out showing that VoiceLoop (and even Char2Wav) gets a better MOS than Tacotron on the publicly available datasets...

By the way, I really like your blog :)

@erogol
Contributor

erogol commented May 24, 2018

@Motherboard I don't remember the paper exactly, but they might be using phonemes for English instead of characters. That might explain the difference. Otherwise, I am not quite sure what else it could be. Maybe hyperparameters, slight engineering tweaks, better data, or a small bug in our model :)
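
To make the character-vs-phoneme distinction concrete: a character front end feeds raw graphemes to the model, while a phoneme front end first runs the text through a grapheme-to-phoneme step so pronunciation is resolved before training. A small sketch using the phonemizer package with the espeak backend (an assumption; the paper may use a different G2P tool):

```python
from phonemizer import phonemize

text = "The merlion statue overlooks the bay."

# Character front end: the model sees raw graphemes and has to infer the
# pronunciation of rare words like "merlion" on its own.
chars = list(text.lower())

# Phoneme front end: a grapheme-to-phoneme step (espeak backend here)
# resolves pronunciation up front, which helps with unseen words.
phones = phonemize(text, language="en-us", backend="espeak", strip=True)

print(chars)
print(phones)
```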

I have not tried Tacotron 2. However, from what I see in other papers (mostly about speaker embeddings), they use Tacotron over Tacotron 2 for some reason.

I have started adding changes towards Tacotron 2 (#26), but it is a slow process since I like to see the effect of each change on the results. So far nothing has shown a promising improvement. My feeling is that Google uses Tacotron 2 because WaveNet is able to make up for the sacrifices of the architectural change, so I am skeptical that Tacotron 2 is better with any other vocoder.

I have a VoiceLoop implementation as well, but it also uses phonemes and I cannot make it learn with raw characters. Since the use of phonemes is a limiting factor for moving to other languages, I'd prefer to go with Tacotron.

THX :)

@erogol
Contributor

erogol commented Jul 4, 2018

No activity here; feel free to reopen.

@erogol erogol closed this as completed Jul 4, 2018