
Vocoder Training #40

Closed
G-Wang opened this issue Jul 8, 2019 · 6 comments


G-Wang commented Jul 8, 2019

Hello, very nice repo, especially the implementation of the d-vector speaker verification architecture.

Quick question about vocoder training: does the WaveRNN vocoder take the d-vector speaker embedding as an input, or is it just trained on all speakers in the training dataset without explicit conditioning? I'm curious about any vocoder experiments you've run in this regard.

@CorentinJ (Owner)

Thanks.

The vocoder is not conditioned on anything other than the mel spectrograms, as also stated in the SV2TTS paper, section 2.3:

The network is not directly conditioned on the output of the speaker encoder. The mel spectrogram predicted by the synthesizer network captures all of the relevant detail needed for high quality synthesis of a variety of voices, allowing a multispeaker vocoder to be constructed by simply training on data from many speakers.

You can also see it on this diagram:
[SV2TTS system overview diagram]

I don't know about explicit conditioning of the vocoder; I think that all the necessary features are generated by the synthesizer alone. I did, however, think about indirect conditioning by fine-tuning all three models at once in a complete end-to-end training loop. Then I remembered that my implementation is Frankenstein-worthy with its mix of PyTorch and TensorFlow, and I quickly gave up on that idea.
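
To make that concrete, here is a minimal, hypothetical sketch of a WaveRNN-style vocoder step conditioned only on (upsampled) mel frames. The class, argument names, and dimensions are illustrative assumptions, not this repo's actual API; the point is simply that no speaker embedding appears anywhere in the forward pass:

```python
import torch
import torch.nn as nn

class MelOnlyVocoder(nn.Module):
    """Toy WaveRNN-like vocoder: the mel spectrogram is the only conditioning signal."""
    def __init__(self, n_mels=80, rnn_dims=512, n_classes=256):
        super().__init__()
        # Input at each step: previous waveform sample + one upsampled mel frame.
        self.rnn = nn.GRU(input_size=n_mels + 1, hidden_size=rnn_dims, batch_first=True)
        self.fc = nn.Linear(rnn_dims, n_classes)  # logits over quantized sample values

    def forward(self, prev_samples, mels):
        # prev_samples: (batch, time, 1), mels: (batch, time, n_mels)
        x = torch.cat([prev_samples, mels], dim=-1)  # note: no d-vector concatenated here
        h, _ = self.rnn(x)
        return self.fc(h)

# Smoke test with random tensors
model = MelOnlyVocoder()
logits = model(torch.rand(2, 100, 1), torch.rand(2, 100, 80))
print(logits.shape)  # torch.Size([2, 100, 256])
```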


G-Wang commented Jul 8, 2019

Thanks for the clarification. I agree the mel spectrogram should contain sufficient information for the vocoder without explicit speaker conditioning.

I just found that a universal WaveRNN vocoder has been implemented and trained on LibriTTS: mozilla/TTS#221, using the same fatchord WaveRNN model you used.

The out-of-sample audio sounds decent: https://soundcloud.com/user-565970875/sets/universal-vocoder-with-wavernn. However, it still exhibits some small artifacts/noise, which fine-tuning on a few samples will hopefully fix.

G-Wang closed this as completed Jul 8, 2019
@CorentinJ (Owner)

I believe the part of my implementation that would need work is the synthesizer. Unfortunately, I don't have much time or GPU power to work on a better implementation. I had planned to look at that Mozilla implementation; maybe I'll find some interesting things there.

@MorganCZY

@CorentinJ, you mentioned in your thesis that you tried to train the WaveRNN with a pruning algorithm while working on the thesis, but that you didn't get a complete, well-trained model due to time limitations. Are you still working on this part? If not, have you ever thought about releasing the related scripts?


CorentinJ commented Jul 13, 2019

Oh yeah, it's a bit unclear there. It says that I experimented with pruning and that I report my results in section 3.5.2, but the experimenting was more about seeing if I could make the model run faster by pruning, not about improving quality. Kalchbrenner et al. mention improvements to both speed and quality, although quality seems to be their main concern. My conclusion is this paragraph:

Sparse tensors are, at the time of writing, yet an experimental feature in PyTorch. Their implementation might not be as efficient as the one the authors used. Through experiments, we find that the matrix multiply operation addmm for a sparse matrix and a dense vector only breaks even time-wise with the dense-only addmm for levels of sparsity above 91%. Below this value, using sparse tensors will actually slow down the forward pass speed. The authors report sparsity levels of 96.4% and 97.8% (Kalchbrenner et al., 2018, Table 5) while maintaining decent performances. Our tests indicate that, at best, a sparsity level of 96.4% would lower the real-time threshold to 7.86 seconds, and a level of 97.8% to 4.44 seconds. These are optimistic lower bounds on the actual threshold due to our assumption of constant time inference, and also because some layers in the model cannot be sparsified. This preliminary analysis indicates that pruning the vocoder would be beneficial to inference speed.

tl;dr: pruning might bring a small improvement.

So I did not actually train a full model with pruning; what I did was replace the layers in the alternative WaveRNN with sparse layers and look at the inference speed for various levels of sparsity. If you want a training script with pruning, I'd recommend looking at fatchord's repo.
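
For anyone curious what such a timing check could look like, here is a rough, self-contained sketch that compares dense addmm against torch.sparse.addmm for a single matrix-vector product at a few sparsity levels. The matrix size, repetition count, and magnitude-based pruning are assumptions for illustration, not the exact setup from the thesis:

```python
import time
import torch

def benchmark(sparsity, dim=512, reps=200):
    """Time one dense vs. sparse matrix-vector addmm at a given sparsity level."""
    weight = torch.randn(dim, dim)
    # Zero out the smallest-magnitude weights to reach the target sparsity.
    k = int(sparsity * weight.numel())
    threshold = weight.abs().flatten().kthvalue(k).values
    pruned = torch.where(weight.abs() > threshold, weight, torch.zeros_like(weight))
    sparse_weight = pruned.to_sparse()

    bias = torch.randn(dim, 1)
    vec = torch.randn(dim, 1)

    t0 = time.perf_counter()
    for _ in range(reps):
        torch.addmm(bias, weight, vec)                # dense path
    dense_t = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(reps):
        torch.sparse.addmm(bias, sparse_weight, vec)  # sparse path
    sparse_t = time.perf_counter() - t0
    return dense_t, sparse_t

for s in (0.90, 0.964, 0.978):
    dense_t, sparse_t = benchmark(s)
    print(f"sparsity {s:.1%}: dense {dense_t*1e3:.1f} ms vs sparse {sparse_t*1e3:.1f} ms")
```

The break-even point will vary with hardware and PyTorch version, so the ~91% figure quoted above should be read as specific to the setup used in the thesis.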
