
Vocoder Training #40

Closed
G-Wang opened this issue Jul 8, 2019 · 6 comments


G-Wang commented Jul 8, 2019

Hello, very nice repo, especially the implementation of the d-vector speaker verification architecture.

Quick question about vocoder training: does the WaveRNN vocoder take the d-vector speaker embedding as an input, or is it just trained on all speakers in the training dataset without explicit conditioning? I'm curious about any vocoder experiments you've run in this regard.

@CorentinJ (Owner)

Thanks.

The vocoder is not conditioned on anything other than the mel spectrograms, as also stated in the SV2TTS paper, section 2.3:

The network is not directly conditioned on the output of the speaker encoder. The mel spectrogram predicted by the synthesizer network captures all of the relevant detail needed for high quality synthesis of a variety of voices, allowing a multispeaker vocoder to be constructed by simply training on data from many speakers.

You can also see it on this diagram:
[SV2TTS system overview diagram]

I don't know about explicit conditioning of the vocoder; I think that all the necessary features are generated by the synthesizer alone. I did, however, think about indirect conditioning by fine-tuning all three models at once in a complete end-to-end training loop. Then I remembered that my implementation is Frankenstein-worthy with its mix of PyTorch and TensorFlow, and I quickly gave up on that idea.
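
To make that concrete, here is a minimal, hypothetical sketch of a WaveRNN-style vocoder step conditioned only on (upsampled) mel frames. The class, argument names, and dimensions are illustrative assumptions, not this repo's actual API; the point is simply that no speaker embedding appears anywhere in the forward pass:

```python
import torch
import torch.nn as nn

class MelOnlyVocoder(nn.Module):
    """Toy WaveRNN-like vocoder: the mel spectrogram is the only conditioning signal."""
    def __init__(self, n_mels=80, rnn_dims=512, n_classes=256):
        super().__init__()
        # Input at each step: previous waveform sample + one upsampled mel frame.
        self.rnn = nn.GRU(input_size=n_mels + 1, hidden_size=rnn_dims, batch_first=True)
        self.fc = nn.Linear(rnn_dims, n_classes)  # logits over quantized sample values

    def forward(self, prev_samples, mels):
        # prev_samples: (batch, time, 1), mels: (batch, time, n_mels)
        x = torch.cat([prev_samples, mels], dim=-1)  # note: no d-vector concatenated here
        h, _ = self.rnn(x)
        return self.fc(h)

# Smoke test with random tensors
model = MelOnlyVocoder()
logits = model(torch.rand(2, 100, 1), torch.rand(2, 100, 80))
print(logits.shape)  # torch.Size([2, 100, 256])
```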


G-Wang commented Jul 8, 2019

Thanks for the clarification. I agree the mel spectrogram should contain sufficient information for the vocoder without explicit speaker conditioning.

I just found that a universal WaveRNN vocoder has been implemented and trained on LibriTTS: mozilla/TTS#221, using the same fatchord WaveRNN model you used.

The out-of-sample audio sounds decent: https://soundcloud.com/user-565970875/sets/universal-vocoder-with-wavernn. However, it still exhibits some small artifacts/noise, which fine-tuning on a few samples will hopefully fix.

G-Wang closed this as completed Jul 8, 2019
@CorentinJ (Owner)

I believe the part of my implementation that would need work is the synthesizer. Unfortunately, I don't have much time or GPU power to work on a better implementation. I had planned to look at that Mozilla implementation; maybe I'll find some interesting things there.

@MorganCZY

@CorentinJ, you mentioned in your thesis that you tried to train the WaveRNN with a pruning algorithm while working on the thesis, but that you didn't get a complete, well-trained model due to time limitations. Are you still working on this part? If not, have you ever thought about releasing the related scripts?


CorentinJ commented Jul 13, 2019

Oh yeah, it's a bit unclear there. It says that I experimented with pruning and that I report my results in section 3.5.2, but the experimenting was more about seeing if I could make the model run faster by pruning, not about improving quality. Kalchbrenner et al. mention improvements to both speed and quality, although quality seems to be their main concern. My conclusion is this paragraph:

Sparse tensors are, at the time of writing, yet an experimental feature in PyTorch. Their implementation might not be as efficient as the one the authors used. Through experiments, we find that the matrix multiply operation addmm for a sparse matrix and a dense vector only breaks even time-wise with the dense-only addmm for levels of sparsity above 91%. Below this value, using sparse tensors will actually slow down the forward pass speed. The authors report sparsity levels of 96.4% and 97.8% (Kalchbrenner et al., 2018, Table 5) while maintaining decent performances. Our tests indicate that, at best, a sparsity level of 96.4% would lower the real-time threshold to 7.86 seconds, and a level of 97.8% to 4.44 seconds. These are optimistic lower bounds on the actual threshold due to our assumption of constant time inference, and also because some layers in the model cannot be sparsified. This preliminary analysis indicates that pruning the vocoder would be beneficial to inference speed.

tl;dr: pruning might bring a small improvement.

So I did not actually train a full model with pruning; what I did was replace the layers in the alternative WaveRNN with sparse layers and look at the inference speed for various levels of sparsity. If you want a training script with pruning, I'd recommend looking at fatchord's repo.
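
For anyone curious what such a timing check could look like, here is a rough, self-contained sketch that compares dense addmm against torch.sparse.addmm for a single matrix-vector product at a few sparsity levels. The matrix size, repetition count, and magnitude-based pruning are assumptions for illustration, not the exact setup from the thesis:

```python
import time
import torch

def benchmark(sparsity, dim=512, reps=200):
    """Time one dense vs. sparse matrix-vector addmm at a given sparsity level."""
    weight = torch.randn(dim, dim)
    # Zero out the smallest-magnitude weights to reach the target sparsity.
    k = int(sparsity * weight.numel())
    threshold = weight.abs().flatten().kthvalue(k).values
    pruned = torch.where(weight.abs() > threshold, weight, torch.zeros_like(weight))
    sparse_weight = pruned.to_sparse()

    bias = torch.randn(dim, 1)
    vec = torch.randn(dim, 1)

    t0 = time.perf_counter()
    for _ in range(reps):
        torch.addmm(bias, weight, vec)                # dense path
    dense_t = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(reps):
        torch.sparse.addmm(bias, sparse_weight, vec)  # sparse path
    sparse_t = time.perf_counter() - t0
    return dense_t, sparse_t

for s in (0.90, 0.964, 0.978):
    dense_t, sparse_t = benchmark(s)
    print(f"sparsity {s:.1%}: dense {dense_t*1e3:.1f} ms vs sparse {sparse_t*1e3:.1f} ms")
```

The break-even point will vary with hardware and PyTorch version, so the ~91% figure quoted above should be read as specific to the setup used in the thesis.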
