Vocoder Training #40

Hello, very nice repo, especially the implementation of the d-vector speaker verification architecture.

Quick question about the vocoder model training: does the WaveRNN vocoder take the d-vector speaker embedding as an input, or is it just trained on all speakers in the training dataset without explicit conditioning? I'm curious about any vocoder experiments you've run in this regard.
Thanks. The vocoder is not conditioned on anything other than the mel spectrograms, as also stated in section 2.3 of the SV2TTS paper, and as you can see in the diagram there. I don't know of any explicit conditioning of the vocoder; I think all the necessary features are generated by the synthesizer alone. I did, however, think about indirect conditioning by finetuning all three models at once in a complete end-to-end training loop. Then I remembered that my implementation is Frankenstein-worthy with its mix of PyTorch and TensorFlow, and I quickly gave up on that idea.
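To make the interface concrete, here is roughly what vocoder inference looks like in this setup: the model receives nothing but the mel spectrogram produced by the synthesizer. The module and function names below follow this repo's layout as I recall it and may be approximate; the checkpoint path is a placeholder.

```python
# Hedged sketch: the only conditioning input to the vocoder is the mel
# spectrogram coming out of the synthesizer; no speaker embedding is passed.
# Module/function names follow this repo's layout from memory and may be
# approximate; the paths are placeholders.
import numpy as np
from vocoder import inference as vocoder

vocoder.load_model("vocoder/saved_models/pretrained.pt")
mel = np.load("example_mel.npy")   # (n_mels, T) spectrogram from the synthesizer
wav = vocoder.infer_waveform(mel)  # waveform generated from the mel alone
```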
Thanks for the clarification. I agree the mel spectrogram should contain sufficient information for the vocoder without explicit speaker conditioning. I just found that a universal WaveRNN vocoder, trained on LibriTTS, has been implemented here: mozilla/TTS#221. It uses the same fatchord WaveRNN model you used. The out-of-sample audio samples sound decently good: https://soundcloud.com/user-565970875/sets/universal-vocoder-with-wavernn. However, they still exhibit some small artifacts/noises, which fine-tuning on a small set of samples will hopefully fix.
I believe the part of my implementation that would need work is the synthesizer. Unfortunately, I don't have much time or GPU power to work on a better implementation. I had planned to look at that Mozilla implementation; maybe I'll find some interesting things there.
@CorentinJ, you mentioned in your thesis that you tried training the WaveRNN with a pruning algorithm, but that you didn't get a completely trained model due to time limitations. I wonder if you are still working on this part? If not, have you thought about releasing the related scripts?
Oh yeah, it's a bit unclear there. It says that I experimented with pruning and that I report my results in section 3.5.2, but the experimenting was more about seeing whether I could get the model to run faster by pruning, not about improving the quality. In Kalchbrenner et al. they mention improvements to both speed and quality, although quality seems to be their main concern. The conclusion of that paragraph, tl;dr: pruning might bring a small improvement. So I did not actually train a full model with pruning; what I did was replace the layers in the alternative WaveRNN with sparse layers and look at the inference speed for various levels of sparsity. If you want a training script with pruning, I'd recommend looking at fatchord's repo.
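Not the actual thesis experiment, but a minimal self-contained sketch of that kind of measurement: magnitude-prune a GRU-sized recurrent weight matrix at several sparsity levels, then compare a dense matmul against a sparse one. All sizes and iteration counts here are arbitrary stand-ins.

```python
# Illustrative benchmark only, not the thesis script: zero out all but the
# largest-magnitude weights, then time dense vs. sparse matrix-vector products.
import time
import torch

hidden = 512
w = torch.randn(3 * hidden, hidden)  # GRU-style recurrent weight matrix
h = torch.randn(hidden, 1)           # one hidden state: batch size 1, as in
                                     # autoregressive vocoder inference

for sparsity in (0.0, 0.5, 0.9, 0.95):
    # Magnitude pruning: keep only the k largest-|w| entries.
    k = max(1, int(w.numel() * (1.0 - sparsity)))
    thresh = w.abs().flatten().topk(k).values.min()
    w_pruned = torch.where(w.abs() >= thresh, w, torch.zeros_like(w))

    t0 = time.perf_counter()
    for _ in range(1000):
        y = w_pruned @ h             # dense kernel: zeros bring no speedup
    dense = time.perf_counter() - t0

    w_sparse = w_pruned.to_sparse()  # COO sparse tensor
    t0 = time.perf_counter()
    for _ in range(1000):
        y = torch.sparse.mm(w_sparse, h)  # sparse kernel: only pays off at
                                          # high sparsity, due to its overhead
    sparse = time.perf_counter() - t0

    print(f"sparsity={sparsity:.2f}  dense={dense:.4f}s  sparse={sparse:.4f}s")
```

Since dense kernels don't exploit zeros and sparse kernels carry indexing overhead, a sparse layer typically only wins at very high sparsity, which is consistent with the "small improvement" conclusion above.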
Regarding WaveRNN inference speed: maybe it would help to replace the GRU with SRU layers: https://github.com/taolei87/sru
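A hedged sketch of the suggested swap, assuming the `sru` package from the link above (`pip install sru`); the layer sizes here are arbitrary, not the ones used in this repo's WaveRNN.

```python
# Drop-in SRU recurrence as suggested above; sizes are illustrative only.
import torch
from sru import SRU

rnn = SRU(input_size=512, hidden_size=512, num_layers=2)
x = torch.randn(100, 1, 512)  # SRU expects (seq_len, batch, input_size)
output, state = rnn(x)        # output: (100, 1, 512)
```

One caveat: WaveRNN inference is sample-by-sample, so how much of SRU's speedup carries over to that strictly sequential setting is an open question.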