Using the mel-spectrogram as the input of the WaveNet vocoder seems to fail #128
@mazzzystar I've never tried WaveNet. My work has been mostly around WaveRNN. You are right that spectrograms created by the current TTS master might not meet the needs of neural vocoders. Maybe the last shared model has a chance. If you wait a bit more, I plan to share the model described in #26 soon. It works at least with WaveRNN; that much I can fairly say.
@erogol Can't wait to see your results.
@erogol (quoting Lines 136 to 141 in 5acc9db)

If we only optimize the …
And there is another issue, at Lines 126 to 132 in cf11b6c:

Here you compute the L1 loss for decoder_output and postnet_output against mel_input at the same time. Can you explain why? As I understand it: …

So if we try to minimize the …
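For readers following along, here is a minimal sketch of the dual-loss pattern being questioned; the tensor names and shapes are illustrative, not the exact code at the lines referenced above:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes standing in for the tensors discussed above.
batch, n_frames, n_mels = 2, 100, 80
decoder_output = torch.randn(batch, n_frames, n_mels)  # coarse decoder mels
postnet_output = torch.randn(batch, n_frames, n_mels)  # postnet-refined mels
mel_input = torch.randn(batch, n_frames, n_mels)       # ground-truth mels

# Both predictions are compared against the SAME ground truth,
# and the two L1 terms are summed into one training loss.
decoder_loss = F.l1_loss(decoder_output, mel_input)
postnet_loss = F.l1_loss(postnet_output, mel_input)
loss = decoder_loss + postnet_loss
```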
@mazzzystar I once tried to optimize Tacotron for only the mel-spectrogram and couldn't get good results. But maybe there is room to investigate more. @mazzzystar Why do you think line 2 should output 0? My feeling is that the postnet tries to learn fine-grained information that is missing right after the decoder. And if you compare these two outputs, you also see that there is a significant loss difference between postnet_output and decoder_output at inference time.
The reason I think that is that you calculate the loss on two different outputs with the same ground truth. It's reasonable for me to compare only the … That is to say, if you already know that the output from …
@mazzzystar For Tacotron the reason was that the linear output has too much redundancy, and that redundancy makes it harder for the decoder to learn. Therefore it uses mel-spectrograms as the decoder output. Then the postnet only needs to learn to project mel-specs to linear, which is possibly an easier task. Also, you can replace the postnet at some point with a better alternative while keeping the rest the same. For Tacotron2 the idea is similar but not the same. With the decoder we learn a rough spectrogram representation, which also lets the decoder learn the alignment. Then we train the postnet to learn only the fine details. If we used a single loss function on the ultimate network output, we couldn't force the network to have this kind of modularization. To be more concrete, here are the Tacotron2 outputs for the decoder and the postnet; it is visually clear what I mean.
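A toy sketch of the modular split described above; this is not the repo's actual postnet, and the residual form and layer sizes are assumptions in the spirit of Tacotron2:

```python
import torch
import torch.nn as nn

class ToyPostnet(nn.Module):
    """Illustrative stand-in for a postnet: a small conv stack that
    predicts a residual correction on top of the decoder's rough mels."""
    def __init__(self, n_mels=80, channels=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, channels, kernel_size=5, padding=2),
            nn.Tanh(),
            nn.Conv1d(channels, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, decoder_output):
        # (batch, frames, mels) -> (batch, mels, frames) for Conv1d
        x = decoder_output.transpose(1, 2)
        residual = self.conv(x).transpose(1, 2)
        # The decoder keeps its rough spectrogram (and the alignment task),
        # while the postnet only has to model the fine-detail residual.
        return decoder_output + residual

postnet = ToyPostnet()
decoder_output = torch.randn(2, 100, 80)
postnet_output = postnet(decoder_output)  # supervised with its own loss term
```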
Thanks for the clarification, I now get some of your points. … If I want to feed the mel output to the vocoder, should I use the final network output?
@mazzzystar Yep, all is true. For Tacotron2, yes, you should use the final network output, but I've not tried the linear specs of Tacotron1, so you might give that a try.
@erogol …
@mazzzystar Yes, exactly the same parameters. My vocoder fork is here: https://github.com/erogol/WaveRNN. You can pass the config used by TTS to WaveRNN and it works.
I will try to match the parameters of TTS/Tacotron2 with r9y9's wavenet_vocoder and report my results with the WaveNet vocoder.
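To make "exactly the same parameters" concrete, here is a hedged sketch of the kind of audio settings that have to line up between the TTS model and the vocoder; the names and values below are illustrative defaults, not copied from either repo's config:

```python
# Illustrative audio settings. The point is that the vocoder must consume
# mels extracted with EXACTLY the same settings the TTS model was trained on.
TTS_AUDIO = {
    "sample_rate": 22050,
    "fft_size": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "num_mels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": 8000.0,
}

def assert_compatible(tts_cfg, vocoder_cfg):
    """Fail loudly if any audio parameter differs between the two models."""
    for key, value in tts_cfg.items():
        if vocoder_cfg.get(key) != value:
            raise ValueError(f"audio parameter mismatch on '{key}'")
```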
@erogol Could WaveRNN do a good job on 48 kHz wavs?
@tsungruihon 16-22 kHz would be enough. Never tried 48 kHz.
@mazzzystar Can you share your findings with TTS/Tacotron2 and the WaveNet vocoder?
Thanks for your work!

I use the tts mel-spectrogram output directly as the input of r9y9's wavenet_vocoder pretrained model in order to get better quality, but it turns out that the WaveNet vocoder works fine on ground-truth mel-spectrograms yet performs badly on the tts mel-spectrogram output.

I noticed you've also tried this and ran into a similar situation, and you think it's because the quality of the tts mel-spectrogram is not as high as required. Could you please explain that in more detail?

So, as a conclusion, currently the most promising vocoder to combine with tts is: …

Is that right? I'm still confused why I can get relatively good synthesized audio from the tts mel/linear spectrograms with Griffin-Lim (GL), but using the mel-spectrogram as the input of the WaveNet vocoder gives worse results. Below are the comparison samples.

gl_wv.zip
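For reference, a minimal Griffin-Lim baseline along the lines of the GL samples above; the file names and audio settings are assumptions, and the mel must be de-normalized back to the power scale librosa expects before inversion:

```python
import librosa
import numpy as np
import soundfile as sf

# Assumed file name; the mel is expected as a power-scale matrix of shape
# (n_mels, frames). Log- or normalized mels must be converted back first.
mel = np.load("tts_mel_output.npy")

# Griffin-Lim inversion via librosa; these settings must match the ones
# used to extract the mel (the values here are assumptions).
wav = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256, n_iter=60
)
sf.write("gl_output.wav", wav, 22050)
```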