
Using the melspectrogram as the input of the WaveNet vocoder seems to fail #128

Closed
mazzzystar opened this issue Mar 12, 2019 · 15 comments

mazzzystar commented Mar 12, 2019

Thanks for your work!
I used the TTS melspectrogram output directly as the input of r9y9's wavenet_vocoder pretrained model in order to get better quality, but it turns out that the WaveNet vocoder works fine on the ground-truth melspectrogram, yet performs badly on the TTS melspectrogram output.

I noticed you've also tried this and met a similar situation, and you think it's because the quality of the TTS melspectrogram is not as high as required. Could you please explain that in more detail?

So, as a conclusion, currently the most feasible vocoder to combine with TTS is:

Is that right? I'm still confused about why I can get relatively good synthesized audio from the TTS mel/linear spectrogram with Griffin-Lim, but using the melspectrogram as the input of the WaveNet vocoder gives worse results. Below are the comparison samples.
gl_wv.zip
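
For reference, this is roughly how the Griffin-Lim baseline can be produced. A minimal sketch, assuming librosa is available and that linear_spec is the denormalized magnitude spectrogram coming out of TTS with shape [time, freq]; the function name and parameter values are illustrative, not the exact TTS code:

import librosa
import numpy as np

def griffin_lim_invert(linear_spec, n_fft=1024, hop_length=256, n_iter=60):
    # librosa expects [freq, time]; TTS models usually emit [time, freq]
    S = linear_spec.T.astype(np.float32)
    # iterative phase reconstruction from the magnitude spectrogram
    return librosa.griffinlim(S, n_iter=n_iter, hop_length=hop_length, win_length=n_fft)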

erogol commented Mar 12, 2019

@mazzzystar I've never tried WaveNet. My work has been around WaveRNN mostly.

You are right that spectrograms created by the current TTS master might not meet the needs of neural vocoders. The last shared model might have a better chance.

If you wait a bit more, I plan to share the model described in #26 soon. It works at least with WaveRNN; that's all I can fairly say.

erogol closed this as completed Mar 12, 2019
@OswaldoBornemann

@erogol Can't wait to see your results.

@mazzzystar

@erogol
I reviewed the code difference between the TTS:master and TTS:dev-taco2 branches, and I wonder whether the reason the TTS melspectrogram fails on the vocoder is that we optimize both the linear_loss and the mel_loss, as in the code below:

TTS/train.py

Lines 136 to 141 in 5acc9db

mel_loss = criterion(mel_output, mel_input, mel_lengths)
linear_loss = (1 - c.loss_weight) * criterion(linear_output, linear_input, mel_lengths) \
    + c.loss_weight * criterion(linear_output[:, :, :n_priority_freq],
                                linear_input[:, :, :n_priority_freq],
                                mel_lengths)
loss = mel_loss + linear_loss

If we only optimized the mel_loss, would it be possible for TTS to generate a high-quality melspectrogram?
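
A minimal sketch of what that mel-only variant would look like, reusing the variable names from the snippet above (a hypothetical change, not code from the repo):

# keep only the mel term and drop linear_loss entirely
mel_loss = criterion(mel_output, mel_input, mel_lengths)
loss = mel_loss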

mazzzystar commented Mar 13, 2019

And there is another question about the dev-taco2 branch.

TTS/train.py

Lines 126 to 132 in cf11b6c

stop_loss = criterion_st(stop_tokens, stop_targets)
decoder_loss = criterion(decoder_output, mel_input, mel_lengths)
if c.model == "Tacotron":
    postnet_loss = criterion(postnet_output, linear_input, mel_lengths)
else:
    postnet_loss = criterion(postnet_output, mel_input, mel_lengths)
loss = decoder_loss + postnet_loss

Here you compute the L1 loss for both decoder_output and postnet_output against mel_input at the same time. Can you explain why? As I understand it:

(1) decoder_output, stop_tokens, alignments = self.decoder(encoder_outputs, mel_specs, mask)
(2) postnet_output = self.postnet(decoder_output)
(3) postnet_output = decoder_output + postnet_output

So if we try to minimize L1loss(decoder_output, mel_input) and L1loss(postnet_output, mel_input) at the same time, then the postnet in line (2) should learn to output values as close to 0 as possible.


erogol commented Mar 13, 2019

@mazzzystar I once tried to optimize Tacotron for only the mel-spectrogram and couldn't get good results. But maybe there is room to investigate further.

@mazzzystar Why do you think line (2) should output 0? My feeling is that the postnet tries to learn the fine-grained information that is still missing right after the decoder. And if you compare these two outputs, you also see that there is a significant loss difference between postnet_output and decoder_output at inference time.


mazzzystar commented Mar 13, 2019

The reason I think that is that you calculate the loss on two different outputs against the same ground truth. To me it would seem reasonable to compare only the postnet output (code line (3)) with the true melspectrogram, and backpropagate that loss to update the whole encoder, decoder and postnet.

That is to say, if you already know that the output from the decoder is imperfect and will be refined by the postnet, why do you want L1loss(mel_input, decoder_output) to be as close to 0 as possible?


erogol commented Mar 13, 2019

@mazzzystar For Tacotron, the reason is that the linear output has too much redundancy, which makes it harder for the decoder to learn. Therefore it uses mel-spectrograms as the decoder output. Then the postnet only needs to learn to project the mel-spectrogram to the linear spectrogram, which is presumably an easier task. Also, you can swap the postnet for a better alternative at some point while keeping the rest the same.

For Tacotron2 the idea is similar but not the same. With the decoder we learn a rough spectrogram representation, which also enables the decoder to learn the alignment. Then we train the postnet to learn only the fine details. If we used a single loss function on the final network output, we couldn't force the network to have this kind of modularization. To be more concrete, here are the Tacotron2 outputs for the decoder and the postnet. It is visually clear what I mean.

Final output: [attached spectrogram image]

Postnet output: [attached spectrogram image]
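
To connect this back to the train.py snippet above, here is a minimal sketch of the two-loss setup being described (illustrative names and plain PyTorch functional losses, not the exact TTS code): the decoder output and the residual-refined postnet output are both compared against the ground-truth mel, so the decoder learns the coarse spectrogram and alignment while the postnet learns only the residual fine detail.

import torch
import torch.nn.functional as F

def tacotron2_losses(decoder_output, postnet_residual, mel_target):
    # residual connection: the postnet predicts a correction on top of the decoder output
    postnet_output = decoder_output + postnet_residual
    decoder_loss = F.l1_loss(decoder_output, mel_target)  # coarse prediction
    postnet_loss = F.l1_loss(postnet_output, mel_target)  # refined prediction
    return decoder_loss + postnet_loss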

@mazzzystar

Thanks for the clarification, I now get the main points of your idea.
So you mean that in Tacotron1, the Decoder tries to output the mel-spectrogram, and the PostCBHG only tries to project the mel-spectrogram to the linear spectrogram, so we need to make sure both <mel_output, mel_input> and <linear_output, linear_input> match.

While in Tacotron2, the Decoder is already good at getting the main part of the mel-spectrogram (not the linear spectrogram), so the Postnet here only adds some "texture" to get a better mel-spectrogram result, right?

If I want to feed the mel output to the vocoder (e.g., WaveNet), it's better to use the final output rather than the Decoder output, right?


erogol commented Mar 14, 2019

@mazzzystar Yep, all of that is true.

For Tacotron2, yes, you should use the final network output, but I've not tried the linear spectrograms of Tacotron1, so you might give that a try.
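
A minimal sketch of what that looks like at inference time; the method names, return order, and vocoder call here are illustrative assumptions, not the exact TTS or vocoder API:

# run the Tacotron2-style model and keep the postnet-refined output
decoder_output, postnet_output, alignments, stop_tokens = model.inference(text_ids)

# feed the final (postnet) mel output, not the raw decoder output, to the vocoder
wav = vocoder.generate(postnet_output)  # hypothetical vocoder call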


mazzzystar commented Mar 21, 2019

@erogol
Which kind of vocoder did you use when you concluded that spectrograms created by the current TTS master might not meet the needs of neural vocoders? I recently tried the TTS Tacotron2 branch with r9y9's wavenet_vocoder, and suddenly realized that I had set different preprocessing parameters for TTS and wavenet_vocoder. Have you experimented with exactly the same parameters for the two models?
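
A minimal sketch of the kind of sanity check that catches this kind of mismatch (a hypothetical helper; the key names are assumptions and both configs are assumed to be available as plain dicts):

# audio parameters both models must agree on before a TTS mel can drive the vocoder
SHARED_KEYS = ["sample_rate", "num_mels", "fft_size", "hop_length",
               "win_length", "mel_fmin", "mel_fmax"]

def check_audio_params(tts_audio_cfg, vocoder_hparams):
    for key in SHARED_KEYS:
        a, b = tts_audio_cfg.get(key), vocoder_hparams.get(key)
        if a != b:
            print(f"Mismatch in {key}: TTS={a}, vocoder={b}")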


erogol commented Mar 21, 2019

@mazzzystar Yes, exactly the same parameters. My vocoder fork is here: https://github.com/erogol/WaveRNN. You can pass the config used by TTS to WaveRNN and it works.

@mazzzystar

I will try to match the parameters of TTS/Tacotron2 with r9y9's wavenet_vocoder and report my results with the WaveNet vocoder.

@OswaldoBornemann

@erogol Could WaveRNN do a good job on 48 kHz wavs?


erogol commented Mar 22, 2019

@tsungruihon 16-22 kHz would be enough. I've never tried 48 kHz.

@m-hamza-mughal

@mazzzystar Can you share your findings with TTS/Tacotron2 and the WaveNet vocoder?
