
Using the melspectrogram as the input of the WaveNet vocoder seems to fail #128

Closed
mazzzystar opened this issue Mar 12, 2019 · 15 comments

mazzzystar commented Mar 12, 2019

Thanks for your work!
I used the TTS melspectrogram output directly as the input of r9y9's wavenet_vocoder pretrained model in order to get better quality, but it turns out that the WaveNet vocoder works fine on the ground-truth melspectrogram, yet performs badly on the TTS melspectrogram output.

I noticed you've also tried this and met a similar situation, and you think it's because the quality of the TTS melspectrogram is not as high as required. Could you please explain that in more detail?

So, as a conclusion, currently the most feasible vocoder to combine with TTS is:

Is that right? I'm still confused about why I can get relatively good synthesized audio from the TTS mel/linear spectrogram with Griffin-Lim, but using the melspectrogram as the input of the WaveNet vocoder gives worse results. Below are the comparison samples.
gl_wv.zip
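
For reference, this is roughly how the Griffin-Lim baseline can be produced. A minimal sketch, assuming librosa is available and that linear_spec is the denormalized magnitude spectrogram coming out of TTS with shape [time, freq]; the function name and parameter values are illustrative, not the exact TTS code:

import librosa
import numpy as np

def griffin_lim_invert(linear_spec, n_fft=1024, hop_length=256, n_iter=60):
    # librosa expects [freq, time]; TTS models usually emit [time, freq]
    S = linear_spec.T.astype(np.float32)
    # iterative phase reconstruction from the magnitude spectrogram
    return librosa.griffinlim(S, n_iter=n_iter, hop_length=hop_length, win_length=n_fft)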

erogol commented Mar 12, 2019

@mazzzystar I've never tried WaveNet. My work has been around WaveRNN mostly.

You are right that spectrograms created by the current TTS master might not meet the needs of neural vocoders. The last shared model might have a better chance.

If you wait a bit more, I plan to share the model described in #26 soon. It works at least with WaveRNN; that's all I can fairly say.

erogol closed this as completed Mar 12, 2019
@OswaldoBornemann

@erogol Can't wait to see your results.

@mazzzystar

@erogol
I reviewed the code difference between the TTS:master and TTS:dev-taco2 branches, and I wonder whether the reason the TTS melspectrogram fails on the vocoder is that we optimize both the linear_loss and the mel_loss, as in the code below:

TTS/train.py

Lines 136 to 141 in 5acc9db

mel_loss = criterion(mel_output, mel_input, mel_lengths)
linear_loss = (1 - c.loss_weight) * criterion(linear_output, linear_input, mel_lengths) \
    + c.loss_weight * criterion(linear_output[:, :, :n_priority_freq],
                                linear_input[:, :, :n_priority_freq],
                                mel_lengths)
loss = mel_loss + linear_loss

If we only optimized the mel_loss, would it be possible for TTS to generate a high-quality melspectrogram?
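
A minimal sketch of what that mel-only variant would look like, reusing the variable names from the snippet above (a hypothetical change, not code from the repo):

# keep only the mel term and drop linear_loss entirely
mel_loss = criterion(mel_output, mel_input, mel_lengths)
loss = mel_loss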

mazzzystar commented Mar 13, 2019

And there is another question about the dev-taco2 branch.

TTS/train.py

Lines 126 to 132 in cf11b6c

stop_loss = criterion_st(stop_tokens, stop_targets)
decoder_loss = criterion(decoder_output, mel_input, mel_lengths)
if c.model == "Tacotron":
    postnet_loss = criterion(postnet_output, linear_input, mel_lengths)
else:
    postnet_loss = criterion(postnet_output, mel_input, mel_lengths)
loss = decoder_loss + postnet_loss

Here you compute the L1 loss for both decoder_output and postnet_output against mel_input at the same time. Can you explain why? As I understand it:

(1) decoder_output, stop_tokens, alignments = self.decoder(encoder_outputs, mel_specs, mask)
(2) postnet_output = self.postnet(decoder_output)
(3) postnet_output = decoder_output + postnet_output

So if we try to minimize L1loss(decoder_output, mel_input) and L1loss(postnet_output, mel_input) at the same time, then the postnet in line (2) should learn to output values as close to 0 as possible.


erogol commented Mar 13, 2019

@mazzzystar I once tried to optimize Tacotron for only the mel-spectrogram and couldn't get good results. But maybe there is room to investigate further.

@mazzzystar Why do you think line (2) should output 0? My feeling is that the postnet tries to learn the fine-grained information that is still missing right after the decoder. And if you compare these two outputs, you also see that there is a significant loss difference between postnet_output and decoder_output at inference time.


mazzzystar commented Mar 13, 2019

The reason I think that is that you calculate the loss on two different outputs against the same ground truth. To me it would seem reasonable to compare only the postnet output (code line (3)) with the true melspectrogram, and backpropagate that loss to update the whole encoder, decoder and postnet.

That is to say, if you already know that the output from the decoder is imperfect and will be refined by the postnet, why do you want L1loss(mel_input, decoder_output) to be as close to 0 as possible?


erogol commented Mar 13, 2019

@mazzzystar For Tacotron, the reason is that the linear output has too much redundancy, which makes it harder for the decoder to learn. Therefore it uses mel-spectrograms as the decoder output. Then the postnet only needs to learn to project the mel-spectrogram to the linear spectrogram, which is presumably an easier task. Also, you can swap the postnet for a better alternative at some point while keeping the rest the same.

For Tacotron2 the idea is similar but not the same. With the decoder we learn a rough spectrogram representation, which also enables the decoder to learn the alignment. Then we train the postnet to learn only the fine details. If we used a single loss function on the final network output, we couldn't force the network to have this kind of modularization. To be more concrete, here are the Tacotron2 outputs for the decoder and the postnet. It is visually clear what I mean.

Final output: [attached spectrogram image]

Postnet output: [attached spectrogram image]
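
To connect this back to the train.py snippet above, here is a minimal sketch of the two-loss setup being described (illustrative names and plain PyTorch functional losses, not the exact TTS code): the decoder output and the residual-refined postnet output are both compared against the ground-truth mel, so the decoder learns the coarse spectrogram and alignment while the postnet learns only the residual fine detail.

import torch
import torch.nn.functional as F

def tacotron2_losses(decoder_output, postnet_residual, mel_target):
    # residual connection: the postnet predicts a correction on top of the decoder output
    postnet_output = decoder_output + postnet_residual
    decoder_loss = F.l1_loss(decoder_output, mel_target)  # coarse prediction
    postnet_loss = F.l1_loss(postnet_output, mel_target)  # refined prediction
    return decoder_loss + postnet_loss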

@mazzzystar

Thanks for the clarification, I now get the main points of your idea.
So you mean that in Tacotron1, the Decoder tries to output the mel-spectrogram, and the PostCBHG only tries to project the mel-spectrogram to the linear spectrogram, so we need to make sure both <mel_output, mel_input> and <linear_output, linear_input> match.

While in Tacotron2, the Decoder is already good at getting the main part of the mel-spectrogram (not the linear spectrogram), so the Postnet here only adds some "texture" to get a better mel-spectrogram result, right?

If I want to feed the mel output to the vocoder (e.g., WaveNet), it's better to use the final output rather than the Decoder output, right?


erogol commented Mar 14, 2019

@mazzzystar Yep, all of that is true.

For Tacotron2, yes, you should use the final network output, but I've not tried the linear spectrograms of Tacotron1, so you might give that a try.
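
A minimal sketch of what that looks like at inference time; the method names, return order, and vocoder call here are illustrative assumptions, not the exact TTS or vocoder API:

# run the Tacotron2-style model and keep the postnet-refined output
decoder_output, postnet_output, alignments, stop_tokens = model.inference(text_ids)

# feed the final (postnet) mel output, not the raw decoder output, to the vocoder
wav = vocoder.generate(postnet_output)  # hypothetical vocoder call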


mazzzystar commented Mar 21, 2019

@erogol
Which kind of vocoder did you use when you concluded that spectrograms created by the current TTS master might not meet the needs of neural vocoders? I recently tried the TTS Tacotron2 branch with r9y9's wavenet_vocoder, and suddenly realized that I had set different preprocessing parameters for TTS and wavenet_vocoder. Have you experimented with exactly the same parameters for the two models?
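
A minimal sketch of the kind of sanity check that catches this kind of mismatch (a hypothetical helper; the key names are assumptions and both configs are assumed to be available as plain dicts):

# audio parameters both models must agree on before a TTS mel can drive the vocoder
SHARED_KEYS = ["sample_rate", "num_mels", "fft_size", "hop_length",
               "win_length", "mel_fmin", "mel_fmax"]

def check_audio_params(tts_audio_cfg, vocoder_hparams):
    for key in SHARED_KEYS:
        a, b = tts_audio_cfg.get(key), vocoder_hparams.get(key)
        if a != b:
            print(f"Mismatch in {key}: TTS={a}, vocoder={b}")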


erogol commented Mar 21, 2019

@mazzzystar Yes, exactly the same parameters. My vocoder fork is here: https://github.com/erogol/WaveRNN. You can pass the config used by TTS to WaveRNN and it works.

@mazzzystar

I will try to match the parameters of TTS/Tacotron2 with r9y9's wavenet_vocoder and report my results with the WaveNet vocoder.

@OswaldoBornemann

@erogol Could WaveRNN do a good job on 48 kHz wavs?


erogol commented Mar 22, 2019

@tsungruihon 16-22 kHz would be enough. I've never tried 48 kHz.

@m-hamza-mughal

@mazzzystar Can you share your findings with TTS/Tacotron2 and the WaveNet vocoder?
