strange noises in your samples && error when running inference.py #30

MorganCZY · 2019-11-11T08:20:03Z

Your samples at epoch 3200 have strange noises at unvoiced segments, while there is no such phenomenon in samples at epoch 1600.

Besides, when running inference.py, an error occurs, pointing to

melgan/model/generator.py

Line 68 in 8af1e9c

mel = torch.cat((mel, zero), axis=2)

torch.cat() takes a "dim" parameter rather than "axis".
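A minimal sketch of the call with the keyword changed, assuming tensors shaped (batch, mel channels, frames). The 80 channels, the frame counts, and the contents of `zero` are illustrative assumptions, not values taken from generator.py:

```python
import torch

# Hypothetical shapes for illustration: (batch, mel_channels, frames).
mel = torch.randn(1, 80, 25)
zero = torch.full((1, 80, 10), -11.5129)  # assumed block of padding frames

# Concatenate along the time axis (dimension 2) using the `dim` keyword.
padded = torch.cat((mel, zero), dim=2)
print(padded.shape)
```

The padded tensor keeps the batch and channel sizes and only grows along the frame dimension.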

seungwonpark · 2019-11-11T08:40:57Z

Fixed the latter issue, thank you!
Yes, I was aware of the noise issue. I also found that audio reconstructed from the mel-spectrogram of zero-filled audio (that is, a mel filled with -11.5129) is very noisy. I really need to solve this.
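For context, -11.5129 is what a natural log returns at a floor of 1e-5, so the value is consistent with a log-mel computed as log(max(x, 1e-5)). That floor is an assumption inferred from the number itself, not something stated in this thread. A quick check:

```python
import math

# Assumed floor applied before the log (hypothetical; inferred from -11.5129).
LOG_FLOOR = 1e-5

def log_mel_value(magnitude, floor=LOG_FLOOR):
    """Natural log of a magnitude clamped from below at `floor`."""
    return math.log(max(magnitude, floor))

# Zero-filled audio has zero magnitude everywhere, so every bin hits the floor.
silence = log_mel_value(0.0)
print(round(silence, 4))
```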

seungwonpark · 2019-11-11T09:15:15Z

zeromel.zip

Spectrogram of zero-filled audio reconstruction looks like this: the line noise appears every 4 frequency bins.

EDIT: total frequency bins are 512. So the pattern appears for every 4 bins, not 8. The y-axis of the figure below is wrong.

seungwonpark · 2019-11-11T09:38:00Z

I hope to fix it by matching the implementation details with the official implementation. See #17.

MorganCZY · 2019-11-11T10:01:22Z

I have trained and tested the official MelGAN repo. The synthesized samples have audible noise, and the overall quality is far below that of the official pretrained model.

seungwonpark · 2019-11-11T10:04:46Z

Oh, does that mean we need to use some tricks (that aren't shown in the paper) to train the model properly?

MorganCZY · 2019-11-11T10:08:01Z

I strongly suspect there are training tricks that are not shown in the official repo code. I opened an issue in their repo but haven't received a reply yet.

bob80333 · 2019-11-13T22:35:55Z

Checkerboard artifacts have been an issue with image GANs before; see this article: https://distill.pub/2016/deconv-checkerboard/

I think some of these audio artifacts may be related. The main ways to get rid of them were to replace strided conv layers with bilinear upsample/downsample + conv layers, or to ensure that kernel sizes were exact multiples of their strides. The discriminator here appears to have kernels of 41 with strides of 4. I wonder what would happen if we put a 4x bilinear downsample before those convs and set the stride to 1.
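To make the kernel/stride point concrete, here is a small sketch that counts how many strided windows touch each input position for a 1D convolution without padding. The input length of 400 and the comparison kernel of 40 are arbitrary choices for illustration. It only shows the uneven overlap the article describes; it does not demonstrate that the artifacts in this repo come from that effect.

```python
def coverage_counts(length, kernel, stride):
    """Count how many strided windows (no padding) touch each input position."""
    counts = [0] * length
    for start in range(0, length - kernel + 1, stride):
        for i in range(start, start + kernel):
            counts[i] += 1
    return counts

def interior_pattern(length, kernel, stride):
    """Distinct coverage counts away from the edges, where every window fits."""
    counts = coverage_counts(length, kernel, stride)
    return sorted(set(counts[kernel:length - kernel]))

# Kernel 41 is not a multiple of stride 4; kernel 40 is.
uneven = interior_pattern(400, kernel=41, stride=4)
even = interior_pattern(400, kernel=40, stride=4)
print(uneven, even)
```

With kernel 41 and stride 4, every fourth input sample is seen by one more window than its neighbours, while a kernel of 40 covers all interior samples equally.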

I'm going to try this out myself, but first I'm waiting for a model I'm training on part of VoxCeleb2 (the full dataset doesn't fit in my ssd) to hit 1M training steps before I try this to see if there's any improvement.

seungwonpark · 2019-11-14T04:19:49Z

Nice point, but isn't this a generator problem? The generator architecture doesn't seem to have that kind of kernel/stride mismatch; only the discriminator does.

bob80333 · 2019-11-14T18:46:16Z

At the end of that article, just before the conclusion, they found that discriminators with stride=2 in the first layer could also cause the generator to create checkerboard artifacts. The explanation was that some of the neurons in the generator get many times the gradient because of the striding in the discriminator, and that helps create the artifacts.

I don't know if that would apply to this audio GAN, but it seems like a fairly simple thing to check. I have modified the discriminator in my fork, and I will start a training run tonight to see if it helps.

bob80333 · 2019-11-15T15:01:25Z

Tested my fork out, the discriminator converges really fast, and the generator learns nothing.

Note the scales here

What the generator's output looks like:

Swapping from strided convolutions to downsampling appears to have made the discriminator much stronger, not sure how to fix that...

seungwonpark · 2019-11-15T16:08:12Z

I'm sorry to hear that.

Is using nn.Upsample for downsampling okay? The documentation says

If you want downsampling/general resizing, you should use interpolate

Thanks for sharing your results, by the way.
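For reference, the interpolate route that the documentation points to could look roughly like this for 4x downsampling of a 1D signal. The tensor shape, the linear mode, and the scale factor are assumptions for illustration; the settings used in the fork are not shown in this thread.

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of audio: (batch, channels, samples).
audio = torch.randn(2, 1, 1024)

# 4x linear downsampling; a stride-1 convolution could then follow.
down = F.interpolate(audio, scale_factor=0.25, mode="linear", align_corners=False)
print(down.shape)
```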

bob80333 · 2019-11-15T21:05:35Z

Oh! Nice catch, I missed that in the docs. I just fixed it in my fork; training is slightly better with this, but the discriminator still overpowers the generator quickly.

Discriminator converged in 2k steps rather than <500 steps.

geekboood · 2019-11-17T02:09:36Z

@bob80333 Hi, I am trying to train MelGAN on the CSMSC dataset, a single-speaker dataset of about 20 hours. My understanding is that the discriminator should converge quickly, because at the very beginning the generator's output is very poor and therefore easy to tell apart from real audio. If you run for more epochs, you may find that the generator's output improves at some point. Here is my TensorBoard log.

As you can see in the figure, the generator's loss was stuck at around 120 until 300k iterations, and after that it started to improve. At the same time, the discriminator's loss fluctuates a lot. I can hear something after 1.1M steps, but it still has some artifacts. Maybe I should wait until 2M iterations.
I also found that at the end of each audio clip there is a peak that produces noise.

bob80333 · 2019-11-17T02:33:01Z

Hey, thanks for the information! I have trained on my dataset (part of VoxCeleb2) with the current master branch for 1M steps and got this training curve:

The results were intelligible, but the voices had artifacts while speaking, which is why I commented on this issue with ideas to fix it. With the first modification I tried, I waited 80k steps, at which point the discriminator loss had dropped to 3.3e-5 and the generator was producing loud, high-pitched noise. I tried other approaches, but the discriminator converged very quickly again, and I didn't want to wait to see whether they failed, especially since my original training curve looked very different.

seungwonpark · 2019-12-02T05:21:42Z

I've trained the fix/17 branch for 14 days (more than 6400 epochs) on the LJSpeech-1.1 dataset, and the results don't have the strange noise in unvoiced segments! I'll soon upload new audio samples (with a pre-trained model, if possible) and merge the fix/17 branch into master.

seungwonpark · 2019-12-02T05:48:37Z

Issues that were initially discussed here are now resolved, but I loved the idea and countless trials of @bob80333 to improve the quality.
Feel free to have more discussion here, or you may want to open a new issue.

seungwonpark added a commit that referenced this issue Dec 2, 2019

fix #30, deploy changes to pytorch hub

7b747d1

seungwonpark closed this as completed in b6db549 Dec 2, 2019
