strange noises in your samples && error when running inference.py #30
Comments
The spectrogram of the zero-filled audio reconstruction looks like this: line noise appears every 4 frequency bins. EDIT: There are 512 frequency bins in total, so the pattern repeats every 4 bins, not 8. The y-axis of the figure below is wrong. |
I hope to fix it by matching the implementation details with the official implementation. See #17. |
I have trained and tested the official MelGAN repo. The synthesized samples contain audible noise, and the overall quality is far below that of the official pretrained model. |
Oh, does that mean we need to use some tricks (that aren't shown in the paper) to train the model properly? |
I highly doubt there are some training tricks that are not shown in the official repo's code. I left an issue at their repo, but haven't received a reply yet. |
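One way to read a line pattern with a fixed bin spacing, sketched under stated assumptions: a pattern that repeats in time with a fixed period puts energy only at evenly spaced frequency bins. The snippet below assumes a 1024-point FFT (which gives roughly 512 bins up to Nyquist) and a hypothetical repetition period of 256 samples. Both values are chosen for illustration only and are not taken from the repo, and this is not a diagnosis of the model.

```python
import cmath

N_FFT = 1024   # assumed FFT size (about 512 bins up to Nyquist)
PERIOD = 256   # hypothetical repetition period in samples

# Impulse train: one spike every PERIOD samples.
signal = [1.0 if t % PERIOD == 0 else 0.0 for t in range(N_FFT)]


def dft_magnitudes(x, n_bins):
    """Magnitudes of the first n_bins DFT bins, computed directly."""
    n = len(x)
    mags = []
    for k in range(n_bins):
        acc = 0j
        for t, value in enumerate(x):
            if value:  # skip zero samples, they contribute nothing
                acc += value * cmath.exp(-2j * cmath.pi * k * t / n)
        mags.append(abs(acc))
    return mags


mags = dft_magnitudes(signal, N_FFT // 2 + 1)
peak_bins = [k for k, m in enumerate(mags) if m > 1e-6]
print(peak_bins[:5])  # the spacing between peaks is N_FFT // PERIOD
```

With these assumed values the non-zero bins are spaced `N_FFT // PERIOD = 4` bins apart, which is the kind of regular line pattern described above.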
Checkerboard artifacts have been an issue with image GANs before, see this article: https://distill.pub/2016/deconv-checkerboard/ I think some of these audio artifacts may be related. The main ways to get rid of them were to replace strided conv layers with bilinear upsample/downsample + conv layers, or to ensure that kernel sizes were exact multiples of their strides. The discriminator here appears to have kernels of 41 with strides of 4. I wonder what would happen if we put a 4x bilinear downsample before those convs and set the stride to 1. I'm going to try this out myself, but first I'm waiting for a model I'm training on part of VoxCeleb2 (the full dataset doesn't fit on my SSD) to hit 1M training steps, so I can see whether there's any improvement. |
Nice point, but isn't this a problem with the generator? The generator architecture doesn't seem to have that kind of issue; only the discriminator does. |
At the end of that article, just before the conclusion, they found that discriminators with stride=2 in the first layer could also cause the generator to create checkerboard artifacts. The explanation was that some of the neurons in the generator get many times the gradient of others because of the striding in the discriminator, and that helps create the artifacts. I don't know if that would apply to this audio GAN, but it seems like a fairly simple thing to check. I have modified the discriminator in my fork, and I will start a training run tonight to see if it helps. |
I'm sorry to hear that. Is it okay to use nn.Upsample for downsampling? The documentation says
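A minimal sketch of the uneven-coverage argument, in plain Python rather than the repo's code: it counts how many windows of a strided convolution (no padding) include each input sample. The input length of 400 is a hypothetical value, and kernel 40 is included only as a comparison case where the kernel is an exact multiple of the stride.

```python
def coverage_counts(length, kernel, stride):
    """Count how many strided-conv windows (no padding) include each input sample."""
    counts = [0] * length
    for start in range(0, length - kernel + 1, stride):
        for pos in range(start, start + kernel):
            counts[pos] += 1
    return counts


def interior(counts, kernel):
    """Drop the edges, where the number of overlapping windows is still ramping."""
    return counts[kernel:len(counts) - kernel]


length = 400  # hypothetical input length, for illustration only

# Kernel 41 with stride 4, as mentioned for the discriminator in this thread.
uneven = interior(coverage_counts(length, kernel=41, stride=4), 41)
# Comparison case: kernel size is an exact multiple of the stride.
even = interior(coverage_counts(length, kernel=40, stride=4), 40)

print(sorted(set(uneven)))  # more than one distinct count: uneven coverage
print(sorted(set(even)))    # a single count: uniform coverage
```

In the 41/4 case every fourth sample is seen by one more window than its neighbours, so it can receive a slightly larger share of the gradient. Whether that difference is large enough to explain the audible artifacts here is not established by this sketch.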
Thanks for sharing your results, by the way. |
@bob80333 Hi, I tried to train MelGAN on the CSMSC dataset, which is a single-speaker dataset of about 20 hours. My understanding is that the discriminator should converge pretty fast, because at the very beginning the generator's output is very bad and therefore very easy to tell apart from real audio. If you run for more epochs, you may find that the generator's output improves at some point. Here is my tensorboard log. |
Hey, thanks for the information! I have trained on my dataset (part of VoxCeleb2) with the current master branch for 1M steps and got this training curve: The results were understandable, but the voices themselves had artifacts while speaking, which is why I commented in this issue with ideas to fix it. With the first modification I tried, I waited 80k steps, at which point the discriminator had reached a loss of 3.3e-5 and the generator was producing loud, high-pitched noises. I tried other approaches, but the discriminator converged really quickly again, and I didn't want to wait to see if they failed, especially since my original training curve was very different from that. |
I've trained with |
Issues that were initially discussed here are now resolved, but I loved the idea and countless trials of @bob80333 to improve the quality. |
Your samples at epoch 3200 have strange noises at unvoiced segments, while there is no such phenomenon in samples at epoch 1600.
Besides, when running inference.py, an error occurs, pointing to
melgan/model/generator.py
Line 68 in 8af1e9c