Inference/recipe not working properly. #640
Comments
You can check whether it is about r: you can init the model with the default r value, then change it to 2 and run the inference. |
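For illustration, a minimal sketch of that r override, assuming the Synthesizer API of Coqui TTS around v0.1.x; attribute names such as `tts_model` and `decoder.set_r`, and all paths, are assumptions here and may differ between versions.

```python
# Sketch only: load a checkpoint with its default r, then override r before inference.
# `tts_model` and `decoder.set_r` are assumed names; check your TTS version.
from TTS.utils.synthesizer import Synthesizer

synth = Synthesizer(
    tts_checkpoint="output/run/checkpoint_280000.pth.tar",  # hypothetical paths
    tts_config_path="output/run/config.json",
    use_cuda=False,
)

# Force the reduction factor the fine decoder ended training with.
synth.tts_model.decoder.set_r(2)

wav = synth.tts("The quick brown fox jumps over the lazy dog.")
synth.save_wav(wav, "test_r2.wav")
```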
Ok, I have done that now. It appears that the r-value is not the issue, as it sounds the same. Is it normal for inference to sound like this at these points in training, or do you think there is another issue occurring? |
Try disabling mixed precision. Also, you can check the working released models and try copying their config for your run. Maybe I missed something while I was updating TTS to the new Trainer API. |
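As an illustration of "copying their config", something like the following pulls a released model's files so its config can be diffed against the recipe config. The ModelManager call reflects the Coqui TTS API of roughly that era; its constructor and return values are assumptions and may differ between versions.

```python
# Rough sketch: fetch the released Tacotron2-DDC model and locate its config.
# ModelManager's constructor/return values may differ across TTS versions.
from TTS.utils.manage import ModelManager

manager = ModelManager()
model_path, config_path, model_item = manager.download_model(
    "tts_models/en/ljspeech/tacotron2-DDC"
)

print("released checkpoint:", model_path)
print("released config:    ", config_path)  # diff this against your recipe config
```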
The recipe that was used already has mixed precision set to false. I tried using the pre-trained model's config for this model instead of the recipe config and attempted inference in a variety of ways with this other config to make it work, but was only able to get this.
Update: |
You need to retrain with the new config, especially if there are different audio parameters. It is not enough to change it only for inference. |
Okay, but the pretrained config has double decoder consistency disabled for some reason... so do I enable that and keep everything else the same in the config? |
You can enable it and keep the rest the same. It is disabled since the 2nd decoder is removed to reduce the model size. |
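For concreteness, a sketch of that change, treating the released config as plain JSON. The field names (`double_decoder_consistency`, `ddc_r`) are assumed from the Tacotron2-DDC recipes of that era, and the file paths are hypothetical.

```python
# Minimal sketch: re-enable double decoder consistency in the pretrained config
# and keep everything else the same. Paths and field names are assumptions.
import json

with open("pretrained_tacotron2_ddc_config.json") as f:
    cfg = json.load(f)

cfg["double_decoder_consistency"] = True  # re-enable DDC for training
cfg.setdefault("ddc_r", 6)                # coarse-decoder reduction factor, if missing

with open("train_config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```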
Okay, attempting that now. At what point would you recommend attempting inference and checking audio quality? The total is 1,000 epochs. |
In general, after 20k steps, it should start producing understandable speech. |
It's now at about 100K steps (training with the pretrained model config; the only change made is setting DDC to true).
output20k.mp4
output60k.mp4
output80k.mp4 |
Why is there no image on the Tensorboard? There should be alignment images. |
But looking at the audio samples, it looks like something is broken in the model or the configs you use. Also, please share the alignment images, so we can see whether the bug is in the inference or the training code. |
I never knew the tensorboard was supposed to show alignment images; is it supposed to show spectrogram images as well? I'm not sure where I would find those images at all. This is the tensorboard command used, does it seem correct to you?: |
For two weeks I have had the same type of problems with Tacotron2-DDC inference. My models trained with version 0.1.2 look fine in Tensorboard and the audio in Tensorboard is intelligible, but the inference audio is broken. Until now I searched for errors in my settings, but the present issue description by Billy Bob makes me think that there is really a problem with inference. My understanding is that the models released in the past should work with the latest Coqui-TTS versions. Therefore I did some inference tests with the Tacotron2-DDC LJSpeech model released in April 2021. I used the following script
and started with version 0.0.12 (git checkout a53958a). It works as expected. Here are the logs, the signal-figure and the sound:
ljspeech_v0.0.12.mp4
Version 0.0.13 (git checkout f02f033) also works fine. In version 0.0.14 (git checkout 5482a0f) the following error is reported:
I was not able to debug this problem and could not check whether the inference is working. Version 0.0.15 (git checkout b8b79a5) shows no errors in the logs, but the sound is bad.
ljspeech_v0.0.15.mp4
Same results for version 0.0.15.1 (git checkout d245b5d). Versions 0.1.0 (git checkout c25a218), 0.1.1 (git checkout 676d22f), 0.1.2 (git checkout 8fbadad) and main show a warning that the … Here are the logs, the signal-figure and the sound for the latest version 0.1.2:
ljspeech_v0.1.2.mp4
I hope my report helps to solve the problem. |
Yes, it should show all of these. Why don't you just run tensorboard locally? Maybe uploading breaks things. |
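For reference, TensorBoard can be launched locally against the training output folder from Python as well; this uses the standard tensorboard package, and the log directory path here is hypothetical.

```python
# Start a local TensorBoard server pointed at the run's output folder.
from tensorboard import program

tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", "output/ljspeech-ddc-run"])  # hypothetical path
url = tb.launch()
print(f"TensorBoard running at {url}")  # alignment/spectrogram figures appear under the Images tab
```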
@mbarnig very helpful!! Thx for going under the hood. So it looks like we have something wrong after 0.15. I'll check and try to find that little 🐛 |
To complete my report, I did some inference tests in version 0.1.2 with the other released English models. Here are my findings:
GlowTTS LJSpeech
glowtts-ljspeech_v0.1.2.mp4
Tacotron2-DCA LJSpeech
I changed the …
tacotron2-dca-ljspeech_v0.1.2.mp4
I think that this audio also has some problems, but I was not able to compare it with the released version.
SpeedySpeech LJSpeech
I changed the …
Fails with a …
SC-GlowTTS VCTK
I changed the …
Fails with …
Running SpeedySpeech and SC-GlowTTS-VCTK in earlier versions also fails, but with other errors. |
I found the alignment images, and this is how it looks at 180K. Also, this is the test audio that I found for 187K:
july13training_step187k.mp4
Definitely very different audio and finally understandable! |
The alignment looks good enough. I guess the issue you experience is due to a bug in the stopnet (which decides when the model should stop). I am working on it and will release the fix soon. Until then, just keep training the model; after the release, you can continue training with the new version and fix the stopnet. |
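To illustrate what the stopnet does, here is generic Tacotron-style pseudologic written for this write-up; it is not the actual Coqui TTS implementation, and `decoder_step` is a hypothetical callable.

```python
# Illustrative only: an autoregressive decoder loop gated by a stopnet.
# A buggy stopnet either never fires (trailing babble) or fires too early
# (clipped audio), which matches the symptoms described above.
import torch

def decode(decoder_step, max_steps=10000, stop_threshold=0.5):
    frames, state = [], None
    for _ in range(max_steps):
        frame, stop_logit, state = decoder_step(state)  # one mel frame per step
        frames.append(frame)
        if torch.sigmoid(stop_logit) > stop_threshold:  # stopnet says "end of utterance"
            break
    return torch.stack(frames)
```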
@erogol Thank you for pushing the new update! I'm currently training with the same config used at the beginning of this issue, except I'm now training using your new V0.1.3 to see if the stop net was the problem and if I can finally replicate the inference quality in the current pretrained voice. |
Hello @erogol! So I have made a few interesting observations lately regarding this issue: when I run inference with the model I trained on your pretrained config, along with the stopnet fix, I definitely get significantly better results compared to the one I trained before the stopnet fix, and this difference is clear regardless of the vocoder used. That being said, when inferencing this better DDC model with the pretrained hifi-gan as the vocoder, the quality is significantly worse than with both multi-band melgan and griffin-lim, whereas from my experience it should be significantly better than griffin-lim and at least similar to multi-band melgan. Here is audio of my model inferencing with griffin-lim, multi-band melgan, and finally hifi-gan:
50k_default.mp4
50k_multiband_melgan.mp4
Loudness warning: 50k_hifigan.mp4
On the contrary, the pretrained DDC model that comes with Coqui works perfectly well with hifi-gan, but seems to have significantly worse audio quality when inferenced with multi-band melgan. The opposite trend is occurring compared to the DDC model I trained with the supposedly same config. Funnily enough, the high-pitch artifacts in the audio seem similar to what happens when I use my own DDC checkpoint with hifi-gan. Here is audio of that poor inference quality occurring with the Coqui pretrained DDC + multi-band melgan:
output.mp4
So overall, the stopnet fix definitely helped, but unfortunately there is something more going on here that is preventing my Tacotron2-DDC model from reaching the expected quality that Tacotron2-DDC + hifi-gan has to offer. |
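One way to reproduce this kind of A/B vocoder comparison is to swap vocoders on the Synthesizer at inference time. This is a sketch: the parameter names `vocoder_checkpoint`/`vocoder_config` are assumed from that era's API, all paths are hypothetical, and Griffin-Lim is used when no vocoder checkpoint is given.

```python
# Sketch: synthesize the same sentence with the trained Tacotron2-DDC model
# and different vocoders, for an A/B listen. Paths and kwarg names are assumptions.
from TTS.utils.synthesizer import Synthesizer

SENTENCE = "This is a vocoder comparison sentence."

for name, voc_ckpt, voc_cfg in [
    ("griffin_lim", "", ""),  # no vocoder -> Griffin-Lim fallback
    ("hifigan", "vocoders/hifigan/model.pth.tar", "vocoders/hifigan/config.json"),
]:
    synth = Synthesizer(
        tts_checkpoint="output/run/checkpoint_50000.pth.tar",
        tts_config_path="output/run/config.json",
        vocoder_checkpoint=voc_ckpt,
        vocoder_config=voc_cfg,
    )
    wav = synth.tts(SENTENCE)
    synth.save_wav(wav, f"50k_{name}.wav")
```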
Thanks for the update. The vocoder model should match the audio parameters of the TTS model. Have you checked? |
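One quick way to check that, using only plain JSON (the file names here are hypothetical), is to print every key in the two configs' audio sections that does not match.

```python
# Compare the "audio" section of the TTS config and the vocoder config;
# mismatches (sample_rate, fft_size, hop_length, mel ranges, stats_path, ...)
# can produce exactly this kind of screeching output.
import json

def audio_params(path):
    with open(path) as f:
        return json.load(f).get("audio", {})

tts_audio = audio_params("tacotron2_ddc_config.json")  # hypothetical file names
voc_audio = audio_params("hifigan_config.json")

for key in sorted(set(tts_audio) | set(voc_audio)):
    if tts_audio.get(key) != voc_audio.get(key):
        print(f"{key}: tts={tts_audio.get(key)!r} vocoder={voc_audio.get(key)!r}")
```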
Ah, it seems like that may be the problem; it looks like I was using a slightly different config, and I also noticed some things that might've affected training. I'm going to train from scratch again and make sure everything is correct. I'll post an update on how inference sounds with hifi-gan within the next few days. |
Ok, this time I double-checked that all audio parameters were the same across both hifi-gan and Tacotron2-DDC, made sure I was using the new stopnet code, and correctly ran the scale_stats.py computation, which I had some mishaps with before. With all of this, I trained a model with the config to about 80K steps and tried inferencing it again with hifi-gan. Unfortunately, that screeching high-pitch feature is still strongly present, and it doesn't seem to have improved in that regard. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels. |
Discussed in #639
Originally posted by BillyBobQuebec July 10, 2021
I am training Tacotron2-DDC (LJ) from scratch using the recipe provided, with no changes. Tensorboard looks good to my eyes, but the alignment and duration seem to be way off when I actually try inferencing the audio. I suspect it's a problem with the inference being run improperly, specifically the r-value it is attempting to inference with. Here is the command that I used to initiate training:
Since the recipe uses gradual training, which uses "r" as the starting value for the fine decoder (if I understand it correctly) but then changes it over time during training, I suspect it's using the starting r value during inference instead of the latest r-value the fine decoder was at during training (a sketch of how such a schedule maps steps to r follows the audio samples below). When I try to force a different r value for inference (by passing a recipe config with "r": 2, instead of "r": 6,) it gives me this error:
Here's the command used for inferencing, and here's how it sounds at different points:
280k.mp4
110k.mp4
50k.mp4
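As an illustration of the gradual-training point above: the schedule maps a global step to a reduction factor r. The triplet format [start_step, r, batch_size] and the values below are assumptions for illustration, not the recipe's actual schedule.

```python
# Illustrative only: how a gradual-training schedule maps a training step to r.
# Format and values are assumed, not taken from the actual recipe config.
GRADUAL_TRAINING = [
    [0,      6, 64],
    [10000,  4, 32],
    [50000,  3, 32],
    [130000, 2, 32],
]

def r_at_step(step, schedule=GRADUAL_TRAINING):
    """Return the reduction factor the fine decoder uses at a given training step."""
    r = schedule[0][1]
    for start_step, sched_r, _batch_size in schedule:
        if step >= start_step:
            r = sched_r
    return r

# A checkpoint saved late in training was produced with a smaller r than the
# config's starting value, which is the mismatch suspected above.
print(r_at_step(280_000))  # -> 2 under this assumed schedule
```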