
Training stuck #71

Open
Madhavan0123 opened this issue Nov 14, 2023 · 18 comments

Comments

@Madhavan0123

Hello,

Thanks for all the effort you put into this repo. When I launch training, it runs for a few steps and then I see no progress at all. It has been stuck for a long time with no sign of movement.
INFO:ljs_base:Saving model and optimizer state at iteration 1 to ./logs/ljs_base/G_0.pth
INFO:ljs_base:Saving model and optimizer state at iteration 1 to ./logs/ljs_base/D_0.pth
INFO:ljs_base:Saving model and optimizer state at iteration 1 to ./logs/ljs_base/DUR_0.pth
Loading train data: 4%|████████████▍

Have you encountered this before? Any help would be greatly appreciated.

@p0p4k
Owner

p0p4k commented Nov 15, 2023

Hi, temporarily turn off duration discriminator and tell me if it works.
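A minimal sketch of what "turning off the duration discriminator" could look like at the config level. Note that the key name `use_duration_discriminator` is an assumption for illustration, not necessarily this repo's actual field; check your config JSON for the real flag.

```python
# Hypothetical sketch: patch the training config to disable the duration
# discriminator before the models are built. The flag name below is an
# assumption, not confirmed against the repo's config schema.
import json

def disable_duration_discriminator(config: dict) -> dict:
    """Return a copy of the config with the duration discriminator turned off."""
    patched = dict(config)
    model_cfg = dict(patched.get("model", {}))
    model_cfg["use_duration_discriminator"] = False  # assumed flag name
    patched["model"] = model_cfg
    return patched

cfg = {"model": {"use_duration_discriminator": True}}
print(json.dumps(disable_duration_discriminator(cfg)))
```

The copy-before-patch keeps the original config dict untouched, so you can flip the flag back for a later run.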

@Madhavan0123
Author

Yes, it seems to be working for now. Any idea why the duration discriminator is causing the issue?

@p0p4k
Owner

p0p4k commented Nov 15, 2023

I feel my implementation was too naive; it might need correcting with some testing. I'm busy with other models now and will do it when I have some time. Let me know about the audio quality after you train. Thanks.

@CreepJoye

Hello, thank you for your great effort!
I'm running into the same problem. Do you plan to fix it soon, or are you still busy with other models?

@p0p4k
Owner

p0p4k commented Dec 12, 2023

I have moved to improving pflowtts.

@JohnHerry

> I have moved to improving pflowtts.

Hi, p0p4k,
How is pflowtts coming along now? Is it a better choice than vits2? Can it support both normal TTS and zero-shot TTS?

@p0p4k
Owner

p0p4k commented Jan 19, 2024

I think it's better than vits/vits2. The only downside is that it's not end-to-end (e2e).

@JohnHerry

> I think it's better than vits/vits2. The only downside is that it's not end-to-end (e2e).

OK, thank you.

@codeghees

@p0p4k do you know the bug here?

@p0p4k
Owner

p0p4k commented Mar 14, 2024

@codeghees Which part? The training-stuck part?

@codeghees

yep

@codeghees

In the same boat.

@p0p4k
Owner

p0p4k commented Mar 14, 2024

@codeghees I haven't looked into this personally because I don't have a GPU yet. Maybe you can try to debug it and send a PR; I can assist you. Thanks a lot!

@codeghees

Yep, will do! Trying to debug this.

@codeghees

@p0p4k the bug is on the line
scaler.scale(loss_gen_all).backward()

Seems like GradScaler has issues with multi-GPU. I removed it and replaced it with a standard backward pass, but the issue persists. Looks like a multi-GPU issue.
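For reference, a minimal sketch of the two backward variants being compared here: the AMP path through `torch.cuda.amp.GradScaler` (as in the training loop's `scaler.scale(loss_gen_all).backward()`), and the plain backward pass swapped in above. This is a toy CPU example with `enabled=False`, not the repo's actual training step; with scaling disabled, `scale()` and `update()` are no-ops and `step()` just calls the optimizer.

```python
# Sketch (assumes PyTorch): AMP-style step with GradScaler vs. a plain
# backward pass. Toy model on CPU with scaling disabled, for illustration.
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=False)  # no-op scaling on CPU

x, y = torch.randn(8, 4), torch.randn(8, 1)

# AMP-style step, like the train loop's scaler.scale(loss_gen_all).backward():
opt.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()
scaler.step(opt)      # unscales grads (when enabled) and calls opt.step()
scaler.update()

# Plain fp32 fallback, the replacement tried above:
opt.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
```

Since the hang persists with plain backprop, the scaler itself is likely not the culprit; a mismatch in collective calls across ranks is a more common cause of DDP hangs.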

@p0p4k
Owner

p0p4k commented Mar 21, 2024

Does it work on a single GPU?

@codeghees

Haven't tested yet. Trying a run with fp16 enabled.

@farzanehnakhaee70

@p0p4k I have no issues with single-GPU training, but it gets stuck when I train on multiple GPUs. Any success in resolving the issue?
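For anyone debugging this: a generic sketch (not from this repo) for getting more signal on where a multi-GPU run hangs. The `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` environment variables are real NCCL/PyTorch knobs; the launch command is whatever you normally use.

```shell
# Hedged debugging sketch for a multi-GPU hang: turn on NCCL and
# torch.distributed diagnostics before relaunching training.
export NCCL_DEBUG=INFO                 # NCCL prints collective setup and errors
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # PyTorch (>= 1.10) logs mismatched collectives
# Relaunch your usual training command. If it hangs again, dump the Python
# stacks of the live trainer process to see which rank is blocked where
# (requires py-spy):
#   py-spy dump --pid <trainer pid>
```

If `TORCH_DISTRIBUTED_DEBUG=DETAIL` reports ranks disagreeing on a collective, that points at ranks taking different code paths (e.g. one rank skipping a step the others run), which matches the symptoms described in this thread.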
