Training with lower precision crashes with runtime error #96

Closed
Laope94 opened this issue Jun 6, 2023 · 8 comments

Laope94 commented Jun 6, 2023

One more issue.
I can't use any precision other than 32. BF16 doesn't seem to be possible on the card I'm training on (TITAN RTX), but 16 should be. I'm getting this error:
RuntimeError: "fill_cuda" not implemented for 'ComplexHalf'.

To explain how I'm training: I built the Docker image as described in the readme (based on nvcr.io/nvidia/pytorch:22.03-py3).
I mounted the dataset, installed git inside the running container, and cloned this GitHub repo. Then I ran build_monotonic_align.sh and attempted to start training, this time on a single GPU.

First, this issue https://stackoverflow.com/questions/75834134/attributeerror-trainer-object-has-no-attribute-lr-find stopped me, so I downgraded lightning to 1.9. There is no version pinned in the Dockerfile, so the newest gets installed.

But then I got the error above. I've also tried modifying piper_train/vits/config.py and setting fp16_run: bool = False to True, but with no success. The only thing I could find for VITS is this issue jaywalnut310/vits#15, which suggests several possible code modifications. I'd appreciate it if someone could take a look and point me to what I should modify. CUDA is 12.1.
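For reference, this is roughly how I'm requesting 16-bit precision (a simplified sketch, not the actual piper_train entry point; everything besides the precision flag is a placeholder):

```python
# Simplified sketch, not the actual piper_train entry point.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,      # 32 works; 16 is what crashes with the ComplexHalf error
    max_epochs=10000,  # placeholder value
)
# trainer.fit(model, datamodule)  # model/datamodule come from piper_train

# And the field I flipped in piper_train/vits/config.py:
#     fp16_run: bool = False   # -> True (did not help)
```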


Laope94 commented Jun 6, 2023

Okay, I'll answer this one myself. It seems that downgrading a few of the packages bundled with nvcr.io/nvidia/pytorch:22.03-py3 helps.

This works for me:
torch~=1.6
pytorch-lightning~=1.7
torchtext~=0.6 (not sure about this one, but pip was complaining).
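As a quick sanity check (my own snippet, nothing from the repo), something like this confirms which versions actually end up being imported inside the container:

```python
# Quick check that the downgraded packages are the ones being imported.
import torch
import pytorch_lightning as pl
import torchtext

print("torch:", torch.__version__)            # should match the torch~=1.6 pin above
print("pytorch-lightning:", pl.__version__)   # should match the ~=1.7 pin
print("torchtext:", torchtext.__version__)    # should match the ~=0.6 pin
print("CUDA available:", torch.cuda.is_available())
```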

synesthesiam (Member) commented

I would really like to understand how this can be fixed. I've tried the code modifications you mentioned before and none of them worked. I'm worried that the "fix" may require a complete rewrite of the VITS code :(

beqabeqa473 commented

@Laope94 were you able to make any progress with this?


Laope94 commented Jun 12, 2023

@Laope94 were you able to make any progress with this?

Yes and no. It seems to be a torch issue, though. As I stated above, using torch 1.6 works, but I haven't achieved the result I wanted. I'm not able to fit a batch bigger than 16 into memory, and I hoped that lowering the precision could help a bit, but it isn't really effective, so I've continued with fp32 and a batch size of 16.


beqabeqa473 commented Jun 12, 2023 via email


Laope94 commented Jun 14, 2023

I am giving a batch size of 24 to my 12 GB GPU

I have a GPU with 24 GB available, but I'm not able to fit any batch size higher than 16 (medium model, 22 kHz audio); it crashes with a CUDA OOM every time. Even after numerous attempts with fp16, mixed precision, different library versions, max_split_size_mb, the cudaMallocAsync backend... just no, so I'm working with the smaller batch size.
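For reference, this is roughly how I set the allocator options I mentioned (the max_split_size_mb value is just one of several I experimented with; the variable has to be set before CUDA is initialised):

```python
import os

# Must be set before torch initialises CUDA (e.g. at the very top of the script).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
# or the async allocator backend instead:
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

import torch

props = torch.cuda.get_device_properties(0)
print(props.name, props.total_memory // 2**20, "MiB")
```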

synesthesiam (Member) commented

@Laope94 This is likely because you have a few very long sentences in your training data. Because batches have to be padded out to the longest sentence length, these will cause OOM crashes.

There is a --max-phoneme-ids <N> option you can pass to the training script, which will drop sentences longer than N phoneme ids (and print how many were dropped). I usually set this to 400 so I can ensure a batch size of 32 on my RTX 3090s (24 GB).
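Roughly, the effect is this (a simplified sketch, not the actual piper_train collate code):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def filter_and_collate(phoneme_id_seqs, max_phoneme_ids=400):
    """Drop utterances longer than max_phoneme_ids, then pad the rest."""
    kept = [seq for seq in phoneme_id_seqs if len(seq) <= max_phoneme_ids]
    dropped = len(phoneme_id_seqs) - len(kept)
    if dropped:
        print(f"Dropped {dropped} utterance(s) over {max_phoneme_ids} phoneme ids")
    # Every sequence is padded up to the longest one left in the batch, so a
    # single very long outlier inflates the memory used by the whole batch.
    return pad_sequence([torch.as_tensor(seq) for seq in kept], batch_first=True)

# One 1500-id outlier would force a (batch, 1500) tensor instead of (batch, ~200).
```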


Laope94 commented Jun 16, 2023

@synesthesiam thanks, it seems that this really did the trick. I've tried a few different values and I can go up to 700 comfortably. I could maybe use an even bigger batch size, but 32 is sufficient for me.
