Training with lower precision crashes with runtime error #96
Okay, I'll answer this one myself, actually. It seems that downgrading a few of the packages bundled with nvcr.io/nvidia/pytorch:22.03-py3 helps. This works for me:
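The exact package list did not survive in this thread. As a comparison aid, a short script like the one below (not part of Piper; the package names are just a guess at which ones matter) prints the installed versions so a working container can be compared against a failing one:

```python
# Sketch: print versions of packages commonly involved in these crashes,
# so a known-good container can be compared against a failing one.
# The package names below are an assumption, not the original list.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("torch", "torchaudio", "pytorch-lightning", "numpy"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```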
I would really like to understand how this can be fixed. I've tried the code modifications you mentioned before, and none of them worked. I'm worried that the "fix" may require a complete rewrite of the VITS code :(
@Laope94 were you able to make any progress with this?
Yes and no. It seems to be a torch issue, though. As I stated above, using torch 1.6 works, but I haven't achieved the desired result. I am not able to fit a batch bigger than 16 into memory, and I hoped that lowering the precision could help a bit, but it's not really effective, so I've continued with fp32 and a batch size of 16.
I am giving a batch size of 24 to my 12 GB GPU.
I have a GPU with 24 GB available, but I am not able to fit any batch size higher than 16 (medium model, 22 kHz audio); it crashes with CUDA OOM every time, even after numerous attempts with fp16, mixed precision, different library versions, max_split_size_mb, the CUDA malloc async backend... Just no, so I am working with the smaller batch size.
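For anyone trying the same allocator knobs, they are set through the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation. A minimal sketch follows; the 128 MB value is only a placeholder, since the comment above doesn't say which values were tried:

```python
# Sketch: the allocator settings mentioned above (max_split_size_mb and the
# cudaMallocAsync backend) are configured via PYTORCH_CUDA_ALLOC_CONF.
# The variable must be set before the first CUDA allocation; 128 MB is only
# a placeholder value, not a recommendation.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
# Or, to try the async allocator backend instead:
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

import torch  # imported after setting the variable so the config takes effect

if torch.cuda.is_available():
    _ = torch.zeros(1, device="cuda")  # first allocation picks up the config
    print(torch.cuda.memory_allocated())
```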
@Laope94 This is likely because you have a few very long sentences in your training data. Because batches have to be padded out to the longest sentence length, these will cause OOM crashes. I have a
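(The comment above appears cut off in this thread. Judging from the reply below, which mentions a cutoff of 700, the idea is to drop unusually long utterances so batches don't pad out to extreme lengths. A minimal sketch of that kind of filter follows; the JSONL layout, the phoneme_ids field, and the 700 threshold are assumptions, not Piper's actual dataset format or options.)

```python
# Sketch: drop training utterances whose phoneme sequence exceeds a cutoff so
# padded batches stay within GPU memory. The JSONL layout and "phoneme_ids"
# field are assumptions about the dataset format; 700 is the value mentioned
# in the follow-up comment.
import json

MAX_LEN = 700

def filter_dataset(in_path: str, out_path: str) -> None:
    kept = dropped = 0
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            utt = json.loads(line)
            if len(utt.get("phoneme_ids", [])) <= MAX_LEN:
                dst.write(line)
                kept += 1
            else:
                dropped += 1
    print(f"kept {kept} utterances, dropped {dropped} longer than {MAX_LEN}")

filter_dataset("dataset.jsonl", "dataset.filtered.jsonl")
```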
@synesthesiam Thanks, it seems that this really did the trick. I've tried a few different values and I can go up to 700 comfortably. I might be able to use an even bigger batch size, but 32 is sufficient for me.
One more issue.
I can't use any precision other than 32. BF16 doesn't seem to be possible on the card I am running training on (TITAN RTX), but 16 should be. I am getting this error:
RuntimeError: "fill_cuda" not implemented for 'ComplexHalf'
To further explain how I am training: I built the Docker image as described in the README (based on nvcr.io/nvidia/pytorch:22.03-py3). I've mounted the dataset, installed git inside the running container, and cloned this GitHub repo. Then I ran build_monotonic_align.sh and attempted to start training, this time on a single GPU.
First, this issue https://stackoverflow.com/questions/75834134/attributeerror-trainer-object-has-no-attribute-lr-find stopped me, so I downgraded lightning to 1.9 (no version is pinned in the Dockerfile, so the newest gets installed).
But then I got the error above. I've also tried modifying piper_train/vits/config.py and setting fp16_run: bool = False to True, but with no success. The only thing I could find for VITS is the issue jaywalnut310/vits#15, which suggests several possible code modifications. I'd appreciate it if someone could take a look and point me to what I should modify. CUDA is 12.1.
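For reference, the modifications discussed in jaywalnut310/vits#15 mostly amount to keeping the STFT/spectrogram math in float32 even when the rest of the model runs under autocast, because PyTorch's ComplexHalf support is incomplete. A rough sketch of that pattern is below; it is illustrative only, not Piper's actual spectrogram code, and the function and parameter names are made up for the example.

```python
# Sketch of the workaround pattern discussed in jaywalnut310/vits#15: keep the
# STFT in float32 even under mixed-precision training, since several complex
# tensor ops are not implemented for ComplexHalf. Illustrative only; this is
# not Piper's actual spectrogram function.
import torch


def spectrogram_fp32(y: torch.Tensor, n_fft: int = 1024,
                     hop_length: int = 256, win_length: int = 1024) -> torch.Tensor:
    window = torch.hann_window(win_length, device=y.device, dtype=torch.float32)
    # Locally disable autocast and force float32 inputs so torch.stft never
    # produces ComplexHalf tensors.
    with torch.cuda.amp.autocast(enabled=False):
        spec = torch.stft(y.float(), n_fft, hop_length=hop_length,
                          win_length=win_length, window=window,
                          return_complex=True)
        return spec.abs()


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    audio = torch.randn(1, 22050, device=device)
    mag = spectrogram_fp32(audio)
    print(mag.shape, mag.dtype)  # stays torch.float32 even under autocast
```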