I have two NVIDIA A10 24GB GPUs. When I start the finetune.py script I receive this error with batch size 24. Can you help me understand what the real problem is? Thanks.
I encountered this error as well. However, in my case, after I added `os.environ["CUDA_VISIBLE_DEVICES"] = "1"` to restrict training to a single GPU, the problem was resolved.
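For reference, a minimal sketch of this workaround (the device index and the exact spot in finetune.py are assumptions; the key point is that the variable must be set before torch initializes CUDA):

```python
import os

# Restrict this process to a single GPU. This must run before the first
# CUDA call (i.e. before torch initializes CUDA), otherwise it has no effect.
# "1" selects the second A10; use "0" for the first.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
print(torch.cuda.device_count())  # expected to print 1
```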
Thank you for your response. Indeed, this solved the problem, and for some reason it is only with this single-GPU launch that 24 GB of memory is enough for training.
Training on multiple GPUs is possible with torchrun; you'll double the effective batch size and halve the training time. Take care to verify the math by hand so that the number of gradient accumulation steps is right (see the sketch below).
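A hedged sketch of that arithmetic, assuming a launch like `torchrun --nproc_per_node=2 finetune.py` and the common batch_size / micro_batch_size convention of finetune-style scripts (the variable names and values here are illustrative assumptions, not the script's exact code):

```python
# Effective (global) batch size you want per optimizer step.
batch_size = 24
# Per-device batch size that fits in 24 GB on one A10 (assumed value).
micro_batch_size = 4
# Number of processes torchrun launches (one per GPU).
world_size = 2

# Each forward/backward pass processes world_size * micro_batch_size samples,
# so gradient accumulation must make up the remainder of the global batch.
gradient_accumulation_steps = batch_size // (micro_batch_size * world_size)

# Sanity-check that the numbers divide evenly, otherwise the effective
# batch size silently drifts from what you intended.
assert gradient_accumulation_steps * micro_batch_size * world_size == batch_size

print(gradient_accumulation_steps)  # 3
```

If the division is not exact, either adjust micro_batch_size or the global batch_size until it is; that is the "verify the math by hand" step.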