Fix batch-size on resume for multi-gpu #1942
Conversation
@NanoCode012 I was looking back at the v4.0 release commit 69be8e7, and it seems I might have caused this by moving one line here in train.py.
I had a strange situation where I was running some v3.1 multi-GPU trainings, and when I resumed one I got a CUDA OOM (my original run was already very close to OOM, and perhaps resume used just a little more memory), so I modified a few of these lines to allow for a --batch command on resume. I think I forgot to reset this one back to default.
@NanoCode012 do you think updating the PR to simply revert the earlier change would fix this, or do you think the current PR is best?
I don't think reverting fixes the problem. The problem is on resume: the batch-size should be the previous "total batch-size" at this point (Line 491 in b75c432),
before it is divided here (Line 499 in b75c432).
Edit: if you move that line up, I think you still need this PR.
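For reference, the two lines discussed above look roughly like this in the v4.0-era train.py (a paraphrased, excerpt-style sketch of the DDP-mode block, not the exact source):

```python
# Excerpt-style sketch (paraphrased) of the DDP-mode batch-size handling in train.py.
opt.total_batch_size = opt.batch_size                         # ~Line 491: remember the user-supplied total
if opt.local_rank != -1:                                      # DDP mode: each process gets a share
    assert opt.batch_size % opt.world_size == 0, '--batch-size must be multiple of CUDA device count'
    opt.batch_size = opt.total_batch_size // opt.world_size   # ~Line 499: per-GPU share
```

The point above is that on resume, opt.batch_size must hold the total value again before this division runs, otherwise an already-divided value gets divided a second time.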
@NanoCode012 ok understood! Merging PR.
Fixes #1936
Tested on my own and with the author of the mentioned issue.
Command:
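The exact command used for testing is not preserved above; an illustrative multi-GPU resume invocation for YOLOv5 of that era (assumed, not the author's verbatim command) would look something like:

```shell
# Illustrative only: resume the most recent interrupted run on 2 GPUs.
python -m torch.distributed.launch --nproc_per_node 2 train.py --resume
```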
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Improves the resume-training functionality for multi-GPU YOLOv5 runs.
📊 Key Changes
- Adjusted the train.py script to update opt.batch_size properly when resuming training.
🎯 Purpose & Impact
- Ensures that, on resume, the batch size is restored from the saved total batch size (opt.total_batch_size).
- This update is particularly beneficial for users who run long training sessions that may need to be paused and resumed for various reasons, such as hardware limitations or scheduling constraints. 🔄
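The gist of the change can be sketched as follows (a paraphrase of the resume block with the fix applied, assuming the v4.0-era train.py; get_latest_run and the opt namespace are the repo's own, and this is not the exact diff):

```python
# Excerpt-style sketch of the resume path in train.py with the batch-size fix applied.
import argparse
import os
from pathlib import Path

import yaml

if opt.resume:  # resume an interrupted run
    ckpt = opt.resume if isinstance(opt.resume, str) else get_latest_run()  # given path or most recent run
    assert os.path.isfile(ckpt), 'ERROR: --resume checkpoint does not exist'
    apriori = opt.global_rank, opt.local_rank  # keep the DDP ranks of this launch
    with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
        opt = argparse.Namespace(**yaml.safe_load(f))  # replace opt with the one saved by the previous run
    # Reinstate: batch_size is restored from the saved total so it is divided by
    # world_size exactly once, instead of inheriting an already-divided per-GPU value.
    opt.cfg, opt.weights, opt.resume, opt.batch_size, opt.global_rank, opt.local_rank = \
        '', ckpt, True, opt.total_batch_size, *apriori
```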