[Enhancement] Raising warning when the different number of GPU is set for resuming #1360
base: master
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1360 +/- ##
==========================================
- Coverage 69.14% 69.01% -0.13%
==========================================
Files 162 162
Lines 10746 10765 +19
Branches 1978 1984 +6
==========================================
Hits 7430 7430
- Misses 2927 2944 +17
- Partials 389 391 +2
…GPUs for resuming
if 'config' in checkpoint['meta']:
    config = mmcv.Config.fromstring(
        checkpoint['meta']['config'], file_format='.py')
    previous_gpu_ids = config.get('gpu_ids', None)
    if previous_gpu_ids and len(previous_gpu_ids) > 0 and len(
            previous_gpu_ids) != self.world_size:
        warnings.warn(
            f'The number of GPUs is {len(previous_gpu_ids)} before '
            f'resuming while the number of GPUs is {self.world_size} '
            'after resuming. It is better to set the same number of '
            'GPUs for resuming.')
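The check above can be exercised outside the runner; a minimal sketch, where the helper name `warn_on_gpu_count_change` is ours and stands in for the runner's `resume` context (in the PR, `previous_gpu_ids` comes from the config stored in `checkpoint['meta']` and `world_size` from the current run):

```python
import warnings


def warn_on_gpu_count_change(previous_gpu_ids, world_size):
    """Warn when the checkpoint was saved with a different GPU count.

    Hypothetical standalone version of the check in the diff above:
    ``previous_gpu_ids`` is the ``gpu_ids`` list recovered from the
    checkpoint's stored config, ``world_size`` is the current GPU count.
    """
    if previous_gpu_ids and len(previous_gpu_ids) != world_size:
        warnings.warn(
            f'The number of GPUs is {len(previous_gpu_ids)} before '
            f'resuming while the number of GPUs is {world_size} after '
            'resuming. It is better to set the same number of GPUs '
            'for resuming.')


# Resuming a 4-GPU checkpoint on 2 GPUs triggers the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    warn_on_gpu_count_change([0, 1, 2, 3], world_size=2)
assert len(caught) == 1
```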
Should we keep behavior consistent with BaseRunner, which modifies self._iter?
mmcv/mmcv/runner/base_runner.py
Lines 371 to 382 in b92ea0b
# Re-calculate the number of iterations when resuming
# models with different number of GPUs
if 'config' in checkpoint['meta']:
    config = mmcv.Config.fromstring(
        checkpoint['meta']['config'], file_format='.py')
    previous_gpu_ids = config.get('gpu_ids', None)
    if previous_gpu_ids and len(previous_gpu_ids) > 0 and len(
            previous_gpu_ids) != self.world_size:
        self._iter = int(self._iter * len(previous_gpu_ids) /
                         self.world_size)
        self.logger.info('the iteration number is changed due to '
                         'change of GPU number')
No, there is no solid conclusion about the relation between the optimizer state and the batch size.
So, should we reconsider modifying self._iter in BaseRunner?
mmcv/mmcv/runner/base_runner.py
Lines 369 to 380 in e4b5348
# Re-calculate the number of iterations when resuming
# models with different number of GPUs
if 'config' in checkpoint['meta']:
    config = mmcv.Config.fromstring(
        checkpoint['meta']['config'], file_format='.py')
    previous_gpu_ids = config.get('gpu_ids', None)
    if previous_gpu_ids and len(previous_gpu_ids) > 0 and len(
            previous_gpu_ids) != self.world_size:
        self._iter = int(self._iter * len(previous_gpu_ids) /
                         self.world_size)
        self.logger.info('the iteration number is changed due to '
                         'change of GPU number')
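The re-calculation quoted above reduces to one line of arithmetic: a checkpoint iteration consumed `previous_gpu_count * samples_per_gpu` samples, while a current iteration consumes `world_size * samples_per_gpu`, so the counter is scaled by the ratio of GPU counts. A minimal sketch (the function name `rescale_iter` is ours):

```python
def rescale_iter(cur_iter, previous_gpu_count, world_size):
    """Scale the saved iteration counter to the new GPU count.

    Mirrors the arithmetic in BaseRunner.resume quoted above: the
    number of samples seen so far stays fixed, so fewer GPUs means
    proportionally more iterations, and vice versa.
    """
    return int(cur_iter * previous_gpu_count / world_size)


# e.g. 1000 iterations on 8 GPUs correspond to 2000 iterations on 4 GPUs
assert rescale_iter(1000, 8, 4) == 2000
```

This keeps the number of training samples consumed before the resume point invariant, which is why the reviewers question it: the optimizer state (e.g. momentum buffers) was accumulated under the old effective batch size, and rescaling the counter does not compensate for that.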
# since the optimizer status is relative to the batch size
# (#GPUs x bs/GPU)
if 'config' in checkpoint['meta']:
    config = mmcv.Config.fromstring(
Could also check the difference between the config stored in the checkpoint and the current config.
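The suggestion above could be sketched with a plain text diff of the two config strings; this is only an illustration under our own assumptions (the helper name `config_diff` is ours, and the real comparison would run on the `checkpoint['meta']['config']` text against the current run's config):

```python
import difflib


def config_diff(checkpoint_cfg_text, current_cfg_text):
    """Report line-level differences between the config stored in the
    checkpoint and the config of the current run (hypothetical sketch)."""
    return list(difflib.unified_diff(
        checkpoint_cfg_text.splitlines(),
        current_cfg_text.splitlines(),
        fromfile='checkpoint', tofile='current', lineterm=''))


# A changed gpu_ids line shows up as a -/+ pair in the diff.
diff = config_diff('gpu_ids = [0, 1]', 'gpu_ids = [0]')
assert any(line.startswith('-gpu_ids') for line in diff)
```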
In the future.
I suggest only raising a warning to remind users, because we are not sure about the users' intention.
Might also raise a warning in
Motivation
Raise a warning when a different number of GPUs is set for resuming.
Modification
The resume function in iter_based_runner.
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
No.
Checklist
Before PR:
After PR: