[Fix] Avoid infinite GPU waiting in dist training #6501
Changes from all commits: cb262cd, 2d26a9e, 27ecb5e, cc61423, 5805be0
@@ -198,6 +198,16 @@ def _parse_losses(self, losses):
         loss = sum(_value for _key, _value in log_vars.items()
                    if 'loss' in _key)

+        # If the loss_vars has different length, GPUs will wait infinitely
+        if dist.is_available() and dist.is_initialized():
+            log_var_length = torch.tensor(len(log_vars), device=loss.device)
+            dist.all_reduce(log_var_length)
+            message = (f'rank {dist.get_rank()}' +
+                       f' len(log_vars): {len(log_vars)}' + ' keys: ' +
+                       ','.join(log_vars.keys()))
+            assert log_var_length == len(log_vars) * dist.get_world_size(), \
+                'loss log variables are different across GPUs!\n' + message
+
         log_vars['loss'] = loss
         for loss_name, loss_value in log_vars.items():
             # reduce loss when distributed training

Review discussion on the new assert:

Once an error occurs, print out all the keys in each process to facilitate troubleshooting.

Does mmdet provide any thread-safe print utilities?

No need to think about that; the keys and the rank only need to be printed after the assert, and they need to be printed at the same time.

Also, how can the other GPUs know that this GPU raised an error?

It seems that if we want to inform all GPUs about the assertion error, we have to do one more communication among GPUs. If this overhead is okay, I will add it to this PR.

There is no need to distinguish which GPU is wrong; just print all the keys, and the user can check the output to determine which key is wrong.

Yes, you are right. At least two GPUs will raise the exception, so users can compare the error messages.

Yes.
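For reference, the "one more communication" mentioned above could look roughly like the sketch below. This is not part of the PR; the helper name, the MAX-reduced error flag, and the RuntimeError message are illustrative assumptions. The idea is that every rank reduces a 0/1 mismatch flag, so all ranks raise together instead of only the ranks whose length deviates.

import torch
import torch.distributed as dist


def check_log_vars_consistent(log_vars, device):
    # Hypothetical helper (not in mmdet): after summing the lengths, a second
    # MAX all_reduce propagates a 0/1 error flag so that every rank raises
    # instead of blocking in a later collective.
    world_size = dist.get_world_size()
    total_length = torch.tensor(len(log_vars), device=device)
    dist.all_reduce(total_length)  # sum of len(log_vars) over all ranks
    mismatch = torch.tensor(
        int(total_length.item() != len(log_vars) * world_size), device=device)
    dist.all_reduce(mismatch, op=dist.ReduceOp.MAX)  # the extra communication
    if mismatch.item():
        raise RuntimeError(
            'loss log variables are different across GPUs! '
            f'rank {dist.get_rank()} len(log_vars): {len(log_vars)} '
            f'keys: {",".join(log_vars.keys())}')

As the discussion concludes, the PR does not add this extra collective: at least two ranks already fail the plain assert, which is enough to compare keys across the error logs.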
Suggestion from review: use rank, world_size = get_dist_info(), and then use world_size to decide whether to synchronize.
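A minimal sketch of that suggestion, assuming mmcv's get_dist_info() from mmcv.runner (it returns rank 0 and world_size 1 when distributed training is not initialized); the helper name is made up, and log_vars and the device would come from the surrounding _parse_losses:

import torch
import torch.distributed as dist
from mmcv.runner import get_dist_info


def _assert_log_vars_consistent(log_vars, device):
    # Illustrative variant of the PR's check that uses world_size from
    # get_dist_info() instead of dist.is_available()/dist.is_initialized().
    rank, world_size = get_dist_info()
    if world_size == 1:
        return
    log_var_length = torch.tensor(len(log_vars), device=device)
    dist.all_reduce(log_var_length)
    message = (f'rank {rank} len(log_vars): {len(log_vars)} '
               f'keys: {",".join(log_vars.keys())}')
    assert log_var_length == len(log_vars) * world_size, \
        'loss log variables are different across GPUs!\n' + message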
Reply: no need to broadcast the error. At least two GPUs will have their len(log_vars) different from the mean, so users can compare the error logs to determine the missing loss terms.
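To make the failure mode concrete, here is a minimal two-process reproduction sketch (CPU with the gloo backend; the loss keys, values, and port are made up). Rank 0 logs an extra 'loss_aux' term, so the reduced length is 5 while len(log_vars) * world_size is 6 on rank 0 and 4 on rank 1; both ranks therefore fail the assert and print their own keys instead of one rank hanging in a later per-key all_reduce.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29501'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    # Made-up per-rank loss dicts: rank 0 logs one extra term.
    log_vars = {'loss_cls': torch.tensor(1.0), 'loss_bbox': torch.tensor(2.0)}
    if rank == 0:
        log_vars['loss_aux'] = torch.tensor(0.5)

    # The same length check that the PR adds to _parse_losses.
    log_var_length = torch.tensor(len(log_vars))
    dist.all_reduce(log_var_length)
    message = (f'rank {dist.get_rank()} len(log_vars): {len(log_vars)} '
               f'keys: {",".join(log_vars.keys())}')
    assert log_var_length == len(log_vars) * dist.get_world_size(), \
        'loss log variables are different across GPUs!\n' + message

    dist.destroy_process_group()


if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)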