[Fix] Avoid infinite GPU waiting in dist training #6501

Merged 5 commits on Nov 24, 2021

10 changes: 10 additions & 0 deletions mmdet/models/detectors/base.py
@@ -198,6 +198,16 @@ def _parse_losses(self, losses):
loss = sum(_value for _key, _value in log_vars.items()
           if 'loss' in _key)

# If log_vars has a different length on different GPUs, the GPUs will wait infinitely
Collaborator:

Suggest using rank, world_size = get_dist_info(), then using world_size to decide whether to synchronize.
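
A minimal sketch of the suggested gating, assuming mmcv's get_dist_info helper (the _should_sync name is only illustrative, not part of the PR):

from mmcv.runner import get_dist_info

def _should_sync():
    # get_dist_info() returns (rank, world_size) and falls back to (0, 1)
    # when torch.distributed is not initialized, so world_size > 1 is
    # enough to decide whether the length check needs to run.
    rank, world_size = get_dist_info()
    return world_size > 1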

Contributor Author:

No need to broadcast the error. At least two GPUs will have a len(log_vars) that differs from the mean, so users can compare the error logs across ranks to determine the missing loss terms.

if dist.is_available() and dist.is_initialized():
    log_var_length = torch.tensor(len(log_vars), device=loss.device)
    dist.all_reduce(log_var_length)
    message = (f'rank {dist.get_rank()}' +
               f' len(log_vars): {len(log_vars)}' + ' keys: ' +
               ','.join(log_vars.keys()))
    assert log_var_length == len(log_vars) * dist.get_world_size(), \
Collaborator:

Once an error occurs, print out all the keys in each process to facilitate troubleshooting.

Contributor Author:

Does mmdet provide any thread-safe print utilities?

Collaborator:

No need to worry about that; the message only needs to be printed by the assert, and the keys and the rank should be printed together.

Contributor Author:

Also, how can the other GPUs know that this GPU raised an error?

Contributor Author:

It seems that if we want to inform all GPUs about the assertion error, we have to do one more round of communication among the GPUs. If this overhead is okay, I will add it to this PR.
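
A hedged sketch of what that extra round of communication could look like; the all_ranks_ok helper below is hypothetical and not part of this PR:

import torch
import torch.distributed as dist

def all_ranks_ok(local_ok, device):
    # Share a pass/fail flag so that every rank learns about a failure,
    # not only the rank whose own assert fired.
    flag = torch.tensor(int(local_ok), device=device)
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)  # becomes 0 if any rank failed
    return bool(flag.item())

The caller would then raise on every rank when the helper returns False, at the cost of one extra all_reduce per iteration, which is the overhead discussed here.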

Collaborator:

There is no need to distinguish which GPU is wrong; just print all the keys, and the user can compare the outputs to determine which key is wrong, for example:

assert False, f'{...get_rank(), log_vars.keys(), len(log_vars.keys())...}'

Contributor Author:

Yes, you are right. At least two GPUs will raise the exception, so users can compare their error messages.

Collaborator:

Yes

        'loss log variables are different across GPUs!\n' + message

log_vars['loss'] = loss
for loss_name, loss_value in log_vars.items():
# reduce loss when distributed training
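
For context, a hypothetical two-process reproduction of the failure mode this check guards against (not part of the PR; the script and its loss keys are made up for illustration and could be launched with something like torchrun --nproc_per_node=2 repro.py). Without the length check, the rank with an extra loss key issues more per-key all_reduce calls than its peer and every rank blocks forever; with the check, the mismatch surfaces as an assertion error instead.

import torch
import torch.distributed as dist

def main():
    dist.init_process_group('gloo')
    rank = dist.get_rank()

    # Rank 1 produces an extra loss term, e.g. a head that only yields a
    # loss for some samples.
    log_vars = {'loss_cls': torch.tensor(1.0), 'loss_bbox': torch.tensor(2.0)}
    if rank == 1:
        log_vars['loss_mask'] = torch.tensor(0.5)

    # The check from this PR: sum the number of keys across ranks and
    # compare it with what a consistent dict would give.
    log_var_length = torch.tensor(len(log_vars))
    dist.all_reduce(log_var_length)
    assert log_var_length == len(log_vars) * dist.get_world_size(), \
        f'rank {rank} keys: {",".join(log_vars.keys())}'

if __name__ == '__main__':
    main()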