[Fix] Fix dist training infinite waiting issue #1035
Conversation
Hi @fingertap

The unit test fails, but I think it is not due to this fix.

Could you please provide more cases?
What cases do you need? I encountered this issue when using batch size = 1, where the length of log_vars differs across GPUs. This causes some GPUs to leave the for-loop earlier than others, leaving the rest waiting for responses forever.
Will fix it in PR #1037.
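The hang described above can be illustrated without any GPUs: each rank issues one `all_reduce` per logged variable, and a collective call only completes once every rank joins it, so a rank with fewer keys stops calling early and leaves the others blocked. A minimal single-process sketch of that bookkeeping (the `simulate_collectives` helper is hypothetical, purely illustrative):

```python
def simulate_collectives(calls_per_rank):
    """Count how many collective rounds complete when each rank issues
    a different number of all_reduce calls. A round completes only when
    every rank participates, so the rank with the fewest calls caps it."""
    completed = min(calls_per_rank)
    # Calls past that point never complete: those ranks hang.
    pending = [n - completed for n in calls_per_rank]
    return completed, pending

# Rank 0 logs 3 variables, rank 1 only 2 (e.g. its 'acc' key is missing):
print(simulate_collectives([3, 2]))  # (2, [1, 0]): rank 0 blocks on call 3
```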
Could you provide the missing keys in …
This is defined by the user. For example, say I have a model that predicts a segmentation mask for an image, and each image carries a boolean flag. We only calculate the loss when the flag is true, so for images with a false flag we do not need to calculate accuracy. In this case, I may implement something like:

```python
if flag:
    losses['acc'] = accuracy(pred, label)
return losses
```

When parsing this, the length of log_vars then differs across GPUs.
For more details, see the explanation I quoted from #1030.

I see.
We can inform the user about this issue instead of letting the GPUs hang. I did this by triggering an assertion error:

```python
# If log_vars has different lengths, GPUs will wait infinitely
if dist.is_available() and dist.is_initialized():
    log_var_length = torch.tensor(len(log_vars), device=loss.device)
    dist.all_reduce(log_var_length)
    assert log_var_length == len(log_vars) * dist.get_world_size(), \
        'loss log variables are different across GPUs!'
```

To allow users to skip these loss terms, they should explicitly set them to `None`:

```python
# Don't do this
if flag:
    losses['acc'] = accuracy(pred, label)

# Do this
losses['acc'] = accuracy(pred, label) if flag else None
```

It is the users' responsibility to ensure these losses have identical keys across GPUs to prevent synchronization issues. What we need to do is allow such `None` values in `_parse_losses`:

```python
@staticmethod
def _parse_losses(losses):
    """Parse the raw outputs (losses) of the network.

    Args:
        losses (dict): Raw output of the network, which usually contains
            losses and other necessary information.

    Returns:
        tuple[Tensor, dict]: (loss, log_vars), loss is the loss tensor
            which may be a weighted sum of all losses, log_vars contains
            all the variables to be sent to the logger.
    """
    log_vars = OrderedDict()
    for loss_name, loss_value in losses.items():
        if loss_value is None:  # HERE
            log_vars[loss_name] = None
        elif isinstance(loss_value, torch.Tensor):
            log_vars[loss_name] = loss_value.mean()
        elif isinstance(loss_value, list):
            log_vars[loss_name] = sum(_loss.mean() for _loss in loss_value)
        else:
            raise TypeError(
                f'{loss_name} is not a tensor or list of tensors')

    loss = sum(_value for _key, _value in log_vars.items()
               if 'loss' in _key)

    # HERE
    # If log_vars has different lengths, GPUs will wait infinitely
    if dist.is_available() and dist.is_initialized():
        log_var_length = torch.tensor(len(log_vars), device=loss.device)
        dist.all_reduce(log_var_length)
        assert log_var_length == len(log_vars) * dist.get_world_size(), \
            'loss log variables are different across GPUs!'

    log_vars['loss'] = loss
    for loss_name, loss_value in log_vars.items():
        # reduce loss when distributed training
        # HERE
        if dist.is_available() and dist.is_initialized():
            if loss_value is not None:
                sync_value = torch.tensor(
                    [loss_value.data.item(), 1.], device=loss.device)
            else:
                sync_value = torch.tensor([0., 0.], device=loss.device)
            dist.all_reduce(sync_value)
            # calculate the mean if loss_value on some GPUs is not None
            if not sync_value[1] == 0.:
                log_vars[loss_name] = sync_value[0] / sync_value[1]

    log_vars = {k: v.item() for k, v in log_vars.items() if v is not None}
    return loss, log_vars
```
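The `[value, 1.]` / `[0., 0.]` trick above can be checked without torch or multiple processes: summing the pairs across ranks yields `(total, count)`, so dividing gives the mean over only the ranks whose term was not `None`. A torch-free sketch (`reduce_optional_mean` is a hypothetical stand-in for the `all_reduce` of `sync_value`):

```python
def reduce_optional_mean(per_rank_values):
    """Mean over the ranks whose loss term is not None, mimicking the
    summed [value, 1.] / [0., 0.] pairs after all_reduce."""
    total = sum(v for v in per_rank_values if v is not None)
    count = sum(1 for v in per_rank_values if v is not None)
    return total / count if count else None

print(reduce_optional_mean([0.5, None, 1.5, None]))  # -> 1.0
print(reduce_optional_mean([None, None]))            # -> None
```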
However, in this PR I did not include my fix for the …
Hi @fingertap

@hhaAndroid

Merging is blocked because the test_mit test case failed, which I think has nothing to do with my fix.

Done.
This has been fixed in master, so you can merge the upstream master.
The BC-breaking change has been fixed in the master branch, so you can merge the master branch into your branch.
Codecov Report

Attention: Patch coverage is …

```
@@            Coverage Diff             @@
##           master    #1035      +/-   ##
==========================================
- Coverage   89.55%   89.48%   -0.07%
==========================================
  Files         119      119
  Lines        6626     6631       +5
  Branches     1034     1035       +1
==========================================
  Hits         5934     5934
- Misses        488      492       +4
- Partials      204      205       +1
```

Flags with carried forward coverage won't be shown.
Fixed the linting issue and pulled the upstream master.
Hi @fingertap

Hi @hhaAndroid
* [open-mmlab#1034] fix dist training infinite waiting issue
* print log_vars keys in assertion msg
* linting issue
Motivation
If log_vars has different lengths on different GPUs, they will wait infinitely. This PR fixes the problem by raising an assertion error when different GPUs have different lengths of log_vars.
Related: #1034
Modification
Add a cross-GPU communication step to determine whether the GPUs have the same log_vars length; if not, raise an assertion error.
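The check can be sketched without initializing a process group: after the `all_reduce`, the summed length equals `local_len * world_size` on every rank only when all ranks logged the same number of variables. A torch-free illustration (`check_log_var_lengths` is a hypothetical helper, not part of the PR):

```python
def check_log_var_lengths(lengths_per_rank):
    """True only if every rank's summed length matches
    local_len * world_size, i.e. all ranks agree on the key count."""
    world_size = len(lengths_per_rank)
    total = sum(lengths_per_rank)  # what all_reduce(SUM) would produce
    return all(total == n * world_size for n in lengths_per_rank)

print(check_log_var_lengths([3, 3, 3, 3]))  # True: lengths agree
print(check_log_var_lengths([3, 2, 3, 3]))  # False: the assertion would fire
```

Note that matching lengths do not guarantee matching keys; as discussed above, keeping the key sets identical remains the user's responsibility.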
BC-breaking (Optional)
None.
Use cases (Optional)
None.
Checklist