Distributed training with different log_vars keys among GPUs hangs the entire work process #1034
Comments
Hi, thanks for your report. If you have time, could you open a PR so we can merge it into the master branch ASAP? Thanks for your great contribution! Best,
As this fix may involve modifications in |
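Hi @fingertap, thank you very much for the good issue and PR. I will close this issue; let's keep in touch in your PR. Best,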
* [#1034] fix dist training infinite waiting issue * print log_vars keys in assertion msg * linting issue
Has this issue been resolved?
Yes, the related PR has fixed it. Please update to the latest mmseg and mmcv for a better experience! Best,
So what was the approach to fixing it?
Include another cross-GPU communication to synchronize the |
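A rough sketch of what such an extra synchronization step could look like, written in plain Python so it runs without GPUs. The helper `check_log_vars_consistent` and the `all_reduce_sum` callable are hypothetical names used only for illustration. In real distributed training the callable would wrap a sum all-reduce across ranks (for example one built on `torch.distributed.all_reduce`); that wiring is an assumption here, not a quote from the PR.

```python
from typing import Callable, Dict


def check_log_vars_consistent(
    log_vars: Dict[str, float],
    rank: int,
    world_size: int,
    all_reduce_sum: Callable[[int], int],
) -> None:
    """Fail fast if ranks report a different number of log_vars keys.

    `all_reduce_sum` is a hypothetical stand-in for a sum all-reduce:
    it takes this rank's value and returns the sum over all ranks.
    """
    total = all_reduce_sum(len(log_vars))
    # If every rank has the same number of keys, the global sum must
    # equal this rank's count multiplied by the number of ranks.
    if total != len(log_vars) * world_size:
        keys = ",".join(sorted(log_vars))
        raise AssertionError(
            "log_vars differ across GPUs! "
            f"rank {rank} len(log_vars): {len(log_vars)} keys: {keys}"
        )


# Simulated two-GPU step: rank 1 reports an extra 'acc' entry.
per_rank = [{"loss": 0.7}, {"loss": 0.6, "acc": 0.9}]


def fake_sum(_local: int) -> int:
    # Pretend all-reduce: every rank sees the same global sum.
    return sum(len(lv) for lv in per_rank)


errors = []
for rank, lv in enumerate(per_rank):
    try:
        check_log_vars_consistent(lv, rank, len(per_rank), fake_sum)
    except AssertionError as exc:
        errors.append(str(exc))
```

Because every rank performs exactly one extra collective call regardless of its keys, the check itself cannot hang, and each rank raises an error that lists its own keys. Note that comparing only the number of keys would not catch ranks that hold the same count of differently named keys.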
Checklist
Describe the bug
I ran into this bug in issue #1030. It is triggered in distributed training when some GPUs have different `log_vars` keys from others (e.g., an accuracy term that is only present when certain conditions are met). The following is pasted from #1030:
Reproduction
The complete code is very complicated. I think the following code can reproduce this issue.
This case is common, as the metrics may depend on the input data (e.g., the accuracy is only output when there is a target in the image). It is especially frequent with a batch size of 1. Passing 0 for the metric is not a good idea, as it does not reflect the real performance of the model.
No modification; I am subclassing from `mmseg` and `mmdet`. Yes, I read the code.
A custom dataset. This bug is dataset-agnostic.
Environment
Run `python mmseg/utils/collect_env.py` to collect necessary environment information and paste it here: this bug is environment-agnostic.
Other environment variables (`$PATH`, `$LD_LIBRARY_PATH`, `$PYTHONPATH`, etc.): not relevant.
Error traceback
It just hangs without errors. Maybe after a long time, there will be a timeout.
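The silent hang can be illustrated without any GPUs. The sketch below assumes each rank issues one blocking collective call per `log_vars` key, and uses a `threading.Barrier` as a stand-in for that collective; `run_rank` and `simulate` are hypothetical helpers, and the timeout exists only so the demo terminates instead of waiting indefinitely.

```python
import threading


def run_rank(rank, log_vars, barrier, results):
    """Issue one blocking 'collective' per key, like a per-key all-reduce."""
    try:
        for _key in log_vars:
            # Stand-in for a collective call: it returns only once every
            # rank has reached it. The timeout is only for this demo.
            barrier.wait(timeout=1.0)
        results[rank] = "finished"
    except threading.BrokenBarrierError:
        results[rank] = "stuck waiting"


def simulate(per_rank_log_vars):
    """Run one thread per simulated rank and report how each one ended."""
    barrier = threading.Barrier(len(per_rank_log_vars))
    results = {}
    threads = [
        threading.Thread(target=run_rank, args=(r, lv, barrier, results))
        for r, lv in enumerate(per_rank_log_vars)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Rank 0 has an extra 'acc' key, so it makes one more collective call.
mismatched = simulate([{"loss": 0.7, "acc": 0.9}, {"loss": 0.6}])
matched = simulate([{"loss": 0.7, "acc": 0.9}, {"loss": 0.6, "acc": 0.8}])
```

In the mismatched case, the rank with the extra key enters a collective that no other rank will ever join, which matches the symptom described here: no traceback, just waiting until some timeout fires.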
Bug fix
This is about allowing users to pass a `None` entry in `log_vars`. To do this, users must guarantee that the keys in `log_vars` are the same on all GPUs. For the entries they want to skip, `mmcv.runner.log_buffer.LogBuffer` should support `None` values. I would like to create a PR for this when time allows.