Distributed training with different log_vars keys among GPUs hangs the entire work process #1034

fingertap · 2021-11-12T08:47:04Z

Checklist

I have searched related issues but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug

I ran into this bug in issue #1030. It is triggered in distributed training when some GPUs have different log_vars keys from others (e.g., an accuracy term that is present only when certain conditions are met).

The following is pasted from #1030:

In the _parse_losses function of mmseg.segmentors.base.BaseSegmentor, the loss values are synchronized among all GPUs. The problem occurs in this loop (line 194):
for loss_name, loss_value in log_vars.items():
    # reduce loss when distributed training
    if dist.is_available() and dist.is_initialized():
        loss_value = loss_value.data.clone()
        dist.all_reduce(loss_value.div_(dist.get_world_size()))
    log_vars[loss_name] = loss_value.item()
Suppose GPU A does not have "roi_acc" as a loss_name, and suppose it is the last key in log_vars on the other GPUs. GPU A then considers its work done and exits the loop. The other GPUs, which do have "roi_acc", call torch.distributed.all_reduce for it and wait indefinitely for a reply from GPU A.
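The mismatch can be illustrated without any GPUs. The following is a minimal sketch in plain Python, assuming each rank issues one collective call per key in dictionary order. The helper name find_stalled_step is hypothetical and is not part of mmseg or PyTorch.

```python
def find_stalled_step(per_rank_log_vars):
    """Return (step, waiting_ranks, finished_ranks) for the first collective
    call that not every rank participates in, or None if all ranks agree.

    Simplification: only the number of calls per rank is compared, not the
    key names themselves.
    """
    # Each rank issues one all_reduce per key, in dict insertion order.
    schedules = [list(log_vars) for log_vars in per_rank_log_vars]
    for step in range(max((len(s) for s in schedules), default=0)):
        waiting = [rank for rank, s in enumerate(schedules) if step < len(s)]
        finished = [rank for rank, s in enumerate(schedules) if step >= len(s)]
        if finished:
            # Ranks in `waiting` block in all_reduce; ranks in `finished`
            # have already left the loop and never answer.
            return step, waiting, finished
    return None


# Rank 0 plays the role of "GPU A": it has no "roi_acc" entry.
rank0 = {"loss": 0.5}
rank1 = {"loss": 0.4, "roi_acc": 0.9}
print(find_stalled_step([rank0, rank1]))  # (1, [1], [0])
```

Here the second collective call (step 1) is issued only by rank 1, which is the situation that hangs in a real run.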

Reproduction

What command or script did you run?

The complete code is very complicated. I think the following code can reproduce this issue.

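import torch
import torch.distributed as dist

def forward_train(self, img, img_metas, **kwargs):
    losses = dict()
    rank = dist.get_rank()
    # Every rank except rank 6 reports 'acc', so the keys differ across GPUs.
    if rank != 6:
        losses['acc'] = torch.tensor(1.)
    return losses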

This case is common, as metrics may depend on the input data (e.g., the acc is only output when there is a target in the image). It is especially frequent with a batch size of 1. Passing 0 for the metric is not a good idea, as it would not reflect the real performance of the model.

Did you make any modifications on the code or config? Did you understand what you have modified?

No modification. I am subclassing from mmseg and mmdet. Yes, I read the code.

What dataset did you use?

A custom dataset. This bug is dataset-agnostic.

Environment

Please run python mmseg/utils/collect_env.py to collect necessary environment information and paste it here.

This bug is environment-agnostic.

You may add additional information that may be helpful for locating the problem, such as
- How you installed PyTorch [e.g., pip, conda, source]
- Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Not relevant.

Error traceback

It just hangs without errors. Maybe after a long time, there will be a timeout.

Bug fix

This is about allowing users to pass None values in log_vars. To do this, users must guarantee that the keys in log_vars are the same on all GPUs. For the log_vars they want to skip, mmcv.runner.log_buffer.LogBuffer should support None values. I would like to create a PR for this when time allows.
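To make the proposal concrete, here is a rough sketch of a logging buffer that accepts None as "skip this iteration" while keeping the key set identical. The class SkippingLogBuffer and its methods are hypothetical and do not reproduce the mmcv LogBuffer API. The sketch covers only local averaging, not the cross-GPU reduction.

```python
from collections import defaultdict


class SkippingLogBuffer:
    """Hypothetical buffer that averages values and ignores None entries."""

    def __init__(self):
        self._sums = defaultdict(float)
        self._counts = defaultdict(int)

    def update(self, log_vars):
        for name, value in log_vars.items():
            # Register the key even when it is skipped, so the set of keys
            # stays the same regardless of which values were None.
            self._counts[name] += 0
            if value is None:
                continue
            self._sums[name] += float(value)
            self._counts[name] += 1

    def average(self):
        # A key that was always None has no samples and averages to None.
        return {
            name: (self._sums[name] / count if count else None)
            for name, count in self._counts.items()
        }


buf = SkippingLogBuffer()
buf.update({"loss": 1.0, "acc": None})  # no target in this image
buf.update({"loss": 3.0, "acc": 0.5})
print(buf.average())  # {'loss': 2.0, 'acc': 0.5}
```

Skipped entries do not pull the average toward 0, which is the concern raised above about passing 0 for a missing metric.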


MengzhangLI · 2021-11-12T08:56:55Z

Hi, thanks for your report.

If you have time, could you make a PR so we can merge it into the master branch ASAP?

Thanks for your great contribution!

Best,

fingertap · 2021-11-12T09:03:49Z

As this fix may involve modifications in mmcv too, I'm not sure how to connect PRs between the two repos.

MengzhangLI · 2021-11-15T08:29:25Z

Hi, @fingertap
Thank you sooo much for your great issue & PR.
I will close this issue and let’s keep in touch in your PR.
Best,

* [#1034] fix dist training infinite waiting issue
* print log_vars keys in assertion msg
* linting issue

TAICHIKF · 2022-02-13T13:56:16Z

Hi, @fingertap Thank you so much for your great issue and PR. I will close this issue and let's keep in touch in your PR. Best,

Has this issue been resolved?

MengzhangLI · 2022-02-13T14:32:05Z

Yes, the related PR has fixed it. Please update to the latest mmseg and mmcv for a better experience!

Best,

TAICHIKF · 2022-02-13T14:40:08Z

So what is the idea behind the fix?

fingertap · 2022-02-13T14:59:33Z

Include another cross-GPU communication to synchronize the log_vars before reducing losses.
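One way to read this: before the per-key reduction, every rank takes part in one extra collective that compares how many keys each rank holds, and an assertion fails loudly instead of letting the run hang. Below is a hedged sketch of that idea. The names check_log_var_keys and torch_all_reduce_sum are hypothetical, the all-reduce is injected as a callable so the check can be exercised without a process group, and this is not claimed to be the exact code merged in the PR.

```python
def check_log_var_keys(log_vars, world_size, rank, all_reduce_sum):
    """Fail fast if ranks hold a different number of log_vars keys.

    all_reduce_sum: callable(int) -> int returning the sum over all ranks.
    Every rank must call this function, whatever its own keys are.
    """
    total = all_reduce_sum(len(log_vars))
    if total != len(log_vars) * world_size:
        raise AssertionError(
            "log_vars differ across GPUs: "
            f"rank {rank} has {len(log_vars)} keys: {', '.join(log_vars)}"
        )


def torch_all_reduce_sum(value, device="cpu"):
    """Sum an integer across ranks with torch.distributed (assumes an
    initialized process group; with the NCCL backend pass a CUDA device)."""
    import torch
    import torch.distributed as dist

    tensor = torch.tensor(value, device=device)
    dist.all_reduce(tensor)  # the default reduce op is SUM
    return int(tensor.item())


# Single-process demonstration with a fake all-reduce: this rank has one key,
# while the simulated peer contributes two.
try:
    check_log_var_keys({"loss": 0.5}, world_size=2, rank=0,
                       all_reduce_sum=lambda n: n + 2)
except AssertionError as err:
    print(err)
```

Because only key counts are compared, this sketch would not catch two ranks that hold the same number of keys under different names. Printing the keys in the assertion message makes such cases easier to diagnose by hand.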


fingertap mentioned this issue Nov 12, 2021

Distributed training hangs due to missing keys in mmseg.segmentors.base.BaseSegmentor._parse_losses #1030

Closed

Junjun2016 mentioned this issue Nov 15, 2021

[Fix] Fix dist training infinite waiting issue #1035

Merged

4 tasks

MengzhangLI closed this as completed Nov 15, 2021

bowenroom pushed a commit to bowenroom/mmsegmentation that referenced this issue Feb 25, 2022

[Fix] Fix dist training infinite waiting issue (open-mmlab#1035)

e06cbcd

