Distributed training with different log_vars keys among GPUs hangs the entire work process #1034

Closed
2 tasks done
fingertap opened this issue Nov 12, 2021 · 7 comments

@fingertap
Contributor

fingertap commented Nov 12, 2021

Checklist

  • I have searched related issues but cannot get the expected help.
  • The bug has not been fixed in the latest version.

Describe the bug

I ran into this bug in issue #1030. It is triggered in distributed training when some GPUs have different log_vars keys from the others (e.g., an accuracy term that is present only when certain conditions are met).

The following is pasted from #1030:

The _parse_losses function of mmseg.models.segmentors.base.BaseSegmentor attempts to synchronize the loss values across all GPUs. The problem occurs in this loop (line 194):

for loss_name, loss_value in log_vars.items():
    # reduce loss when distributed training
    if dist.is_available() and dist.is_initialized():
        loss_value = loss_value.data.clone()
        dist.all_reduce(loss_value.div_(dist.get_world_size()))
    log_vars[loss_name] = loss_value.item()

Suppose one GPU, A, does not have "roi_acc" as a loss_name, and that "roi_acc" is the last key in log_vars on the other GPUs. GPU A then considers its work done and exits the loop. The other GPUs still have "roi_acc" to process, so they call torch.distributed.all_reduce, which waits indefinitely for a reply from GPU A.
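
The hang can be illustrated without any GPUs. The following is a minimal pure-Python sketch, not mmseg code. It assumes that each rank issues one collective call per key in its log_vars, in order, and that a collective completes only when every rank has issued it. The function name find_stalled_steps, the rank labels, and the key "loss_seg" are made up for illustration.

```python
def find_stalled_steps(keys_per_rank):
    """Return {step: (waiting_ranks, absent_ranks)} for collectives that never complete.

    keys_per_rank maps a rank label to the ordered list of log_vars keys
    on that rank. Each key stands for one all_reduce call.
    """
    longest = max(len(keys) for keys in keys_per_rank.values())
    stalled = {}
    for step in range(longest):
        # Ranks that issue a collective at this step
        waiting = sorted(r for r, keys in keys_per_rank.items() if step < len(keys))
        # Ranks that have already left the loop
        absent = sorted(r for r, keys in keys_per_rank.items() if step >= len(keys))
        if absent:
            stalled[step] = (waiting, absent)
    return stalled


keys_per_rank = {
    "A": ["loss_seg"],             # rank A has no "roi_acc"
    "B": ["loss_seg", "roi_acc"],
    "C": ["loss_seg", "roi_acc"],
}
stalled = find_stalled_steps(keys_per_rank)
print(stalled)  # {1: (['B', 'C'], ['A'])}
```

At step 1, ranks B and C are blocked in a collective that rank A never joins, which matches the behavior described above.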

Reproduction

  1. What command or script did you run?

The complete code is very complicated. I think the following code can reproduce this issue.

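import torch
import torch.distributed as dist

def forward_train(self, img, img_metas, **kwargs):
    losses = dict()
    rank = dist.get_rank()
    # Every rank except rank 6 reports 'acc', so the keys differ across GPUs
    if rank != 6:
        losses['acc'] = torch.tensor(1.)
    return losses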

This case is common, because metrics may depend on the input data (e.g., acc is output only when the image contains a target). It is especially frequent with batchsize=1. Passing 0 for the metric is not a good idea, because it would not reflect the real performance of the model.

  2. Did you make any modifications on the code or config? Did you understand what you have modified?

No modifications. I am subclassing from mmseg and mmdet. Yes, I have read the code.

  3. What dataset did you use?

A custom dataset. This bug is dataset-agnostic.

Environment

  1. Please run python mmseg/utils/collect_env.py to collect necessary environment information and paste it here.

This bug is environment-agnostic.

  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Not relevant.

Error traceback

It just hangs without errors. Maybe after a long time, there will be a timeout.

Bug fix

The fix is to allow users to pass None values in log_vars. For this to work, users must guarantee that the keys in log_vars are the same on all GPUs. For the log_vars they want to skip, mmcv.runner.log_buffer.LogBuffer should support None values. I would like to create a PR for this when time allows.
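
One way to read this proposal: every rank reports the same keys, and a rank with nothing to report contributes None instead of dropping the key. Below is a minimal sketch of that idea. It assumes the reduction sums both the values and a "valid" count across ranks, so in real training each of the two sums would be one all_reduce. The function mean_ignoring_none is hypothetical and is not part of mmcv or mmseg.

```python
def mean_ignoring_none(values_per_rank):
    """Average one metric over ranks, skipping ranks that reported None.

    Every rank contributes a (value, valid) pair, so all ranks take part
    in the same two sums even when they have nothing to report.
    """
    pairs = [(0.0, 0) if v is None else (float(v), 1) for v in values_per_rank]
    total = sum(value for value, _ in pairs)  # stands in for an all_reduce of values
    count = sum(valid for _, valid in pairs)  # stands in for an all_reduce of counts
    return total / count if count > 0 else None


print(mean_ignoring_none([1.0, None, 0.5]))  # 0.75
print(mean_ignoring_none([None, None]))      # None
```

Because a rank without a value still participates with (0.0, 0), no rank skips a collective call, and the average is not pulled toward zero by the missing entries.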

@MengzhangLI
Contributor

Hi, thanks for your report.

If you have time, could you open a PR so we can merge it into the master branch ASAP?

Thanks for your great contribution!

Best,

@fingertap
Contributor Author

As this fix may involve modifications to mmcv too, I'm not sure how to link PRs between the two repos.

@MengzhangLI
Contributor

Hi, @fingertap
Thank you sooo much for your great issue & PR.
I will close this issue and let's keep in touch in your PR.
Best,

Junjun2016 pushed a commit that referenced this issue Dec 8, 2021

* [#1034] fix dist training infinite waiting issue

* print log_vars keys in assertion msg

* linting issue
@TAICHIKF

Hi, @fingertap Thank you so much for your great issue and PR. I will close this issue, and let's keep in touch in your PR. Best,

Has this issue been resolved?

@MengzhangLI
Contributor

Hi, @fingertap Thank you so much for your great issue and PR. I will close this issue, and let's keep in touch in your PR. Best,

Has this issue been resolved?

Yes, the related PR has fixed it. Please update to the latest mmseg and mmcv for a better experience!

Best,

@TAICHIKF

TAICHIKF commented Feb 13, 2022 via email

@fingertap
Contributor Author

Include another cross-GPU communication to synchronize the log_vars before reducing losses.
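
A rough sketch of what such an extra synchronization step could look like is given below. It is not the merged code. It assumes a helper, here called all_reduce_sum, that returns the sum of an integer over all ranks (for example a thin wrapper around torch.distributed.all_reduce on a one-element tensor). That helper, the function name assert_same_log_vars, and the key "loss_seg" are hypothetical; a single-process stand-in is used so the logic can be run without GPUs.

```python
def assert_same_log_vars(log_vars, rank, world_size, all_reduce_sum):
    """Raise instead of hanging when ranks disagree on the number of log_vars.

    all_reduce_sum is a hypothetical callable that returns the sum of an
    integer over all ranks.
    """
    total = all_reduce_sum(len(log_vars))
    # If every rank has the same number of keys, the sum is len * world_size
    if total != len(log_vars) * world_size:
        keys = ",".join(log_vars.keys())
        raise RuntimeError(
            "log_vars differ across GPUs: "
            f"rank {rank} has {len(log_vars)} keys: {keys}"
        )


def make_fake_all_reduce(other_lengths):
    """Single-process stand-in: add the key counts held by the other ranks."""
    return lambda local: local + sum(other_lengths)


# Two other ranks, each holding 2 keys
all_reduce_sum = make_fake_all_reduce([2, 2])

# Passes: this rank also has 2 keys
assert_same_log_vars({"loss_seg": 0.3, "roi_acc": 0.9}, 0, 3, all_reduce_sum)

# Raises: this rank is missing "roi_acc"
try:
    assert_same_log_vars({"loss_seg": 0.3}, 0, 3, all_reduce_sum)
except RuntimeError as err:
    print(err)
```

Note that this sketch compares key counts only, so ranks with the same number of keys but different key names would go undetected.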

3 participants