
[Fix] Fix dist training infinite waiting issue #1035

Merged: 7 commits merged into open-mmlab:master on Dec 8, 2021

Conversation

@fingertap (Contributor) commented Nov 13, 2021

Motivation

If log_vars has a different length on different GPUs, the GPUs will wait for each other indefinitely. This PR fixes this by raising an assertion error when different GPUs have log_vars of different lengths.

Related: #1034

Modification

Add a cross-GPU communication step to check whether all GPUs have the same number of entries in log_vars. If not, raise an assertion error, as sketched below.
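
In short, the check boils down to the snippet quoted later in this thread, placed in _parse_losses before the per-key all_reduce loop:

# Every rank contributes its own log_vars length; after the all_reduce the
# sum must equal len(log_vars) * world_size, otherwise some rank has extra
# or missing keys, and we fail fast instead of hanging.
if dist.is_available() and dist.is_initialized():
    log_var_length = torch.tensor(len(log_vars), device=loss.device)
    dist.all_reduce(log_var_length)
    assert log_var_length == len(log_vars) * dist.get_world_size(), \
        'loss log variables are different across GPUs!'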

BC-breaking (Optional)

None.

Use cases (Optional)

None.

Checklist

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  • If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDet or MMDet3D.
  • The documentation has been modified accordingly, like docstring or example tutorials.

@CLAassistant commented Nov 13, 2021

CLA assistant check
All committers have signed the CLA.

@Junjun2016 (Collaborator)

Hi @fingertap
Thanks for your contribution.

@fingertap (Contributor, Author)

The unittest fails, but I think it is not due to this fix.

@Junjun2016 (Collaborator)

Could you please provide more cases? Note that acc will also be present in log_vars.

@fingertap (Contributor, Author)

What cases do you need? I encountered this issue when using a batch size of 1, where the length of log_vars differs across GPUs. This causes some GPUs to leave the for-loop earlier than others, leaving the rest waiting for responses forever.

@Junjun2016 (Collaborator)

The unittest fails, but I think it is not due to this fix.

Will fix it in this PR #1037.

@Junjun2016 (Collaborator)

What cases do you need? I encountered this issue when using a batch size of 1, where the length of log_vars differs across GPUs. This causes some GPUs to leave the for-loop earlier than others, leaving the rest waiting for responses forever.

Could you provide the missing keys in log_vars?

@fingertap (Contributor, Author) commented Nov 14, 2021

What cases do you need? I encountered this issue when using a batch size of 1, where the length of log_vars differs across GPUs. This causes some GPUs to leave the for-loop earlier than others, leaving the rest waiting for responses forever.

Could you provide the missing keys in log_vars?

This is defined by the user. For example, say I have a model that predicts a segmentation mask for an image, and each image carries a boolean flag. We only calculate the loss when the flag is true, so for images with a false flag we do not need to calculate the accuracy. In this case, I might implement something like this in my forward_train:

if flag:
    losses['acc'] = accuracy(pred, label)
return losses

When parsing this losses variable, the GPUs will hang. It is hard to find the cause when the code is complicated.

@fingertap (Contributor, Author)

For more details, I quote the explanation from #1030:

The _parse_losses function of mmseg.models.segmentors.base.BaseSegmentor attempts to synchronize the loss values among all GPUs. What happens is that in this loop (line 194):

for loss_name, loss_value in log_vars.items():
    # reduce loss when distributed training
    if dist.is_available() and dist.is_initialized():
        loss_value = loss_value.data.clone()
        dist.all_reduce(loss_value.div_(dist.get_world_size()))
    log_vars[loss_name] = loss_value.item()

Suppose one GPU, A, does not have "roi_acc" as a loss_name (and suppose it is the last key in log_vars). GPU A then thinks it has finished all its work and jumps out of the loop. The other GPUs, which do have the final "roi_acc", call torch.distributed.all_reduce for it and wait forever for a reply from GPU A.
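
For illustration only (not part of this PR), a minimal, hypothetical standalone script reproduces the same kind of hang: rank 0 issues one fewer all_reduce call than rank 1, so rank 1 blocks forever on its last collective (the script deliberately never terminates):

# Hypothetical repro, not from the PR: all ranks must call all_reduce the
# same number of times; if one rank calls it fewer times, the others block.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    # Mimic log_vars with a missing 'acc' key on rank 0.
    keys = ['loss'] if rank == 0 else ['loss', 'acc']
    for _ in keys:
        value = torch.tensor(1.0)
        dist.all_reduce(value)  # rank 1 hangs here on its second iteration


if __name__ == '__main__':
    mp.spawn(run, args=(2,), nprocs=2)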

@Junjun2016 (Collaborator)

What cases do you need? I encountered this issue when using a batch size of 1, where the length of log_vars differs across GPUs. This causes some GPUs to leave the for-loop earlier than others, leaving the rest waiting for responses forever.

Could you provide the missing keys in log_vars?

This is defined by the user. For example, say I have a model that predicts a segmentation mask for an image, and each image carries a boolean flag. We only calculate the loss when the flag is true, so for images with a false flag we do not need to calculate the accuracy. In this case, I might implement something like this in my forward_train:

if flag:
    losses['acc'] = accuracy(pred, label)
return losses

When parsing this losses variable, the GPUs will hang. It is hard to find the cause when the code is complicated.

I see.
If this is a general case for bs=1 on a customized dataset, how can we avoid or fix the bug?

@fingertap (Contributor, Author) commented Nov 14, 2021

We can inform the user about this issue instead of letting the GPUs hang. I did this by raising an assertion error:

# If log_vars has different lengths on different GPUs, they will wait infinitely
if dist.is_available() and dist.is_initialized():
    log_var_length = torch.tensor(len(log_vars), device=loss.device)
    dist.all_reduce(log_var_length)
    assert log_var_length == len(log_vars) * dist.get_world_size(), \
        'loss log variables are different across GPUs!'

To allow users to skip these loss terms, they should explicitly set the skipped terms to None. To illustrate:

# Don't do this
if flag:
    losses['acc'] = accuracy(pred, label)

# Do this
losses['acc'] = accuracy(pred, label) if flag else None

It is the user's responsibility to ensure these losses have identical keys across GPUs to prevent synchronization issues. What we need to do is allow such None values in log_vars. The complete fix I came up with is below (modifications are marked with # HERE):

@staticmethod
def _parse_losses(losses):
    """Parse the raw outputs (losses) of the network.

    Args:
        losses (dict): Raw output of the network, which usually contain
            losses and other necessary information.

    Returns:
        tuple[Tensor, dict]: (loss, log_vars), loss is the loss tensor
            which may be a weighted sum of all losses, log_vars contains
            all the variables to be sent to the logger.
    """
    log_vars = OrderedDict()
    for loss_name, loss_value in losses.items():
        if loss_value is None:  # HERE
            log_vars[loss_name] = None
        elif isinstance(loss_value, torch.Tensor):
            log_vars[loss_name] = loss_value.mean()
        elif isinstance(loss_value, list):
            log_vars[loss_name] = sum(_loss.mean() for _loss in loss_value)
        else:
            raise TypeError(
                f'{loss_name} is not a tensor or list of tensors')

    loss = sum(_value for _key, _value in log_vars.items()
               if 'loss' in _key)

    # HERE
    # If log_vars has different lengths on different GPUs, they will wait infinitely
    if dist.is_available() and dist.is_initialized():
        log_var_length = torch.tensor(len(log_vars), device=loss.device)
        dist.all_reduce(log_var_length)
        assert log_var_length == len(log_vars) * dist.get_world_size(), \
            'loss log variables are different across GPUs!'

    log_vars['loss'] = loss
    for loss_name, loss_value in log_vars.items():
        # reduce loss when distributed training
        # HERE
        if dist.is_available() and dist.is_initialized():
            if loss_value is not None:
                sync_value = torch.tensor(
                    [loss_value.data.item(), 1.], device=loss.device)
            else:
                sync_value = torch.tensor([0., 0.], device=loss.device)
            dist.all_reduce(sync_value)
            # calculate mean if loss_value in some GPUs are not None
            if not sync_value[1] == 0.:
                log_vars[loss_name] = sync_value[0] / sync_value[1]

    log_vars = {k: v.item() for k, v in log_vars.items() if v is not None}

    return loss, log_vars
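
To make the averaging step concrete, here is a small hypothetical two-GPU walkthrough of the [value, count] bookkeeping above (the numbers are made up):

# rank 0: losses['acc'] is None  -> sync_value = [0.0, 0.0]
# rank 1: losses['acc'] == 0.8   -> sync_value = [0.8, 1.0]
# after dist.all_reduce (sum)    -> sync_value = [0.8, 1.0]
# logged 'acc' = 0.8 / 1.0 = 0.8, i.e. the mean over the ranks that actually
# produced the key; ranks where it is None contribute nothing to either sum.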

@fingertap (Contributor, Author)

However, I did not include my fix for None log_vars in this PR, as I think that is more of a new feature than a bug fix. This PR is about the bug where GPUs hang without any useful error trace. If you decide to include this fix in other PRs, I suggest including the None-value handling as well.

@Junjun2016 (Collaborator)

Hi @fingertap
Thanks for your nice suggestions, we will discuss more about it.
Please merge the master to fix the failed CI.

@Junjun2016 (Collaborator)

@hhaAndroid
Could you please give some comments?

@fingertap (Contributor, Author)

Hi @fingertap Thanks for your nice suggestions, we will discuss more about it. Please merge the master to fix the failed CI.

The merge is blocked because the test_mit test case failed, which I think has nothing to do with my fix.

@hhaAndroid

@hhaAndroid Could you please give some comments?

Done

@Junjun2016 (Collaborator) commented Nov 15, 2021 via email

@Junjun2016 (Collaborator)

Hi @fingertap Thanks for your nice suggestions, we will discuss more about it. Please merge the master to fix the failed CI.

The merge is blocked because the test_mit test case failed, which I think has nothing to do with my fix.

The BC-breaking issue has been fixed on the master branch, so you can merge master into your branch.

@Junjun2016 changed the title from "[#1034] fix dist training infinite waiting issue" to "[Fix] Fix dist training infinite waiting issue" on Nov 15, 2021
@codecov bot commented Nov 15, 2021

Codecov Report

Attention: Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.

Project coverage is 89.48%. Comparing base (6b3e550) to head (84858fc).
Report is 300 commits behind head on master.

Files Patch % Lines
mmseg/models/segmentors/base.py 0.00% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1035      +/-   ##
==========================================
- Coverage   89.55%   89.48%   -0.07%     
==========================================
  Files         119      119              
  Lines        6626     6631       +5     
  Branches     1034     1035       +1     
==========================================
  Hits         5934     5934              
- Misses        488      492       +4     
- Partials      204      205       +1     
Flag Coverage Δ
unittests 89.48% <0.00%> (-0.07%) ⬇️


@fingertap (Contributor, Author)

Fixed the linting issue and pulled the upstream master.

@Junjun2016 requested a review from xvjiarui on November 16, 2021, 12:48
@xvjiarui (Collaborator)

Hi @fingertap
Sorry for the late reply. I will review it shortly.

@Junjun2016 (Collaborator)

Hi @hhaAndroid
Could you please provide the link of the same PR in mmdet and review it again?

@hhaAndroid

Hi @hhaAndroid Could you please provide the link of the same PR in mmdet and review it again?

open-mmlab/mmdetection#6501

@Junjun2016 merged commit f8ed148 into open-mmlab:master on Dec 8, 2021
bowenroom pushed a commit to bowenroom/mmsegmentation that referenced this pull request Feb 25, 2022
* [open-mmlab#1034] fix dist training infinite waiting issue

* print log_vars keys in assertion msg

* linting issue