Distributed training hangs due to missing keys in mmseg.segmentors.base.BaseSegmentor._parse_losses #1030

Closed
fingertap opened this issue Nov 10, 2021 · 19 comments

@fingertap
Contributor

When training on multiple GPUs, my customized model gets stuck. When training on only one GPU, it works fine. Ctrl+C gives me the following stack trace:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launcher/api.py", line 238, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/agent/server/api.py", line 700, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/agent/server/api.py", line 828, in _invoke_run
    time.sleep(monitor_interval)
KeyboardInterrupt

I cannot find much useful information online. Any advice on how to debug this further?

Environment:

------------------------------------------------------------
sys.platform: linux
Python: 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.9.0+cu102
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.10.0+cu102
OpenCV: 4.5.3
MMCV: 1.3.14
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMSegmentation: 0.18.0+ef68770
------------------------------------------------------------
@fingertap
Contributor Author

Distributed training with a DeepLabv3 config works fine, so I think it has something to do with my model. My workload differs from batch to batch: some batches take a bit more time than others depending on the data itself. The code is complicated, so I cannot give you a simple example to reproduce the issue.

Any idea about how to debug this? Thanks in advance!

@MengzhangLI
Contributor

Hi, based on your limited description I cannot give concrete suggestions.

I think it is caused by your customized model setup.

@MengzhangLI MengzhangLI self-assigned this Nov 10, 2021
@fingertap
Contributor Author

Basically, the model does the following:

  1. feature extraction with the backbone
  2. use this feature map for a classification task
  3. FPN neck
  4. RoI Align (using the SingleRoIExtractor from mmdet)
  5. predict a binary variable x for each of the RoIs
  6. if the label y of x is 1, do segmentation on that RoI feature

Everything works fine on a single GPU. When I start distributed training, it first raises an error that some parameters are unused. I then multiply all outputs of the forward function by zero and add them to the final loss; that gets rid of the unused-parameter error, but training gets stuck at backprop when one of the GPUs encounters the case where y equals 0 for all RoIs.
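
A minimal, self-contained sketch of the "multiply by zero" workaround described above (the module and tensor names are hypothetical, not the actual model): every branch output contributes to the returned losses, so DDP can find a gradient for every parameter on every rank.

import torch
import torch.nn as nn

class ToyRoIModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.roi_head = nn.Linear(8, 1)   # step 5: binary prediction per RoI
        self.seg_head = nn.Linear(8, 4)   # step 6: segmentation branch

    def forward(self, roi_feats, y):
        # roi_feats: (N, 8) RoI features, y: (N,) binary labels
        roi_pred = self.roi_head(roi_feats).squeeze(-1)
        loss_roi = nn.functional.binary_cross_entropy_with_logits(
            roi_pred, y.float())
        seg_pred = self.seg_head(roi_feats)
        if y.any():
            loss_seg = seg_pred.mean()        # stand-in for the real seg loss
        else:
            # zero-weighted term keeps seg_head in the autograd graph
            loss_seg = 0.0 * seg_pred.sum()
        return dict(loss_roi=loss_roi, loss_seg=loss_seg)

# usage: ToyRoIModel()(torch.randn(4, 8), torch.tensor([0, 0, 0, 0]))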

@fingertap
Contributor Author

fingertap commented Nov 10, 2021

At the beginning, I wrote something like

if y.any():
    my_seg_head.forward_train(x, y)

I thought this might cause a case where one GPU enters this if branch while another doesn't, resulting in different running times and a cross-GPU communication timeout (is this correct?). So I removed the if, predicted the segmentation masks for all data, and zeroed out the undesired terms. However, the error remains.
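
A sketch of that branch-free variant (the function and tensor names are hypothetical): run the segmentation head on every RoI, then zero out the per-RoI loss terms whose label y is 0, so all ranks execute the same computation every iteration.

import torch
import torch.nn.functional as F

def seg_loss_all_rois(seg_logits, gt_masks, y):
    # seg_logits: (N, C, H, W), gt_masks: (N, H, W) long, y: (N,) binary labels
    per_roi = F.cross_entropy(seg_logits, gt_masks,
                              reduction='none').mean(dim=(1, 2))  # (N,)
    weight = y.float()                        # 1 keeps the term, 0 removes it
    # clamp avoids dividing by zero when no RoI is labelled positive
    return (per_roi * weight).sum() / weight.sum().clamp(min=1.0)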

This may be a PyTorch issue, since mmcv's MMDistributedDataParallel inherits from PyTorch's DistributedDataParallel, but my code is based on mmseg and mmdet, so I thought I might get help here. Thanks!

@fingertap
Contributor Author

fingertap commented Nov 10, 2021

Some logs (I added the === lines manually):

2021-11-10 15:40:14,378 - mmseg - INFO - workflow: [('train', 1)], max: 40000 iters
mask.any()  tensor(True, device='cuda:0')
Finish forward_train.
mask.any()  tensor(True, device='cuda:1')
Finish forward_train.
=====Iter 1========
2021-11-10 15:40:35,835 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
mask.any()  tensor(True, device='cuda:1')
Finish forward_train.
mask.any()  tensor(True, device='cuda:0')
Finish forward_train.
=====Iter 2========
mask.any()  tensor(True, device='cuda:1')
Finish forward_train.
mask.any()  tensor(True, device='cuda:0')
Finish forward_train.
=====Iter 3========
mask.any()  tensor(True, device='cuda:0')
Finish forward_train.
mask.any()  tensor(True, device='cuda:1')
Finish forward_train.
=====Iter 4========
mask.any()  tensor(False, device='cuda:1')
Finish forward_train.
mask.any()  tensor(True, device='cuda:0')
Finish forward_train.
=====Iter 5========

We can see that the forward_train pass finishes on both GPUs. I also tried downgrading PyTorch to 1.8, with no luck.

@fingertap
Contributor Author

Now I have found the reason: I should always keep the computation workflow identical across GPUs, including the accuracy terms. I have code similar to the following:

if mask.any():
    loss['roi_acc'] = roi_acc(feats)

The reason for this is that, for images where mask.any() is False, the accuracy would be zero, which would skew the logged accuracy. So I chose not to log it in that case, since the accuracy is not used in backprop. However, if some GPU does not have this key, all GPUs will hang. I think this is kind of a bug in mmcv.
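
A sketch of the quick workaround (roi_acc, feats and mask are the names from the snippet above; the helper itself is hypothetical): always emit the key, so every rank produces the same set of log_vars and the all_reduce calls stay matched, at the cost of slightly skewing the logged accuracy.

def add_roi_acc(loss, feats, mask, roi_acc_fn):
    if mask.any():
        loss['roi_acc'] = roi_acc_fn(feats)
    else:
        # same key on every rank; a zero scalar that plays no role in backprop
        loss['roi_acc'] = feats.new_zeros(())
    return loss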

@fingertap
Contributor Author

fingertap commented Nov 12, 2021

@MengzhangLI
In the _parse_losses function of mmseg.segmentors.base.BaseSegmentor, the loss values are synchronized across all GPUs. What happens is that in this loop (line 194):

for loss_name, loss_value in log_vars.items():
    # reduce loss when distributed training
    if dist.is_available() and dist.is_initialized():
        loss_value = loss_value.data.clone()
        dist.all_reduce(loss_value.div_(dist.get_world_size()))
    log_vars[loss_name] = loss_value.item()

Suppose GPU A does not have "roi_acc" as a loss_name (and suppose it is the last key in log_vars). GPU A then thinks it has done all its work and exits the loop. The other GPUs, which do have the final "roi_acc" key, call torch.distributed.all_reduce on it and wait indefinitely for a reply from GPU A.
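
As a debugging aid (not mmseg code, just a hedged sketch built on torch.distributed primitives): each rank can compare its log_vars keys with every other rank's before the reduction loop, which turns the silent hang into an immediate, readable error.

import torch.distributed as dist

def assert_same_keys(log_vars):
    if not (dist.is_available() and dist.is_initialized()):
        return
    keys = sorted(log_vars.keys())
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, keys)  # collective: every rank must call it
    if any(k != keys for k in gathered):
        raise RuntimeError(f'log_vars keys differ across ranks: {gathered}')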

A quick fix is to delete this roi_acc or set it to zero when the data is unavailable. A better fix would be for _parse_losses to divide by a runtime counter instead of dist.get_world_size(), with the counters themselves not being treated as metric variables:

for loss_name, loss_value in list(log_vars.items()):  # copy: counter keys get deleted
    # note: assumes each metric key (e.g. roi_acc) comes before its
    # matching *_dist_counter key in log_vars
    if loss_name.endswith('_dist_counter'):  # e.g. roi_acc_dist_counter -> roi_acc
        if dist.is_available() and dist.is_initialized():
            dist_count = loss_value.data.clone()
            dist.all_reduce(dist_count)
            key = loss_name.replace('_dist_counter', '')
            # rescale the already-reduced mean by world_size / actual count
            log_vars[key] *= dist.get_world_size() / dist_count.item()
        del log_vars[loss_name]
    else:
        # reduce loss when distributed training
        if dist.is_available() and dist.is_initialized():
            loss_value = loss_value.data.clone()
            dist.all_reduce(loss_value.div_(dist.get_world_size()))
        log_vars[loss_name] = loss_value.item()

GPUs without data for "roi_acc" would simply set it (and its counter) to zero, or use a defaultdict, as in the sketch below.
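
A sketch of the model side under this proposal (the roi_acc / roi_acc_dist_counter key names follow the convention above; the helper itself is hypothetical): every rank always emits both keys, and the counter records whether the metric was actually computed on that rank.

def add_optional_metric(loss, feats, mask, roi_acc_fn):
    if mask.any():
        loss['roi_acc'] = roi_acc_fn(feats)
        loss['roi_acc_dist_counter'] = feats.new_ones(())
    else:
        loss['roi_acc'] = feats.new_zeros(())
        loss['roi_acc_dist_counter'] = feats.new_zeros(())
    return loss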

@fingertap
Contributor Author

This solution is not perfect: when no batch on any GPU has data for "roi_acc", the counter is zero and the logged value becomes NaN.
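
One possible guard against that NaN (a sketch against the counter proposal above, not mmseg code): only rescale when at least one rank actually computed the metric, otherwise drop the key from the log.

def rescale_or_drop(log_vars, key, dist_count, world_size):
    count = dist_count.item()
    if count > 0:
        log_vars[key] *= world_size / count
    else:
        log_vars.pop(key, None)  # no rank had data for this metric this iteration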

@fingertap
Contributor Author

@MengzhangLI Can I pass None in log_vars to skip this item?

@fingertap changed the title from "Distributed training hangs" to "Distributed training hangs due to missing keys in mmseg.segmentors.base.BaseSegmentor._parse_losses" on Nov 12, 2021
@MengzhangLI
Contributor

MengzhangLI commented Nov 12, 2021

Hi, sorry for the late reply.

Very happy to see you have basically fixed your problem. Hope you like our codebase.

You can try renaming the loss_name, because the prefix must be loss_; see the bottom of here.

@MengzhangLI
Contributor

> @MengzhangLI Can I pass None in log_vars to skip this item?

Frankly, I am not sure. Could you give it a try? Hoping to get your feedback!

@fingertap
Contributor Author

I know that if a key has "loss" in its loss_name, it will be involved in backprop. However, I think mmseg lacks a feature to support the case where some GPU does not provide a particular loss (or provides None for it).

@fingertap
Contributor Author

> @MengzhangLI Can I pass None in log_vars to skip this item?
>
> Frankly, I am not sure. Could you give it a try? Hoping to get your feedback!

Actually, I can do this. Maybe when I have time I'll create a PR. Closing this.

@MengzhangLI
Contributor

> @MengzhangLI Can I pass None in log_vars to skip this item?
>
> Frankly, I am not sure. Could you give it a try? Hoping to get your feedback!
>
> Actually, I can do this. Maybe when I have time I'll create a PR. Closing this.

That would be very cool, can't wait to work together with you.

Best,

@MengzhangLI
Contributor

Also, I have noted this potential bug and we will try to fix it. Thank you very much for your excellent issue.

@fingertap
Contributor Author

I opened another issue, #1034, reporting this bug, and I will also try to provide a fix for it.

@fingertap
Contributor Author

> Hi, sorry for the late reply.
>
> Very happy to see you have basically fixed your problem. Hope you like our codebase.
>
> You can try renaming the loss_name, because the prefix must be loss_; see the bottom of here.

Also, there is an error in your tutorial here. You said that only losses with the loss_ prefix will be involved in gradient backprop, yet that is not the case. Your code:

loss = sum(_value for _key, _value in log_vars.items()
           if 'loss' in _key)

suggests that any loss with "loss" anywhere in its key will get involved in the backprop process ;-)
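
A quick illustration of that point (key names made up): with the check if 'loss' in _key, any key that merely contains the substring "loss" is summed into the total, whether or not it has the loss_ prefix.

log_vars = {
    'loss_seg': 1.0,        # included (loss_ prefix)
    'decode.loss_ce': 0.5,  # included ("loss" appears in the key)
    'aux_loss': 0.2,        # included, even without the loss_ prefix
    'roi_acc': 0.9,         # excluded (no "loss" in the key)
}
total = sum(v for k, v in log_vars.items() if 'loss' in k)  # 1.7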

@MengzhangLI
Contributor

You are right. Sorry for my misleading statement.

@fingertap
Contributor Author

> @MengzhangLI Can I pass None in log_vars to skip this item?
>
> Frankly, I am not sure. Could you give it a try? Hoping to get your feedback!

I fixed this too. Should I also create a PR for this feature? I have already pushed a PR to fix the infinite waiting.
