
multi-gpu is unstable? #51

Closed
@blandocs

Description

If you do not know the root cause of the problem/bug and wish someone to help you, please include:

To Reproduce

  1. what changes you made / what code you wrote

In tools/train_net.py, I add a new dataset registration at the beginning of the main function.

from detectron2.data.datasets import register_coco_instances

def main(args):
    register_coco_instances("moda", {}, "moda.json", "datasets/moda/images")

You can download moda.json here.
You can also download a subset of the moda images here.
The full image set is here (not recommended due to its large size).
A quick sanity check of the registered annotations is sketched after the config below.

configs/modanet.yaml is as follows:

_BASE_: "./Base-RCNN-FPN.yaml"
MODEL:
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
  # WEIGHTS: "detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl"  # initialize from model zoo
  MASK_ON: True
  RESNETS:
    DEPTH: 50
  ROI_HEADS:
    NUM_CLASSES: 13
DATASETS:
  TRAIN: ("moda",)
  TEST: ()
DATALOADER:
  ASPECT_RATIO_GROUPING: False
  # NUM_WORKERS: 4
SOLVER:
  IMS_PER_BATCH: 20
  BASE_LR: 0.01
  STEPS: (60000, 80000)
  MAX_ITER: 90000
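
Since the assertions in the logs below complain about invalid boxes, a quick way to rule out degenerate ground-truth boxes in moda.json is to scan the registered annotations. This is only a debugging sketch, assuming the same registration call and paths as above and standard COCO-format boxes:

from detectron2.data import DatasetCatalog
from detectron2.data.datasets import register_coco_instances

# Same registration call and paths as in main() above.
register_coco_instances("moda", {}, "moda.json", "datasets/moda/images")

num_bad = 0
for record in DatasetCatalog.get("moda"):  # loads and converts moda.json
    for ann in record["annotations"]:
        # COCO-format boxes are [x, y, width, height]; width and height must be positive.
        x, y, w, h = ann["bbox"]
        if w <= 0 or h <= 0:
            num_bad += 1
            print("degenerate box in", record["file_name"], ann["bbox"])
print("degenerate boxes found:", num_bad)

If this reports zero degenerate boxes, the data itself is probably not what trips the assertions.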

  2. what command you run

python tools/train_net.py --num-gpus 4 --config-file configs/modanet.yaml

  3. what you observed (full logs are preferred)

With a single GPU it always works fine, but when I try to use multiple GPUs, several bugs occur at random. Only rarely does multi-GPU training work. What's wrong with it?

The first bug is in Box2BoxTransform. When I debugged it, some anchor widths were not greater than 0.

[10/14 15:39:39 d2.engine.train_loop]: Starting training from iteration 0
Traceback (most recent call last):
  File "modanet.py", line 162, in <module>
    args=(args,),
  File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.
py", line 171, in spawn
    while not spawn_context.join():
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.
py", line 118, in join
    raise Exception(msg)
Exception:


-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.
py", line 19, in _wrap
    fn(i, *args)
  File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
    main_func(*args)
  File "/SSD/hyunsu/detectron2/modanet.py", line 146, in main
    return trainer.train()
  File "/SSD/hyunsu/detectron2/detectron2/engine/defaults.py", line 329, in train
    super().train(self.start_iter, self.max_iter)
  File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 212, in run_step
    loss_dict = self.model(data)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py",
line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/parallel/distribute
d.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py",
line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 82, in forward
    proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py",
line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 161, in forward
    losses = {k: v * self.loss_weight for k, v in outputs.losses().items()}
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 316, in l
osses
    gt_objectness_logits, gt_anchor_deltas = self._get_ground_truth()
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 283, in _
get_ground_truth
    anchors_i.tensor, matched_gt_boxes.tensor
  File "/SSD/hyunsu/detectron2/detectron2/modeling/box_regression.py", line 63, in get_deltas
    assert (src_widths > 0).all().item(), "Input boxes to Box2BoxTransform are not valid!"
AssertionError: Input boxes to Box2BoxTransform are not valid!
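
For context on what that assertion checks: Box2BoxTransform.get_deltas requires every source (anchor) box to have a strictly positive width. The snippet below is only an illustration with made-up box values, reproducing the same assertion in isolation:

import torch
from detectron2.modeling.box_regression import Box2BoxTransform

transform = Box2BoxTransform(weights=(1.0, 1.0, 1.0, 1.0))

# Valid XYXY boxes (x2 > x1 and y2 > y1) produce regression deltas as expected.
src = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
tgt = torch.tensor([[1.0, 1.0, 11.0, 11.0]])
print(transform.get_deltas(src, tgt))

# A zero-width source box (x2 == x1) trips the assertion seen in the traceback.
bad_src = torch.tensor([[5.0, 0.0, 5.0, 10.0]])
transform.get_deltas(bad_src, tgt)  # AssertionError: Input boxes to Box2BoxTransform are not valid!

Anchors are normally generated on a fixed grid with positive sizes, so it is surprising that this check fails, and only intermittently with multiple GPUs.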

The second bug is as follows:


Traceback (most recent call last):
  File "modanet.py", line 162, in <module>
    args=(args,),
  File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:


-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
    main_func(*args)
  File "/SSD/hyunsu/detectron2/modanet.py", line 146, in main
    return trainer.train()
  File "/SSD/hyunsu/detectron2/detectron2/engine/defaults.py", line 329, in train
    super().train(self.start_iter, self.max_iter)
  File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 212, in run_step
    loss_dict = self.model(data)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 82, in forward
    proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 143, in forward
    anchors = self.anchor_generator(features)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/anchor_generator.py", line 181, in forward
    anchors_over_all_feature_maps = self.grid_anchors(grid_sizes)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/anchor_generator.py", line 124, in grid_anchors
    shift_x, shift_y = _create_grid_offsets(size, stride, base_anchors.device)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/anchor_generator.py", line 43, in _create_grid_offsets
    shifts_x = torch.arange(0, grid_width * stride, step=stride, dtype=torch.float32, device=device)
RuntimeError: tabulate: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

The third bug is as follows:

[10/14 15:42:59 d2.engine.train_loop]: Starting training from iteration 0
Traceback (most recent call last):
  File "modanet.py", line 162, in <module>
    args=(args,),
  File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:


-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
    main_func(*args)
  File "/SSD/hyunsu/detectron2/modanet.py", line 146, in main
    return trainer.train()
  File "/SSD/hyunsu/detectron2/detectron2/engine/defaults.py", line 329, in train
    super().train(self.start_iter, self.max_iter)
  File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 212, in run_step
    loss_dict = self.model(data)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 82, in forward
    proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 161, in forward
    losses = {k: v * self.loss_weight for k, v in outputs.losses().items()}
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 316, in losses
    gt_objectness_logits, gt_anchor_deltas = self._get_ground_truth()
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 268, in _get_ground_truth
    matched_idxs, gt_objectness_logits_i = self.anchor_matcher(match_quality_matrix)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/matcher.py", line 78, in __call__
    assert torch.all(match_quality_matrix >= 0)
AssertionError

When training does run, the memory usage across the GPUs is unbalanced, as shown below:

(detectron2) root@b06e1b5c1ffb:/SSD/hyunsu/detectron2# nvidia-smi
Sun Oct 13 16:49:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            On   | 00000000:83:00.0 Off |                  N/A |
| 39%   65C    P2   206W / 250W |  11539MiB / 12196MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            On   | 00000000:84:00.0 Off |                  N/A |
| 40%   65C    P2   206W / 250W |   8375MiB / 12196MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            On   | 00000000:87:00.0 Off |                  N/A |
| 47%   75C    P2   236W / 250W |  11089MiB / 12196MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            On   | 00000000:88:00.0 Off |                  N/A |
| 49%   79C    P2   257W / 250W |   8409MiB / 12196MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Expected behavior

It should work with both a single GPU and multiple GPUs, but it feels quite unstable when I use multiple GPUs.

Environment

Please paste the output of python -m detectron2.utils.collect_env.

(detectron2) root@b06e1b5c1ffb:/SSD/hyunsu/detectron2# python -m detectron2.utils.collect_env .
---------------------  --------------------------------------------------
Python                 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
Detectron2 Compiler    GCC 5.4
DETECTRON2_ENV_MODULE  <not set>
PyTorch                1.3.0
PyTorch Debug Build    False
CUDA available         True
GPU 0,1,2,3            TITAN Xp
Pillow                 6.2.0
cv2                    4.1.1
---------------------  --------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_50,code=compute_50
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
