multi-gpu is unstable? #51

Closed
blandocs opened this issue Oct 13, 2019 · 6 comments
@blandocs

blandocs commented Oct 13, 2019

If you do not know the root cause of the problem / bug, and wish someone to help you, please include:

To Reproduce

  1. what changes you made / what code you wrote

In tools/train_net.py, I added a new dataset registration at the beginning of the main function.

from detectron2.data.datasets import register_coco_instances

def main(args):
    register_coco_instances("moda", {}, "moda.json", "datasets/moda/images")

You can download moda.json here.
You can also download a subset of the moda images here.
The full images are here (not recommended due to their large size).

configs/modanet.yaml is as follows:

_BASE_: "./Base-RCNN-FPN.yaml"
MODEL:
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
  # WEIGHTS: "detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl"  # initialize from model zoo
  MASK_ON: True
  RESNETS:
    DEPTH: 50
  ROI_HEADS:
    NUM_CLASSES: 13
DATASETS:
  TRAIN: ("moda",)
  TEST: ()
DATALOADER:
  ASPECT_RATIO_GROUPING: False
  # NUM_WORKERS: 4
SOLVER:
  IMS_PER_BATCH: 20
  BASE_LR: 0.01
  STEPS: (60000, 80000)
  MAX_ITER: 90000
  2. what command you run

python tools/train_net.py --num-gpus 4 --config-file configs/modanet.yaml

  3. what you observed (full logs are preferred)

When I use a single GPU, it always works fine. But when I try to use multiple GPUs, several bugs occur randomly; only rarely does multi-GPU training work. What's wrong with it?

The first bug is about Box2BoxTransform. When I debugged it, the anchor width was less than 0.

[10/14 15:39:39 d2.engine.train_loop]: Starting training from iteration 0
Traceback (most recent call last):
  File "modanet.py", line 162, in <module>
    args=(args,),
  File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.
py", line 171, in spawn
    while not spawn_context.join():
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.
py", line 118, in join
    raise Exception(msg)
Exception:


-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.
py", line 19, in _wrap
    fn(i, *args)
  File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
    main_func(*args)
  File "/SSD/hyunsu/detectron2/modanet.py", line 146, in main
    return trainer.train()
  File "/SSD/hyunsu/detectron2/detectron2/engine/defaults.py", line 329, in train
    super().train(self.start_iter, self.max_iter)
  File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 212, in run_step
    loss_dict = self.model(data)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py",
line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/parallel/distribute
d.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py",
line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 82, in forward
    proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py",
line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 161, in forward
    losses = {k: v * self.loss_weight for k, v in outputs.losses().items()}
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 316, in l
osses
    gt_objectness_logits, gt_anchor_deltas = self._get_ground_truth()
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 283, in _
get_ground_truth
    anchors_i.tensor, matched_gt_boxes.tensor
  File "/SSD/hyunsu/detectron2/detectron2/modeling/box_regression.py", line 63, in get_deltas
    assert (src_widths > 0).all().item(), "Input boxes to Box2BoxTransform are not valid!"
AssertionError: Input boxes to Box2BoxTransform are not valid!

The second bug is as follows:


Traceback (most recent call last):
  File "modanet.py", line 162, in <module>
    args=(args,),
  File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:


-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
    main_func(*args)
  File "/SSD/hyunsu/detectron2/modanet.py", line 146, in main
    return trainer.train()
  File "/SSD/hyunsu/detectron2/detectron2/engine/defaults.py", line 329, in train
    super().train(self.start_iter, self.max_iter)
  File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 212, in run_step
    loss_dict = self.model(data)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 82, in forward
    proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 143, in forward
    anchors = self.anchor_generator(features)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/anchor_generator.py", line 181, in forward
    anchors_over_all_feature_maps = self.grid_anchors(grid_sizes)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/anchor_generator.py", line 124, in grid_anchors
    shift_x, shift_y = _create_grid_offsets(size, stride, base_anchors.device)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/anchor_generator.py", line 43, in _create_grid_offsets
    shifts_x = torch.arange(0, grid_width * stride, step=stride, dtype=torch.float32, device=device)
RuntimeError: tabulate: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

The third bug is as follows:

[10/14 15:42:59 d2.engine.train_loop]: Starting training from iteration 0
Traceback (most recent call last):
  File "modanet.py", line 162, in <module>
    args=(args,),
  File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:


-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/SSD/hyunsu/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
    main_func(*args)
  File "/SSD/hyunsu/detectron2/modanet.py", line 146, in main
    return trainer.train()
  File "/SSD/hyunsu/detectron2/detectron2/engine/defaults.py", line 329, in train
    super().train(self.start_iter, self.max_iter)
  File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/SSD/hyunsu/detectron2/detectron2/engine/train_loop.py", line 212, in run_step
    loss_dict = self.model(data)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 82, in forward
    proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
  File "/home/user/miniconda/envs/detectron2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn.py", line 161, in forward
    losses = {k: v * self.loss_weight for k, v in outputs.losses().items()}
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 316, in losses
    gt_objectness_logits, gt_anchor_deltas = self._get_ground_truth()
  File "/SSD/hyunsu/detectron2/detectron2/modeling/proposal_generator/rpn_outputs.py", line 268, in _get_ground_truth
    matched_idxs, gt_objectness_logits_i = self.anchor_matcher(match_quality_matrix)
  File "/SSD/hyunsu/detectron2/detectron2/modeling/matcher.py", line 78, in __call__
    assert torch.all(match_quality_matrix >= 0)
AssertionError

When training does work, the GPU memory usage is unbalanced across the GPUs, as follows:

(detectron2) root@b06e1b5c1ffb:/SSD/hyunsu/detectron2# nvidia-smi
Sun Oct 13 16:49:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            On   | 00000000:83:00.0 Off |                  N/A |
| 39%   65C    P2   206W / 250W |  11539MiB / 12196MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            On   | 00000000:84:00.0 Off |                  N/A |
| 40%   65C    P2   206W / 250W |   8375MiB / 12196MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            On   | 00000000:87:00.0 Off |                  N/A |
| 47%   75C    P2   236W / 250W |  11089MiB / 12196MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            On   | 00000000:88:00.0 Off |                  N/A |
| 49%   79C    P2   257W / 250W |   8409MiB / 12196MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Expected behavior

Training should work on both single and multiple GPUs, but it feels quite unstable when I use multiple GPUs.

Environment

Please paste the output of python -m detectron2.utils.collect_env.

/home/user/miniconda/envs/detectron2/bin/python: Error while finding module specification for 'detectron2.utils.collect_env.' (ModuleNotFoundError: __path__ attribute not found on 'detectron2.utils.collect_env' while trying to find 'detectron2.utils.collect_env.')
(detectron2) root@b06e1b5c1ffb:/SSD/hyunsu/detectron2# python -m detectron2.utils.collect_env .
---------------------  --------------------------------------------------
Python                 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
Detectron2 Compiler    GCC 5.4
DETECTRON2_ENV_MODULE  <not set>
PyTorch                1.3.0
PyTorch Debug Build    False
CUDA available         True
GPU 0,1,2,3            TITAN Xp
Pillow                 6.2.0
cv2                    4.1.1
---------------------  --------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_50,code=compute_50
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

@ppwwyyxx
Contributor

I suggest you first try using the existing COCO configs on COCO dataset, to see whether there is anything special in your dataset that can cause the issue. This would make it much easier to isolate the potential causes.

The error messages you saw do indicate that the anchors may have a non-positive size. However, the way the anchors are generated (in anchor_generator.py, then rpn.py, then rpn_outputs.py) should guarantee a positive size. You may need to trace their generation to see how they can become non-positive.
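
As a quick sanity check on the dataset side, here is a minimal sketch (not from the original thread) that flags COCO-style annotations whose boxes have non-positive width or height, assuming moda.json uses standard XYWH_ABS boxes:

import json

# Minimal sketch: report annotations with non-positive box width or height.
# Assumes a COCO-style JSON with "bbox": [x, y, w, h] in absolute pixels.
with open("moda.json") as f:
    coco = json.load(f)

bad = []
for ann in coco.get("annotations", []):
    x, y, w, h = ann["bbox"]
    if w <= 0 or h <= 0:
        bad.append(ann["id"])

print(f"{len(bad)} annotations with non-positive width/height: {bad[:20]}")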

@blandocs
Author

blandocs commented Oct 14, 2019

Hi there. Thank you for the reply! I looked at the dataset carefully, but I can't find any differences between the original COCO dataset and my dataset. I describe both below. Additionally, the bugs come up randomly: when I train, the three bugs I mentioned above occur randomly, and training sometimes suddenly succeeds after several attempts. It's so weird.

The COCO dataset is as follows:

	"images": [{
		"license": 4,
		"file_name": "000000397133.jpg",
		"coco_url": "http://images.cocodataset.org/val2017/000000397133.jpg",
		"height": 427,
		"width": 640,
		"date_captured": "2013-11-14 17:02:52",
		"flickr_url": "http://farm7.staticflickr.com/6116/6255196340_da26cf2c9e_z.jpg",
		"id": 397133
	},

	"annotations": [{
		"segmentation": [
			[510.66, 423.01, 511.72, 420.03, . . . , 510.03, 423.01, 510.45, 423.01]
		],
		"area": 702.1057499999998,
		"iscrowd": 0,
		"image_id": 289343,
		"bbox": [473.07, 395.93, 38.65, 28.67],
		"category_id": 18,
		"id": 1768
	}, {
		"segmentation": [

My dataset is as follows:

  "images": [
    {
      "file_name": "0763229.jpg",
      "width": 400,
      "id": 763229,
      "license": 3,
      "height": 600
    },
"annotations": [
    {
      "segmentation": [
        [
          146,
          363,
          154,
          373,
           . . . 
          147,
          349,
          144,
          363
        ]
      ],
      "area": 28951,
      "iscrowd": 0,
      "image_id": 763229,
      "bbox": [
        122,
        168,
        131,
        221
      ],
      "category_id": 9,
      "id": 10
    },    {
      "segmentation": [

@blandocs
Author

blandocs commented Oct 23, 2019

After I added "bbox_mode": 1 to my JSON file, the above errors were resolved (my dataset has XYWH_ABS-type bounding boxes). But I think there are still race conditions, because the CUDA error sometimes occurs even though the code is the same.
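
For reference, "bbox_mode": 1 corresponds to detectron2's BoxMode.XYWH_ABS. A minimal sketch of the workaround described above (the output filename is just a placeholder):

import json

from detectron2.structures import BoxMode

# Illustrative sketch: tag every annotation in the COCO-style JSON with
# "bbox_mode": 1, i.e. BoxMode.XYWH_ABS.
with open("moda.json") as f:
    coco = json.load(f)

for ann in coco["annotations"]:
    ann["bbox_mode"] = int(BoxMode.XYWH_ABS)  # == 1

with open("moda_with_bbox_mode.json", "w") as f:
    json.dump(coco, f)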

@ghost

ghost commented Nov 12, 2019

"Input boxes to Box2BoxTransform are not valid!"
AssertionError: Input boxes to Box2BoxTransform are not valid!

hello, I have also met the problem.
"Input boxes to Box2BoxTransform are not valid!"
AssertionError: Input boxes to Box2BoxTransform are not valid!

I checked my bbox, they are valid, could you provide me how to solve the first bug ?

@defqoon

defqoon commented Feb 14, 2020

@zhoudongliang did you check that the number of classes in your config file is correct? I had the same bug and fixed it by setting cfg.MODEL.ROI_HEADS.NUM_CLASSES to the correct value for my Faster R-CNN model.
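
A minimal sketch of that fix (the config path and class count here are placeholders matching the modanet setup earlier in this thread):

from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("configs/modanet.yaml")  # example config path
# Must match the number of thing classes in the registered dataset,
# e.g. 13 for the moda/modanet setup above.
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 13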

@ChauncyFr

After I added "bbox_mode": 1 to my JSON file, the above errors were resolved (my dataset has XYWH_ABS-type bounding boxes). But I think there are still race conditions, because the CUDA error sometimes occurs even though the code is the same.

I set this field in my JSON file, but it does not work, whether the number of GPUs is 1 or 2. Can anyone help me? When I train, this error always occurs and training is interrupted, which is frustrating!
