ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) #1618
I am able to run PSPNet on the ADE20K dataset.
@xiaoachen98 Please help me solve this error.
@donglixp Please help me solve this error.
Based on your error log, I think something might be wrong with ignore_index.
@MeowZheng Can you tell me where exactly to look for ignore_index? I am not able to locate it.
@MeowZheng Gentle reminder about where to find ignore_index.
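For context: in mmsegmentation 0.x, ignore_index is an argument of BaseDecodeHead in mmseg/models/decode_heads/decode_head.py (default 255), it can be overridden per head from the config, and it is the same value used by the accuracy() call in the traceback below. The sketch here only shows where the key lives; the num_classes and ignore_index values are assumptions about a binary DUTS setup, not taken from the actual config file.

# Sketch of the relevant part of a custom config (keys follow mmsegmentation 0.x).
# num_classes=2 and ignore_index=255 are assumed values for a binary DUTS setup.
model = dict(
    decode_head=dict(
        num_classes=2,     # must cover every label value that is not ignore_index
        ignore_index=255,  # pixels with this label are skipped by loss and accuracy
    ),
    auxiliary_head=dict(
        num_classes=2,
        ignore_index=255,
    ),
)

A ground-truth label that is greater than or equal to num_classes and different from ignore_index is a common cause of exactly this kind of CUDA crash.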
I am trying to run the BEiT algorithm on the DUTS dataset with the following command:
tools/dist_train.sh configs/beit/upernet_beit-base_640x640_80k_duts_ms.py 1 --work-dir work_dirs/upernet_beit-base_640x640_80k_duts/ --deterministic
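Before the full log below, one quick check that often explains this error: every pixel value in the ground-truth maps should be either smaller than num_classes or equal to ignore_index. A minimal sketch of such a check follows; the annotation path, file pattern, and NUM_CLASSES value are assumptions about the DUTS setup, not taken from the config.

# Hypothetical label-range check; adjust the path and NUM_CLASSES to the real setup.
import glob

import numpy as np
from PIL import Image

NUM_CLASSES = 2     # assumed number of classes for DUTS in this config
IGNORE_INDEX = 255  # mmsegmentation's default ignore label

bad_files = {}
for path in sorted(glob.glob('data/DUTS/annotations/training/*.png'))[:200]:
    values = np.unique(np.array(Image.open(path)))
    invalid = [int(v) for v in values if v >= NUM_CLASSES and v != IGNORE_INDEX]
    if invalid:
        bad_files[path] = invalid

print(f'{len(bad_files)} file(s) with out-of-range labels')
for path, values in list(bad_files.items())[:10]:
    print(path, values)

If any file shows up here, the labels need to be remapped (or num_classes/ignore_index adjusted) before training.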
2022-05-26 12:51:23,588 - mmseg - INFO - Checkpoints will be saved to /mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/work_dirs/upernet_beit-base_640x640_80k_duts by HardDiskBackend.
2022-05-26 12:54:05,915 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/tools/train.py", line 240, in
main()
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/tools/train.py", line 229, in main
train_segmentor(
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/apis/train.py", line 191, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
iter_runner(iter_loaders[i], **kwargs)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/parallel/distributed.py", line 59, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/base.py", line 138, in train_step
losses = self(**data_batch)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
return old_func(*args, **kwargs)
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/base.py", line 108, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 143, in forward_train
loss_decode = self._decode_head_forward_train(x, img_metas,
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 86, in _decode_head_forward_train
loss_decode = self.decode_head.forward_train(x, img_metas,
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 204, in forward_train
losses = self.losses(seg_logits, gt_semantic_seg)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/fp16_utils.py", line 198, in new_func
return old_func(*args, **kwargs)
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 264, in losses
loss['acc_seg'] = accuracy(
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/losses/accuracy.py", line 49, in accuracy
correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1646755897462/work/c10/cuda/CUDACachingAllocator.cpp:1230 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f79ad35d1bd in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1f037 (0x7f79df9aa037 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x23a (0x7f79df9ae3ea in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2ecd68 (0x7f7a303a3d68 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f79ad343fb5 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x1db609 (0x7f7a30292609 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x4c671c (0x7f7a3057d71c in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f7a3057da22 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x13e79b (0x564387c4079b in /anaconda/envs/open-mmlab/bin/python)
frame #9: <unknown function> + 0x13de78 (0x564387c3fe78 in /anaconda/envs/open-mmlab/bin/python)
frame #10: <unknown function> + 0x13dd53 (0x564387c3fd53 in /anaconda/envs/open-mmlab/bin/python)
frame #11: <unknown function> + 0x13e0fc (0x564387c400fc in /anaconda/envs/open-mmlab/bin/python)
frame #12: <unknown function> + 0x13ec11 (0x564387c40c11 in /anaconda/envs/open-mmlab/bin/python)
frame #13: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #14: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #15: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #16: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #17: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #18: <unknown function> + 0x15673e (0x564387c5873e in /anaconda/envs/open-mmlab/bin/python)
frame #19: PyDict_SetItemString + 0x64 (0x564387ca0e04 in /anaconda/envs/open-mmlab/bin/python)
frame #20: <unknown function> + 0x28d46d (0x564387d8f46d in /anaconda/envs/open-mmlab/bin/python)
frame #21: Py_FinalizeEx + 0x175 (0x564387d8f9c5 in /anaconda/envs/open-mmlab/bin/python)
frame #22: Py_RunMain + 0x1af (0x564387d9440f in /anaconda/envs/open-mmlab/bin/python)
frame #23: Py_BytesMain + 0x39 (0x564387d947d9 in /anaconda/envs/open-mmlab/bin/python)
frame #24: __libc_start_main + 0xe7 (0x7f7a68a07bf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x2125d4 (0x564387d145d4 in /anaconda/envs/open-mmlab/bin/python)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 12206) of binary: /anaconda/envs/open-mmlab/bin/python
tools/dist_train.sh: line 19: 12194 Segmentation fault (core dumped) python -m torch.distributed.launch --nnodes=$NNODES --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --nproc_per_node=$GPUS --master_port=$PORT $(dirname "$0")/train.py $CONFIG --seed 0 --launcher pytorch ${@:3}
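As the log itself points out, CUDA errors can be reported asynchronously, so the accuracy() line in the stack trace is not necessarily the operation that actually failed; rerunning the same dist_train.sh command with CUDA_LAUNCH_BLOCKING=1 set in the environment makes the failing kernel report synchronously. Another way to get a readable error is to reproduce the loss computation on CPU, where an out-of-range target raises an ordinary Python exception instead of an illegal memory access. The snippet below is only an illustration of that effect with made-up shapes, not code from this repository.

# Illustration (not project code): a target value >= num_classes that is not
# ignore_index breaks cross_entropy. On CPU this raises a readable exception;
# on GPU it typically surfaces later, e.g. as a device-side assert or an
# illegal memory access reported at some unrelated call.
import torch
import torch.nn.functional as F

num_classes = 2
logits = torch.randn(1, num_classes, 4, 4)       # (N, C, H, W), as in segmentation
target = torch.zeros(1, 4, 4, dtype=torch.long)
target[0, 0, 0] = 3                              # out of range: 3 >= num_classes

try:
    F.cross_entropy(logits, target, ignore_index=255)
except Exception as exc:
    print(type(exc).__name__, ':', exc)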