ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) #1618
I am able to run PSPNet on the ADE20K dataset.
@xiaoachen98 Please help me solve this error.
@donglixp Please help me solve this error.
Based on your error log, I think something might be wrong with ignore_index.
@MeowZheng Can you tell me where exactly to look for ignore_index? I am not able to locate it.
@MeowZheng Gentle reminder about where to find ignore_index.
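For context: in mmsegmentation 0.x, ignore_index is an argument of BaseDecodeHead in mmseg/models/decode_heads/decode_head.py (default 255), it can be overridden per head from the config, and it is the same value used by the accuracy() call in the traceback below. The sketch here only shows where the key lives; the num_classes and ignore_index values are assumptions about a binary DUTS setup, not taken from the actual config file.

# Sketch of the relevant part of a custom config (keys follow mmsegmentation 0.x).
# num_classes=2 and ignore_index=255 are assumed values for a binary DUTS setup.
model = dict(
    decode_head=dict(
        num_classes=2,     # must cover every label value that is not ignore_index
        ignore_index=255,  # pixels with this label are skipped by loss and accuracy
    ),
    auxiliary_head=dict(
        num_classes=2,
        ignore_index=255,
    ),
)

A ground-truth label that is greater than or equal to num_classes and different from ignore_index is a common cause of exactly this kind of CUDA crash.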
I am trying to run the BEiT algorithm on the DUTS dataset with the following command:
tools/dist_train.sh configs/beit/upernet_beit-base_640x640_80k_duts_ms.py 1 --work-dir work_dirs/upernet_beit-base_640x640_80k_duts/ --deterministic
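Before the full log below, one quick check that often explains this error: every pixel value in the ground-truth maps should be either smaller than num_classes or equal to ignore_index. A minimal sketch of such a check follows; the annotation path, file pattern, and NUM_CLASSES value are assumptions about the DUTS setup, not taken from the config.

# Hypothetical label-range check; adjust the path and NUM_CLASSES to the real setup.
import glob

import numpy as np
from PIL import Image

NUM_CLASSES = 2     # assumed number of classes for DUTS in this config
IGNORE_INDEX = 255  # mmsegmentation's default ignore label

bad_files = {}
for path in sorted(glob.glob('data/DUTS/annotations/training/*.png'))[:200]:
    values = np.unique(np.array(Image.open(path)))
    invalid = [int(v) for v in values if v >= NUM_CLASSES and v != IGNORE_INDEX]
    if invalid:
        bad_files[path] = invalid

print(f'{len(bad_files)} file(s) with out-of-range labels')
for path, values in list(bad_files.items())[:10]:
    print(path, values)

If any file shows up here, the labels need to be remapped (or num_classes/ignore_index adjusted) before training.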
2022-05-26 12:51:23,588 - mmseg - INFO - Checkpoints will be saved to /mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/work_dirs/upernet_beit-base_640x640_80k_duts by HardDiskBackend.
2022-05-26 12:54:05,915 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/tools/train.py", line 240, in
main()
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/tools/train.py", line 229, in main
train_segmentor(
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/apis/train.py", line 191, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
iter_runner(iter_loaders[i], **kwargs)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/parallel/distributed.py", line 59, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/base.py", line 138, in train_step
losses = self(**data_batch)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
return old_func(*args, **kwargs)
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/base.py", line 108, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 143, in forward_train
loss_decode = self._decode_head_forward_train(x, img_metas,
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 86, in _decode_head_forward_train
loss_decode = self.decode_head.forward_train(x, img_metas,
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 204, in forward_train
losses = self.losses(seg_logits, gt_semantic_seg)
File "/anaconda/envs/open-mmlab/lib/python3.10/site-packages/mmcv/runner/fp16_utils.py", line 198, in new_func
return old_func(*args, **kwargs)
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 264, in losses
loss['acc_seg'] = accuracy(
File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/deepakpanda9/code/Users/deepakpanda/segmentation/mmsegmentation/mmseg/models/losses/accuracy.py", line 49, in accuracy
correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1646755897462/work/c10/cuda/CUDACachingAllocator.cpp:1230 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f79ad35d1bd in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1f037 (0x7f79df9aa037 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x23a (0x7f79df9ae3ea in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2ecd68 (0x7f7a303a3d68 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f79ad343fb5 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x1db609 (0x7f7a30292609 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x4c671c (0x7f7a3057d71c in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f7a3057da22 in /anaconda/envs/open-mmlab/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x13e79b (0x564387c4079b in /anaconda/envs/open-mmlab/bin/python)
frame #9: <unknown function> + 0x13de78 (0x564387c3fe78 in /anaconda/envs/open-mmlab/bin/python)
frame #10: <unknown function> + 0x13dd53 (0x564387c3fd53 in /anaconda/envs/open-mmlab/bin/python)
frame #11: <unknown function> + 0x13e0fc (0x564387c400fc in /anaconda/envs/open-mmlab/bin/python)
frame #12: <unknown function> + 0x13ec11 (0x564387c40c11 in /anaconda/envs/open-mmlab/bin/python)
frame #13: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #14: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #15: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #16: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #17: <unknown function> + 0x13ebf2 (0x564387c40bf2 in /anaconda/envs/open-mmlab/bin/python)
frame #18: <unknown function> + 0x15673e (0x564387c5873e in /anaconda/envs/open-mmlab/bin/python)
frame #19: PyDict_SetItemString + 0x64 (0x564387ca0e04 in /anaconda/envs/open-mmlab/bin/python)
frame #20: <unknown function> + 0x28d46d (0x564387d8f46d in /anaconda/envs/open-mmlab/bin/python)
frame #21: Py_FinalizeEx + 0x175 (0x564387d8f9c5 in /anaconda/envs/open-mmlab/bin/python)
frame #22: Py_RunMain + 0x1af (0x564387d9440f in /anaconda/envs/open-mmlab/bin/python)
frame #23: Py_BytesMain + 0x39 (0x564387d947d9 in /anaconda/envs/open-mmlab/bin/python)
frame #24: __libc_start_main + 0xe7 (0x7f7a68a07bf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x2125d4 (0x564387d145d4 in /anaconda/envs/open-mmlab/bin/python)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 12206) of binary: /anaconda/envs/open-mmlab/bin/python
tools/dist_train.sh: line 19: 12194 Segmentation fault (core dumped) python -m torch.distributed.launch --nnodes=$NNODES --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --nproc_per_node=$GPUS --master_port=$PORT $(dirname "$0")/train.py $CONFIG --seed 0 --launcher pytorch ${@:3}
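As the log itself points out, CUDA errors can be reported asynchronously, so the accuracy() line in the stack trace is not necessarily the operation that actually failed; rerunning the same dist_train.sh command with CUDA_LAUNCH_BLOCKING=1 set in the environment makes the failing kernel report synchronously. Another way to get a readable error is to reproduce the loss computation on CPU, where an out-of-range target raises an ordinary Python exception instead of an illegal memory access. The snippet below is only an illustration of that effect with made-up shapes, not code from this repository.

# Illustration (not project code): a target value >= num_classes that is not
# ignore_index breaks cross_entropy. On CPU this raises a readable exception;
# on GPU it typically surfaces later, e.g. as a device-side assert or an
# illegal memory access reported at some unrelated call.
import torch
import torch.nn.functional as F

num_classes = 2
logits = torch.randn(1, num_classes, 4, 4)       # (N, C, H, W), as in segmentation
target = torch.zeros(1, 4, 4, dtype=torch.long)
target[0, 0, 0] = 3                              # out of range: 3 >= num_classes

try:
    F.cross_entropy(logits, target, ignore_index=255)
except Exception as exc:
    print(type(exc).__name__, ':', exc)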