DDP training with multiple gpu using wsl #11519

cool112624 · 2023-05-12T00:45:41Z

Search before asking

I have searched the YOLOv5 issues and discussions and found no similar questions.

Question

HI, I am training with a windows 10 machine running WSL( Windows Subsystem for Linux), but I kept receive illegal memory access error code. Does anyone have sucess experience of running in wsl

I am using two nvidia titan X

ERROR CODE

(duckegg) aoi@SuperMicro-E52690:/mnt/c/Users/andy/duckegg_linux/yolov5$ CUDA_LAUNCH_BLOCKING=1 python3 -m torch.distributed.run --nproc_per_node 2 train.py --img 320 --batch 4 --epoch 10 --data datasets.yaml --weights yolov5m.pt --device 0,1 --cache
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
train: weights=yolov5m.pt, cfg=, data=datasets.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=4, imgsz=320, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=ram, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v7.0-163-g016e046 Python-3.10.6 torch-2.0.1+cu118 CUDA:0 (NVIDIA TITAN X (Pascal), 12288MiB)
                                                            CUDA:1 (NVIDIA TITAN X (Pascal), 12288MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
terminate called after throwing an instance of 'terminate called after throwing an instance of 'c10::Errorc10::Error'
'
  what():    what():  CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8bc87af4d7 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8bc877936b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8bf2ab2b58 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1c5be (0x7f8bf2a835be in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x2b930 (0x7f8bf2a92930 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4d5a16 (0x7f8bc3378a16 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7f8bc8794e77 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f8bc878d69e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8bc878d7b9 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x8b (0x7f8bc337aceb in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1022d11 (0x7f8b5dafbd11 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5486439 (0x7f8baedd4439 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x548a18a (0x7f8baedd818a in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5498ff0 (0x7f8baede6ff0 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xb74b3e (0x7f8bc3a17b3e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x3bfdd5 (0x7f8bc3262dd5 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x15c99e (0x562b09cf899e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #17: _PyObject_MakeTpCall + 0x25b (0x562b09cef4ab in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #18: <unknown function> + 0x16af0b (0x562b09d06f0b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x1a2f (0x562b09ce2adf in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #20: _PyFunction_Vectorcall + 0x7c (0x562b09cf91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #21: _PyEval_EvalFrameDefault + 0x1a2f (0x562b09ce2adf in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #22: <unknown function> + 0x203ea5 (0x562b09d9fea5 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #23: <unknown function> + 0x15d449 (0x562b09cf9449 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #24: _PyEval_EvalFrameDefault + 0x6d5 (0x562b09ce1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #25: <unknown function> + 0x16ac31 (0x562b09d06c31 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #26: _PyEval_EvalFrameDefault + 0x6d5 (0x562b09ce1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #27: _PyFunction_Vectorcall + 0x7c (0x562b09cf91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x6d5 (0x562b09ce1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #29: _PyFunction_Vectorcall + 0x7c (0x562b09cf91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0x6d5 (0x562b09ce1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #31: <unknown function> + 0x141ed6 (0x562b09cdded6 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #32: PyEval_EvalCode + 0x86 (0x562b09dd4366 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #33: <unknown function> + 0x265108 (0x562b09e01108 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #34: <unknown function> + 0x25df5b (0x562b09df9f5b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #35: <unknown function> + 0x264e55 (0x562b09e00e55 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #36: _PyRun_SimpleFileObject + 0x1a8 (0x562b09e00338 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #37: _PyRun_AnyFileObject + 0x43 (0x562b09e00033 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #38: Py_RunMain + 0x2be (0x562b09df12de in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #39: Py_BytesMain + 0x2d (0x562b09dc732d in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #40: <unknown function> + 0x29d90 (0x7f8bf696fd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: __libc_start_main + 0x80 (0x7f8bf696fe40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: _start + 0x25 (0x562b09dc7225 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efe24faf4d7 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7efe24f7936b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7efe4f318b58 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1c5be (0x7efe4f2e95be in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x2b930 (0x7efe4f2f8930 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4d5a16 (0x7efe1fb78a16 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7efe24f94e77 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7efe24f8d69e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7efe24f8d7b9 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x8b (0x7efe1fb7aceb in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1022d11 (0x7efdba2fbd11 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5486439 (0x7efe0b5d4439 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x548a18a (0x7efe0b5d818a in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5498ff0 (0x7efe0b5e6ff0 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xb74b3e (0x7efe20217b3e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x3bfdd5 (0x7efe1fa62dd5 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x15c99e (0x5566674f899e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #17: _PyObject_MakeTpCall + 0x25b (0x5566674ef4ab in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #18: <unknown function> + 0x16af0b (0x556667506f0b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x1a2f (0x5566674e2adf in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5566674f91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #21: _PyEval_EvalFrameDefault + 0x1a2f (0x5566674e2adf in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #22: <unknown function> + 0x203ea5 (0x55666759fea5 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #23: <unknown function> + 0x15d449 (0x5566674f9449 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #24: _PyEval_EvalFrameDefault + 0x6d5 (0x5566674e1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #25: <unknown function> + 0x16ae91 (0x556667506e91 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #26: _PyEval_EvalFrameDefault + 0x2753 (0x5566674e3803 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #27: _PyFunction_Vectorcall + 0x7c (0x5566674f91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x6d5 (0x5566674e1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #29: _PyFunction_Vectorcall + 0x7c (0x5566674f91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0x6d5 (0x5566674e1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #31: <unknown function> + 0x141ed6 (0x5566674dded6 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #32: PyEval_EvalCode + 0x86 (0x5566675d4366 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #33: <unknown function> + 0x265108 (0x556667601108 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #34: <unknown function> + 0x25df5b (0x5566675f9f5b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #35: <unknown function> + 0x264e55 (0x556667600e55 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #36: _PyRun_SimpleFileObject + 0x1a8 (0x556667600338 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #37: _PyRun_AnyFileObject + 0x43 (0x556667600033 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #38: Py_RunMain + 0x2be (0x5566675f12de in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #39: Py_BytesMain + 0x2d (0x5566675c732d in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #40: <unknown function> + 0x29d90 (0x7efe531d4d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: __libc_start_main + 0x80 (0x7efe531d4e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: _start + 0x25 (0x5566675c7225 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)


WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 11413 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 11412) of binary: /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-11_23:01:26
  host      : SuperMicro-E52690.
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 11412)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 11412
======================================================

Additional

No response

The text was updated successfully, but these errors were encountered:

cool112624 · 2023-05-14T06:59:52Z

@glenn-jocher HI, do you try ddp training with wsl or just linux system and docker?

glenn-jocher · 2023-05-14T09:36:19Z

@cool112624 hello! Thank you for your question. Distributed Data Parallel (DDP) training works on both Linux and WSL systems, as well as with Docker. Let us know if you have any further questions or concerns!

cool112624 · 2023-05-16T05:23:03Z

@glenn-jocher I run train.py in wsl with two gpu but it show the error code above, can you help me see what is the problem?

glenn-jocher · 2023-05-16T08:21:54Z

@cool112624 hi, thank you for reaching out. Can you share the error code that you're encountering? It will be easier to identify the root cause and provide a solution if we can take a look at the specific error message.

cool112624 · 2023-05-16T08:38:27Z

@glenn-jocher Thank you for your time, below are my error code.

ERROR CODE

(duckegg) aoi@SuperMicro-E52690:/mnt/c/Users/andy/duckegg_linux/yolov5$ CUDA_LAUNCH_BLOCKING=1 python3 -m torch.distributed.run --nproc_per_node 2 train.py --img 320 --batch 4 --epoch 10 --data datasets.yaml --weights yolov5m.pt --device 0,1 --cache
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
train: weights=yolov5m.pt, cfg=, data=datasets.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=4, imgsz=320, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=ram, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v7.0-163-g016e046 Python-3.10.6 torch-2.0.1+cu118 CUDA:0 (NVIDIA TITAN X (Pascal), 12288MiB)
                                                            CUDA:1 (NVIDIA TITAN X (Pascal), 12288MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
terminate called after throwing an instance of 'terminate called after throwing an instance of 'c10::Errorc10::Error'
'
  what():    what():  CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8bc87af4d7 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8bc877936b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8bf2ab2b58 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1c5be (0x7f8bf2a835be in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x2b930 (0x7f8bf2a92930 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4d5a16 (0x7f8bc3378a16 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7f8bc8794e77 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f8bc878d69e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8bc878d7b9 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x8b (0x7f8bc337aceb in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1022d11 (0x7f8b5dafbd11 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5486439 (0x7f8baedd4439 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x548a18a (0x7f8baedd818a in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5498ff0 (0x7f8baede6ff0 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xb74b3e (0x7f8bc3a17b3e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x3bfdd5 (0x7f8bc3262dd5 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x15c99e (0x562b09cf899e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #17: _PyObject_MakeTpCall + 0x25b (0x562b09cef4ab in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #18: <unknown function> + 0x16af0b (0x562b09d06f0b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x1a2f (0x562b09ce2adf in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #20: _PyFunction_Vectorcall + 0x7c (0x562b09cf91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #21: _PyEval_EvalFrameDefault + 0x1a2f (0x562b09ce2adf in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #22: <unknown function> + 0x203ea5 (0x562b09d9fea5 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #23: <unknown function> + 0x15d449 (0x562b09cf9449 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #24: _PyEval_EvalFrameDefault + 0x6d5 (0x562b09ce1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #25: <unknown function> + 0x16ac31 (0x562b09d06c31 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #26: _PyEval_EvalFrameDefault + 0x6d5 (0x562b09ce1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #27: _PyFunction_Vectorcall + 0x7c (0x562b09cf91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x6d5 (0x562b09ce1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #29: _PyFunction_Vectorcall + 0x7c (0x562b09cf91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0x6d5 (0x562b09ce1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #31: <unknown function> + 0x141ed6 (0x562b09cdded6 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #32: PyEval_EvalCode + 0x86 (0x562b09dd4366 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #33: <unknown function> + 0x265108 (0x562b09e01108 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #34: <unknown function> + 0x25df5b (0x562b09df9f5b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #35: <unknown function> + 0x264e55 (0x562b09e00e55 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #36: _PyRun_SimpleFileObject + 0x1a8 (0x562b09e00338 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #37: _PyRun_AnyFileObject + 0x43 (0x562b09e00033 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #38: Py_RunMain + 0x2be (0x562b09df12de in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #39: Py_BytesMain + 0x2d (0x562b09dc732d in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #40: <unknown function> + 0x29d90 (0x7f8bf696fd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: __libc_start_main + 0x80 (0x7f8bf696fe40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: _start + 0x25 (0x562b09dc7225 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efe24faf4d7 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7efe24f7936b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7efe4f318b58 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1c5be (0x7efe4f2e95be in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x2b930 (0x7efe4f2f8930 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4d5a16 (0x7efe1fb78a16 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7efe24f94e77 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7efe24f8d69e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7efe24f8d7b9 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x8b (0x7efe1fb7aceb in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1022d11 (0x7efdba2fbd11 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5486439 (0x7efe0b5d4439 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x548a18a (0x7efe0b5d818a in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5498ff0 (0x7efe0b5e6ff0 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xb74b3e (0x7efe20217b3e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x3bfdd5 (0x7efe1fa62dd5 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x15c99e (0x5566674f899e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #17: _PyObject_MakeTpCall + 0x25b (0x5566674ef4ab in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #18: <unknown function> + 0x16af0b (0x556667506f0b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x1a2f (0x5566674e2adf in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5566674f91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #21: _PyEval_EvalFrameDefault + 0x1a2f (0x5566674e2adf in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #22: <unknown function> + 0x203ea5 (0x55666759fea5 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #23: <unknown function> + 0x15d449 (0x5566674f9449 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #24: _PyEval_EvalFrameDefault + 0x6d5 (0x5566674e1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #25: <unknown function> + 0x16ae91 (0x556667506e91 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #26: _PyEval_EvalFrameDefault + 0x2753 (0x5566674e3803 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #27: _PyFunction_Vectorcall + 0x7c (0x5566674f91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x6d5 (0x5566674e1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #29: _PyFunction_Vectorcall + 0x7c (0x5566674f91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0x6d5 (0x5566674e1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #31: <unknown function> + 0x141ed6 (0x5566674dded6 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #32: PyEval_EvalCode + 0x86 (0x5566675d4366 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #33: <unknown function> + 0x265108 (0x556667601108 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #34: <unknown function> + 0x25df5b (0x5566675f9f5b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #35: <unknown function> + 0x264e55 (0x556667600e55 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #36: _PyRun_SimpleFileObject + 0x1a8 (0x556667600338 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #37: _PyRun_AnyFileObject + 0x43 (0x556667600033 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #38: Py_RunMain + 0x2be (0x5566675f12de in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #39: Py_BytesMain + 0x2d (0x5566675c732d in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #40: <unknown function> + 0x29d90 (0x7efe531d4d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: __libc_start_main + 0x80 (0x7efe531d4e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: _start + 0x25 (0x5566675c7225 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)


WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 11413 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 11412) of binary: /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-11_23:01:26
  host      : SuperMicro-E52690.
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 11412)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 11412
======================================================

cool112624 · 2023-05-25T13:27:13Z

Hi @glenn-jocher, can you see my error code that I paste above ?

glenn-jocher · 2023-05-25T15:51:16Z

@cool112624 hello there,

Thank you for reaching out. We would be glad to help you with your error code. Please provide us with more details regarding your issue, such as the version of YOLOv5 you are using and the steps you followed before encountering the error. This will help us better understand and address your problem.

Looking forward to hearing back from you.

Thank you.

cool112624 · 2023-05-26T14:36:37Z

HI @glenn-jocher , thank you for your reply

I use the newest version of yolov5 (2023.05.26)
Pytorch 2.0.1 with CUDA 11.8

My system is
Intel Xeon CPU E5-2690 v4 @ 2.60GHz
Total ram of 192GB
2 X NVIDIA TITAN X (Pascal)

I am using wsl from windows cmd and my command line is

python -m torch.distributed.run --nproc_per_node 2 train.py --batch 32 --epoch 100 --data coco.yaml --weights yolov5n.pt --device 0,1

the new error code is as shown below

ERROR CODE

''' WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** train: weights=yolov5n.pt, cfg=, data=coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=32, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5 YOLOv5 🚀 2023-5-26 Python-3.10.6 torch-2.0.1+cu118 CUDA:0 (NVIDIA TITAN X (Pascal), 12288MiB) CUDA:1 (NVIDIA TITAN X (Pascal), 12288MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

Dataset not found ⚠️, missing paths ['/mnt/c/Users/andy/duckegg_linux/datasets/coco/val2017.txt']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco2017labels.zip to /mnt/c/Users/andy/duckegg_linux/datasets/coco2017labels.zip...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46.4M/46.4M [00:01<00:00, 24.6MB/s]
Unzipping /mnt/c/Users/andy/duckegg_linux/datasets/coco2017labels.zip...
^CWARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1325 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1326 closing signal SIGINT
Traceback (most recent call last):
File "/mnt/c/Users/andy/duckegg_linux/yolov5/utils/general.py", line 586, in unzip_file
zipObj.extract(f, path=path)
File "/usr/lib/python3.10/zipfile.py", line 1628, in extract
return self._extract_member(member, path, pwd)
File "/usr/lib/python3.10/zipfile.py", line 1698, in _extract_member
with self.open(member, pwd=pwd) as source,
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/mnt/c/Users/andy/duckegg_linux/yolov5/train.py", line 642, in
main(opt)
File "/mnt/c/Users/andy/duckegg_linux/yolov5/train.py", line 531, in main
train(opt.hyp, opt, device, callbacks)
File "/mnt/c/Users/andy/duckegg_linux/yolov5/train.py", line 112, in train
data_dict = data_dict or check_dataset(data) # check if None
File "/mnt/c/Users/andy/duckegg_linux/yolov5/utils/general.py", line 531, in check_dataset
r = exec(s, {'yaml': data}) # return None
File "", line 9, in
File "/mnt/c/Users/andy/duckegg_linux/yolov5/utils/general.py", line 638, in download
download_one(u, dir)
File "/mnt/c/Users/andy/duckegg_linux/yolov5/utils/general.py", line 621, in download_one
unzip_file(f, dir) # unzip
File "/mnt/c/Users/andy/duckegg_linux/yolov5/utils/general.py", line 586, in unzip_file
zipObj.extract(f, path=path)
KeyboardInterrupt
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 1326 via 2, forcefully exiting via 9
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in
main()
File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
time.sleep(monitor_interval)
File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1267 got signal: 2
'''

glenn-jocher · 2023-05-26T18:20:17Z

Dear @cool112624,

Thank you for providing the details of your system and the error log you encountered. We can see that the error pertains to a missing dataset. Specifically, the error says that the "Dataset is not found, missing paths ['/mnt/c/Users/andy/duckegg_linux/datasets/coco/val2017.txt']" This suggests that there might be an issue with the path or directory of your dataset.

We advise that you review and verify the location and accessibility of your dataset. Additionally, please ensure that you have provided the correct path of your dataset in your command line.

We hope this information helps you address your issue with YOLOv5. If you have any further concerns or questions, don't hesitate to reach out.

Best regards.

cool112624 · 2023-05-27T11:58:23Z

Hi @glenn-jocher , This is the new error code after I solve the missing datasets

ERROR CODE

''' WARNING:__main__: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ***************************************** train: weights=yolov5n.pt, cfg=, data=datasets/coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=2, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5 YOLOv5 🚀 2023-5-26 Python-3.10.6 torch-2.0.1+cu118 CUDA:0 (NVIDIA TITAN X (Pascal), 12288MiB) CUDA:1 (NVIDIA TITAN X (Pascal), 12288MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f23515174d7 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f23514e136b in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f232713fb58 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1c5be (0x7f23271105be in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x2b930 (0x7f232711f930 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d5a16 (0x7f2321d78a16 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x3ee77 (0x7f23514fce77 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f23514f569e in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f23514f57b9 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: std::vector<at::Tensor, std::allocatorat::Tensor >::~vector() + 0x8b (0x7f2321d7aceb in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x1022d11 (0x7f22bc4fbd11 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: + 0x5486439 (0x7f230d7d4439 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x548a18a (0x7f230d7d818a in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0x5498ff0 (0x7f230d7e6ff0 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0xb74b3e (0x7f2322417b3e in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: + 0x3bfdd5 (0x7f2321c62dd5 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: + 0x15c99e (0x556b733b999e in /usr/bin/python3)
frame #17: _PyObject_MakeTpCall + 0x25b (0x556b733b04ab in /usr/bin/python3)
frame #18: + 0x16af0b (0x556b733c7f0b in /usr/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x1a2f (0x556b733a3adf in /usr/bin/python3)
frame #20: _PyFunction_Vectorcall + 0x7c (0x556b733ba1ec in /usr/bin/python3)
frame #21: _PyEval_EvalFrameDefault + 0x1a2f (0x556b733a3adf in /usr/bin/python3)
frame #22: + 0x203ea5 (0x556b73460ea5 in /usr/bin/python3)
frame #23: + 0x15d449 (0x556b733ba449 in /usr/bin/python3)
frame #24: _PyEval_EvalFrameDefault + 0x6d5 (0x556b733a2785 in /usr/bin/python3)
frame #25: + 0x16ac31 (0x556b733c7c31 in /usr/bin/python3)
frame #26: _PyEval_EvalFrameDefault + 0x6d5 (0x556b733a2785 in /usr/bin/python3)
frame #27: _PyFunction_Vectorcall + 0x7c (0x556b733ba1ec in /usr/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x6d5 (0x556b733a2785 in /usr/bin/python3)
frame #29: _PyFunction_Vectorcall + 0x7c (0x556b733ba1ec in /usr/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0x6d5 (0x556b733a2785 in /usr/bin/python3)
frame #31: + 0x141ed6 (0x556b7339eed6 in /usr/bin/python3)
frame #32: PyEval_EvalCode + 0x86 (0x556b73495366 in /usr/bin/python3)
frame #33: + 0x265108 (0x556b734c2108 in /usr/bin/python3)
frame #34: + 0x25df5b (0x556b734baf5b in /usr/bin/python3)
frame #35: + 0x264e55 (0x556b734c1e55 in /usr/bin/python3)
frame #36: _PyRun_SimpleFileObject + 0x1a8 (0x556b734c1338 in /usr/bin/python3)
frame #37: _PyRun_AnyFileObject + 0x43 (0x556b734c1033 in /usr/bin/python3)
frame #38: Py_RunMain + 0x2be (0x556b734b22de in /usr/bin/python3)
frame #39: Py_BytesMain + 0x2d (0x556b7348832d in /usr/bin/python3)
frame #40: + 0x29d90 (0x7f2355361d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: __libc_start_main + 0x80 (0x7f2355361e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: _start + 0x25 (0x556b73488225 in /usr/bin/python3)

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa339daf4d7 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fa339d7936b in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fa36410fb58 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1c5be (0x7fa3640e05be in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x2b930 (0x7fa3640ef930 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d5a16 (0x7fa334978a16 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x3ee77 (0x7fa339d94e77 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7fa339d8d69e in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fa339d8d7b9 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: std::vector<at::Tensor, std::allocatorat::Tensor >::~vector() + 0x8b (0x7fa33497aceb in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x1022d11 (0x7fa2cf0fbd11 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: + 0x5486439 (0x7fa3203d4439 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x548a18a (0x7fa3203d818a in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0x5498ff0 (0x7fa3203e6ff0 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0xb74b3e (0x7fa335017b3e in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: + 0x3bfdd5 (0x7fa334862dd5 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: + 0x15c99e (0x5647a139e99e in /usr/bin/python3)
frame #17: _PyObject_MakeTpCall + 0x25b (0x5647a13954ab in /usr/bin/python3)
frame #18: + 0x16af0b (0x5647a13acf0b in /usr/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x1a2f (0x5647a1388adf in /usr/bin/python3)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5647a139f1ec in /usr/bin/python3)
frame #21: _PyEval_EvalFrameDefault + 0x1a2f (0x5647a1388adf in /usr/bin/python3)
frame #22: + 0x203ea5 (0x5647a1445ea5 in /usr/bin/python3)
frame #23: + 0x15d449 (0x5647a139f449 in /usr/bin/python3)
frame #24: _PyEval_EvalFrameDefault + 0x6d5 (0x5647a1387785 in /usr/bin/python3)
frame #25: + 0x16ae91 (0x5647a13ace91 in /usr/bin/python3)
frame #26: _PyEval_EvalFrameDefault + 0x2753 (0x5647a1389803 in /usr/bin/python3)
frame #27: _PyFunction_Vectorcall + 0x7c (0x5647a139f1ec in /usr/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x6d5 (0x5647a1387785 in /usr/bin/python3)
frame #29: _PyFunction_Vectorcall + 0x7c (0x5647a139f1ec in /usr/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0x6d5 (0x5647a1387785 in /usr/bin/python3)
frame #31: + 0x141ed6 (0x5647a1383ed6 in /usr/bin/python3)
frame #32: PyEval_EvalCode + 0x86 (0x5647a147a366 in /usr/bin/python3)
frame #33: + 0x265108 (0x5647a14a7108 in /usr/bin/python3)
frame #34: + 0x25df5b (0x5647a149ff5b in /usr/bin/python3)
frame #35: + 0x264e55 (0x5647a14a6e55 in /usr/bin/python3)
frame #36: _PyRun_SimpleFileObject + 0x1a8 (0x5647a14a6338 in /usr/bin/python3)
frame #37: _PyRun_AnyFileObject + 0x43 (0x5647a14a6033 in /usr/bin/python3)
frame #38: Py_RunMain + 0x2be (0x5647a14972de in /usr/bin/python3)
frame #39: Py_BytesMain + 0x2d (0x5647a146d32d in /usr/bin/python3)
frame #40: + 0x29d90 (0x7fa367fcfd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: __libc_start_main + 0x80 (0x7fa367fcfe40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: _start + 0x25 (0x5647a146d225 in /usr/bin/python3)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 11087) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/aoi/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in
main()
File "/home/aoi/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, kwargs)
File "/home/aoi/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/aoi/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/aoi/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call**
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/aoi/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2023-05-27_19:56:16
host : SuperMicro-E52690.
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 11088)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 11088

Root Cause (first observed failure):
[0]:
time : 2023-05-27_19:56:16
host : SuperMicro-E52690.
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 11087)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 11087

'''

cool112624 · 2023-05-30T08:10:58Z

Hi @glenn-jocher, what does illegal memory acess points to ?

glenn-jocher · 2023-05-30T10:49:48Z

@cool112624 hello,

Illegal memory access usually means that the program is attempting to access memory that it is not allowed to access. This can happen for a variety of reasons, such as trying to access a null pointer or trying to access memory that has already been freed.

I hope this helps! Let us know if you have any further questions.

Thank you.

cool112624 · 2023-05-30T11:27:21Z

@glenn-jocher
My memory is available
what else reasons could it be the pytorch version and cuda version are fine with single gpu so I guess it may not be the problem

glenn-jocher · 2023-05-30T15:03:48Z

@cool112624 hello,

Thank you for reaching out to us. While memory availability is definitely a factor that can impact YOLOv5's performance, there could be other factors at play.

You mentioned that your PyTorch and CUDA versions are up to date and running fine with a single GPU, and that's a good starting point. However, other factors, such as the size and complexity of your dataset, the batch size you're running, and even the specific hardware you're using can also play a role in determining performance.

I would recommend checking your batch size and dataset to see if reducing the former or simplifying the latter improves performance. Additionally, if possible, testing on different hardware can also provide valuable insights.

Let us know if you have any further questions or concerns.

Thank you.

cool112624 · 2023-05-30T15:26:06Z

@glenn-jocher Hi, thank you for your reply
I have tested with reduce batch size and img size, the train images number is same with when running single gpu training which is 13000 images

the batch size I had tried is from 2, 4, 8, 16 and 32
and image size of 448, 544 and 640

cool112624 · 2023-06-05T07:22:59Z

Hi, @glenn-jocher, or can I know a environment where you sucessfully ran this ddp multi gpu training so that I can make a replica of your environment and try out. Thank you in advance

github-actions · 2023-07-06T00:26:56Z

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Docs: https://docs.ultralytics.com
HUB: https://hub.ultralytics.com
Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

glenn-jocher · 2023-11-14T17:44:10Z

@cool112624 Hello,

I appreciate your thorough testing. Unfortunately, I cannot provide an environment where I've personally tested DDP multi-GPU training, as our testing and development environments are diverse and may not be reproducible in a generic setting. However, I encourage you to refer to the official YOLOv5 documentation and community forums for successful case studies and potential environment configurations. Collaborating with the YOLO community or reaching out to fellow users who have experience in DDP multi-GPU training could also be beneficial.

Should you have any further questions or concerns, please don't hesitate to ask.

Thank you.

cool112624 added the question Further information is requested label May 12, 2023

github-actions bot added the Stale Stale and schedule for closing soon label Jul 6, 2023

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DDP training with multiple gpu using wsl #11519

DDP training with multiple gpu using wsl #11519

cool112624 commented May 12, 2023

Additional

cool112624 commented May 14, 2023

glenn-jocher commented May 14, 2023

cool112624 commented May 16, 2023

glenn-jocher commented May 16, 2023

cool112624 commented May 16, 2023 •

edited

Loading

cool112624 commented May 25, 2023

glenn-jocher commented May 25, 2023

cool112624 commented May 26, 2023

glenn-jocher commented May 26, 2023

cool112624 commented May 27, 2023

train.py FAILED

Failures:
[1]:
time : 2023-05-27_19:56:16
host : SuperMicro-E52690.
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 11088)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 11088

Root Cause (first observed failure):
[0]:
time : 2023-05-27_19:56:16
host : SuperMicro-E52690.
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 11087)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 11087

cool112624 commented May 30, 2023

glenn-jocher commented May 30, 2023

cool112624 commented May 30, 2023

glenn-jocher commented May 30, 2023

cool112624 commented May 30, 2023

cool112624 commented Jun 5, 2023

github-actions bot commented Jul 6, 2023

glenn-jocher commented Nov 14, 2023

DDP training with multiple gpu using wsl #11519

DDP training with multiple gpu using wsl #11519

Comments

cool112624 commented May 12, 2023

Search before asking

Question

Additional

cool112624 commented May 14, 2023

glenn-jocher commented May 14, 2023

cool112624 commented May 16, 2023

glenn-jocher commented May 16, 2023

cool112624 commented May 16, 2023 • edited Loading

cool112624 commented May 25, 2023

glenn-jocher commented May 25, 2023

cool112624 commented May 26, 2023

glenn-jocher commented May 26, 2023

cool112624 commented May 27, 2023

train.py FAILED

Failures: [1]: time : 2023-05-27_19:56:16 host : SuperMicro-E52690. rank : 1 (local_rank: 1) exitcode : -6 (pid: 11088) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 11088

Root Cause (first observed failure): [0]: time : 2023-05-27_19:56:16 host : SuperMicro-E52690. rank : 0 (local_rank: 0) exitcode : -6 (pid: 11087) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 11087

cool112624 commented May 30, 2023

glenn-jocher commented May 30, 2023

cool112624 commented May 30, 2023

glenn-jocher commented May 30, 2023

cool112624 commented May 30, 2023

cool112624 commented Jun 5, 2023

github-actions bot commented Jul 6, 2023

glenn-jocher commented Nov 14, 2023

cool112624 commented May 16, 2023 •

edited

Loading

Failures:
[1]:
time : 2023-05-27_19:56:16
host : SuperMicro-E52690.
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 11088)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 11088

Root Cause (first observed failure):
[0]:
time : 2023-05-27_19:56:16
host : SuperMicro-E52690.
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 11087)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 11087