DDP training with multiple GPUs using WSL #11519

Closed · 1 task done
cool112624 opened this issue May 12, 2023 · 18 comments
Labels: question (Further information is requested), Stale (Stale and scheduled for closing soon)

Comments

@cool112624 (Author)

Search before asking

Question

Hi, I am training on a Windows 10 machine running WSL (Windows Subsystem for Linux), but I keep receiving an illegal memory access error. Does anyone have successful experience running DDP training in WSL?

I am using two NVIDIA TITAN X GPUs.

ERROR CODE
(duckegg) aoi@SuperMicro-E52690:/mnt/c/Users/andy/duckegg_linux/yolov5$ CUDA_LAUNCH_BLOCKING=1 python3 -m torch.distributed.run --nproc_per_node 2 train.py --img 320 --batch 4 --epoch 10 --data datasets.yaml --weights yolov5m.pt --device 0,1 --cache
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
train: weights=yolov5m.pt, cfg=, data=datasets.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=4, imgsz=320, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=ram, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v7.0-163-g016e046 Python-3.10.6 torch-2.0.1+cu118 CUDA:0 (NVIDIA TITAN X (Pascal), 12288MiB)
                                                            CUDA:1 (NVIDIA TITAN X (Pascal), 12288MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
terminate called after throwing an instance of 'terminate called after throwing an instance of 'c10::Errorc10::Error'
'
  what():    what():  CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8bc87af4d7 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8bc877936b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8bf2ab2b58 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1c5be (0x7f8bf2a835be in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x2b930 (0x7f8bf2a92930 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4d5a16 (0x7f8bc3378a16 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7f8bc8794e77 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f8bc878d69e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8bc878d7b9 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x8b (0x7f8bc337aceb in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1022d11 (0x7f8b5dafbd11 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5486439 (0x7f8baedd4439 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x548a18a (0x7f8baedd818a in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5498ff0 (0x7f8baede6ff0 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xb74b3e (0x7f8bc3a17b3e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x3bfdd5 (0x7f8bc3262dd5 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x15c99e (0x562b09cf899e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #17: _PyObject_MakeTpCall + 0x25b (0x562b09cef4ab in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #18: <unknown function> + 0x16af0b (0x562b09d06f0b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x1a2f (0x562b09ce2adf in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #20: _PyFunction_Vectorcall + 0x7c (0x562b09cf91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #21: _PyEval_EvalFrameDefault + 0x1a2f (0x562b09ce2adf in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #22: <unknown function> + 0x203ea5 (0x562b09d9fea5 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #23: <unknown function> + 0x15d449 (0x562b09cf9449 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #24: _PyEval_EvalFrameDefault + 0x6d5 (0x562b09ce1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #25: <unknown function> + 0x16ac31 (0x562b09d06c31 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #26: _PyEval_EvalFrameDefault + 0x6d5 (0x562b09ce1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #27: _PyFunction_Vectorcall + 0x7c (0x562b09cf91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x6d5 (0x562b09ce1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #29: _PyFunction_Vectorcall + 0x7c (0x562b09cf91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0x6d5 (0x562b09ce1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #31: <unknown function> + 0x141ed6 (0x562b09cdded6 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #32: PyEval_EvalCode + 0x86 (0x562b09dd4366 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #33: <unknown function> + 0x265108 (0x562b09e01108 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #34: <unknown function> + 0x25df5b (0x562b09df9f5b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #35: <unknown function> + 0x264e55 (0x562b09e00e55 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #36: _PyRun_SimpleFileObject + 0x1a8 (0x562b09e00338 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #37: _PyRun_AnyFileObject + 0x43 (0x562b09e00033 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #38: Py_RunMain + 0x2be (0x562b09df12de in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #39: Py_BytesMain + 0x2d (0x562b09dc732d in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #40: <unknown function> + 0x29d90 (0x7f8bf696fd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: __libc_start_main + 0x80 (0x7f8bf696fe40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: _start + 0x25 (0x562b09dc7225 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efe24faf4d7 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7efe24f7936b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7efe4f318b58 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1c5be (0x7efe4f2e95be in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x2b930 (0x7efe4f2f8930 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4d5a16 (0x7efe1fb78a16 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7efe24f94e77 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7efe24f8d69e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7efe24f8d7b9 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x8b (0x7efe1fb7aceb in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1022d11 (0x7efdba2fbd11 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5486439 (0x7efe0b5d4439 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x548a18a (0x7efe0b5d818a in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5498ff0 (0x7efe0b5e6ff0 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xb74b3e (0x7efe20217b3e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x3bfdd5 (0x7efe1fa62dd5 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x15c99e (0x5566674f899e in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #17: _PyObject_MakeTpCall + 0x25b (0x5566674ef4ab in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #18: <unknown function> + 0x16af0b (0x556667506f0b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x1a2f (0x5566674e2adf in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5566674f91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #21: _PyEval_EvalFrameDefault + 0x1a2f (0x5566674e2adf in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #22: <unknown function> + 0x203ea5 (0x55666759fea5 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #23: <unknown function> + 0x15d449 (0x5566674f9449 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #24: _PyEval_EvalFrameDefault + 0x6d5 (0x5566674e1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #25: <unknown function> + 0x16ae91 (0x556667506e91 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #26: _PyEval_EvalFrameDefault + 0x2753 (0x5566674e3803 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #27: _PyFunction_Vectorcall + 0x7c (0x5566674f91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x6d5 (0x5566674e1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #29: _PyFunction_Vectorcall + 0x7c (0x5566674f91ec in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0x6d5 (0x5566674e1785 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #31: <unknown function> + 0x141ed6 (0x5566674dded6 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #32: PyEval_EvalCode + 0x86 (0x5566675d4366 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #33: <unknown function> + 0x265108 (0x556667601108 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #34: <unknown function> + 0x25df5b (0x5566675f9f5b in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #35: <unknown function> + 0x264e55 (0x556667600e55 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #36: _PyRun_SimpleFileObject + 0x1a8 (0x556667600338 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #37: _PyRun_AnyFileObject + 0x43 (0x556667600033 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #38: Py_RunMain + 0x2be (0x5566675f12de in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #39: Py_BytesMain + 0x2d (0x5566675c732d in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)
frame #40: <unknown function> + 0x29d90 (0x7efe531d4d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: __libc_start_main + 0x80 (0x7efe531d4e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: _start + 0x25 (0x5566675c7225 in /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3)


WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 11413 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 11412) of binary: /mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-11_23:01:26
  host      : SuperMicro-E52690.
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 11412)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 11412
======================================================

Additional

No response

cool112624 added the question label on May 12, 2023
@cool112624 (Author)

@glenn-jocher Hi, have you tried DDP training with WSL, or only on a Linux system and Docker?

@glenn-jocher (Member)

@cool112624 hello! Thank you for your question. Distributed Data Parallel (DDP) training works on both Linux and WSL systems, as well as with Docker. Let us know if you have any further questions or concerns!

@cool112624 (Author)

@glenn-jocher I ran train.py in WSL with two GPUs, but it shows the error above. Can you help me figure out what the problem is?

@glenn-jocher (Member)

@cool112624 hi, thank you for reaching out. Can you share the error code that you're encountering? It will be easier to identify the root cause and provide a solution if we can take a look at the specific error message.

@cool112624 (Author) commented May 16, 2023

@glenn-jocher Thank you for your time. My error log is below.

ERROR CODE

(The log is identical to the one in the original post above.)

@cool112624 (Author)

Hi @glenn-jocher, can you see the error log that I pasted above?

@glenn-jocher (Member)

@cool112624 hello there,

Thank you for reaching out. We would be glad to help you with your error code. Please provide us with more details regarding your issue, such as the version of YOLOv5 you are using and the steps you followed before encountering the error. This will help us better understand and address your problem.

Looking forward to hearing back from you.

Thank you.

@cool112624 (Author)

Hi @glenn-jocher, thank you for your reply.

I am using the newest version of YOLOv5 (2023.05.26) with PyTorch 2.0.1 and CUDA 11.8.

My system is:
Intel Xeon CPU E5-2690 v4 @ 2.60GHz
192 GB of RAM
2 x NVIDIA TITAN X (Pascal)

I am running WSL from the Windows command prompt, and my command line is

python -m torch.distributed.run --nproc_per_node 2 train.py --batch 32 --epoch 100 --data coco.yaml --weights yolov5n.pt --device 0,1

The new error log is shown below.

ERROR CODE
'''
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
train: weights=yolov5n.pt, cfg=, data=coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=32, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 2023-5-26 Python-3.10.6 torch-2.0.1+cu118 CUDA:0 (NVIDIA TITAN X (Pascal), 12288MiB)
                                                            CUDA:1 (NVIDIA TITAN X (Pascal), 12288MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

Dataset not found ⚠️, missing paths ['/mnt/c/Users/andy/duckegg_linux/datasets/coco/val2017.txt']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco2017labels.zip to /mnt/c/Users/andy/duckegg_linux/datasets/coco2017labels.zip...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46.4M/46.4M [00:01<00:00, 24.6MB/s]
Unzipping /mnt/c/Users/andy/duckegg_linux/datasets/coco2017labels.zip...
^CWARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1325 closing signal SIGINT
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1326 closing signal SIGINT
Traceback (most recent call last):
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/utils/general.py", line 586, in unzip_file
    zipObj.extract(f, path=path)
  File "/usr/lib/python3.10/zipfile.py", line 1628, in extract
    return self._extract_member(member, path, pwd)
  File "/usr/lib/python3.10/zipfile.py", line 1698, in _extract_member
    with self.open(member, pwd=pwd) as source,
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/train.py", line 642, in <module>
    main(opt)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/train.py", line 531, in main
    train(opt.hyp, opt, device, callbacks)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/train.py", line 112, in train
    data_dict = data_dict or check_dataset(data)  # check if None
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/utils/general.py", line 531, in check_dataset
    r = exec(s, {'yaml': data})  # return None
  File "<string>", line 9, in <module>
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/utils/general.py", line 638, in download
    download_one(u, dir)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/utils/general.py", line 621, in download_one
    unzip_file(f, dir)  # unzip
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/utils/general.py", line 586, in unzip_file
    zipObj.extract(f, path=path)
KeyboardInterrupt
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 1326 via 2, forcefully exiting via 9
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
    time.sleep(monitor_interval)
  File "/mnt/c/Users/andy/duckegg_linux/yolov5/duckegg/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1267 got signal: 2
'''

@glenn-jocher (Member)

Dear @cool112624,

Thank you for providing the details of your system and the error log you encountered. We can see that the error pertains to a missing dataset. Specifically, the log says "Dataset not found ⚠️, missing paths ['/mnt/c/Users/andy/duckegg_linux/datasets/coco/val2017.txt']". This suggests that there might be an issue with the path or directory of your dataset.

We advise that you review and verify the location and accessibility of your dataset. Additionally, please ensure that you have provided the correct path of your dataset in your command line.
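As an illustration, a quick check along the following lines can confirm those paths before launching training. This is a rough sketch rather than a YOLOv5 utility; the file name datasets.yaml and the path/train/val/test keys are assumptions based on the standard YOLOv5 data-file layout.

# Hypothetical helper, not part of YOLOv5: verify that every path referenced in
# a dataset YAML exists before launching training.
from pathlib import Path

import yaml  # PyYAML, already a YOLOv5 dependency

data = yaml.safe_load(Path("datasets.yaml").read_text())
root = Path(data.get("path", "."))  # dataset root, if the YAML defines one
for key in ("train", "val", "test"):
    entry = data.get(key)
    if entry is None:
        continue
    # Each entry may be a single path or a list of paths, relative to the root.
    for p in entry if isinstance(entry, list) else [entry]:
        full = root / p
        print(f"{key}: {full} -> {'OK' if full.exists() else 'MISSING'}")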

We hope this information helps you address your issue with YOLOv5. If you have any further concerns or questions, don't hesitate to reach out.

Best regards.

@cool112624 (Author)

Hi @glenn-jocher, this is the new error log after I resolved the missing dataset.

ERROR CODE
'''
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
train: weights=yolov5n.pt, cfg=, data=datasets/coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=2, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 2023-5-26 Python-3.10.6 torch-2.0.1+cu118 CUDA:0 (NVIDIA TITAN X (Pascal), 12288MiB)
                                                            CUDA:1 (NVIDIA TITAN X (Pascal), 12288MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f23515174d7 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f23514e136b in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f232713fb58 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1c5be (0x7f23271105be in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x2b930 (0x7f232711f930 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d5a16 (0x7f2321d78a16 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x3ee77 (0x7f23514fce77 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f23514f569e in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f23514f57b9 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: std::vector<at::Tensor, std::allocatorat::Tensor >::~vector() + 0x8b (0x7f2321d7aceb in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x1022d11 (0x7f22bc4fbd11 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: + 0x5486439 (0x7f230d7d4439 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x548a18a (0x7f230d7d818a in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0x5498ff0 (0x7f230d7e6ff0 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0xb74b3e (0x7f2322417b3e in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: + 0x3bfdd5 (0x7f2321c62dd5 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: + 0x15c99e (0x556b733b999e in /usr/bin/python3)
frame #17: _PyObject_MakeTpCall + 0x25b (0x556b733b04ab in /usr/bin/python3)
frame #18: + 0x16af0b (0x556b733c7f0b in /usr/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x1a2f (0x556b733a3adf in /usr/bin/python3)
frame #20: _PyFunction_Vectorcall + 0x7c (0x556b733ba1ec in /usr/bin/python3)
frame #21: _PyEval_EvalFrameDefault + 0x1a2f (0x556b733a3adf in /usr/bin/python3)
frame #22: + 0x203ea5 (0x556b73460ea5 in /usr/bin/python3)
frame #23: + 0x15d449 (0x556b733ba449 in /usr/bin/python3)
frame #24: _PyEval_EvalFrameDefault + 0x6d5 (0x556b733a2785 in /usr/bin/python3)
frame #25: + 0x16ac31 (0x556b733c7c31 in /usr/bin/python3)
frame #26: _PyEval_EvalFrameDefault + 0x6d5 (0x556b733a2785 in /usr/bin/python3)
frame #27: _PyFunction_Vectorcall + 0x7c (0x556b733ba1ec in /usr/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x6d5 (0x556b733a2785 in /usr/bin/python3)
frame #29: _PyFunction_Vectorcall + 0x7c (0x556b733ba1ec in /usr/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0x6d5 (0x556b733a2785 in /usr/bin/python3)
frame #31: + 0x141ed6 (0x556b7339eed6 in /usr/bin/python3)
frame #32: PyEval_EvalCode + 0x86 (0x556b73495366 in /usr/bin/python3)
frame #33: + 0x265108 (0x556b734c2108 in /usr/bin/python3)
frame #34: + 0x25df5b (0x556b734baf5b in /usr/bin/python3)
frame #35: + 0x264e55 (0x556b734c1e55 in /usr/bin/python3)
frame #36: _PyRun_SimpleFileObject + 0x1a8 (0x556b734c1338 in /usr/bin/python3)
frame #37: _PyRun_AnyFileObject + 0x43 (0x556b734c1033 in /usr/bin/python3)
frame #38: Py_RunMain + 0x2be (0x556b734b22de in /usr/bin/python3)
frame #39: Py_BytesMain + 0x2d (0x556b7348832d in /usr/bin/python3)
frame #40: + 0x29d90 (0x7f2355361d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: __libc_start_main + 0x80 (0x7f2355361e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: _start + 0x25 (0x556b73488225 in /usr/bin/python3)

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa339daf4d7 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fa339d7936b in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fa36410fb58 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1c5be (0x7fa3640e05be in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x2b930 (0x7fa3640ef930 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d5a16 (0x7fa334978a16 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x3ee77 (0x7fa339d94e77 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7fa339d8d69e in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fa339d8d7b9 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: std::vector<at::Tensor, std::allocatorat::Tensor >::~vector() + 0x8b (0x7fa33497aceb in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x1022d11 (0x7fa2cf0fbd11 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: + 0x5486439 (0x7fa3203d4439 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x548a18a (0x7fa3203d818a in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0x5498ff0 (0x7fa3203e6ff0 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0xb74b3e (0x7fa335017b3e in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: + 0x3bfdd5 (0x7fa334862dd5 in /home/aoi/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: + 0x15c99e (0x5647a139e99e in /usr/bin/python3)
frame #17: _PyObject_MakeTpCall + 0x25b (0x5647a13954ab in /usr/bin/python3)
frame #18: + 0x16af0b (0x5647a13acf0b in /usr/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x1a2f (0x5647a1388adf in /usr/bin/python3)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5647a139f1ec in /usr/bin/python3)
frame #21: _PyEval_EvalFrameDefault + 0x1a2f (0x5647a1388adf in /usr/bin/python3)
frame #22: + 0x203ea5 (0x5647a1445ea5 in /usr/bin/python3)
frame #23: + 0x15d449 (0x5647a139f449 in /usr/bin/python3)
frame #24: _PyEval_EvalFrameDefault + 0x6d5 (0x5647a1387785 in /usr/bin/python3)
frame #25: + 0x16ae91 (0x5647a13ace91 in /usr/bin/python3)
frame #26: _PyEval_EvalFrameDefault + 0x2753 (0x5647a1389803 in /usr/bin/python3)
frame #27: _PyFunction_Vectorcall + 0x7c (0x5647a139f1ec in /usr/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x6d5 (0x5647a1387785 in /usr/bin/python3)
frame #29: _PyFunction_Vectorcall + 0x7c (0x5647a139f1ec in /usr/bin/python3)
frame #30: _PyEval_EvalFrameDefault + 0x6d5 (0x5647a1387785 in /usr/bin/python3)
frame #31: + 0x141ed6 (0x5647a1383ed6 in /usr/bin/python3)
frame #32: PyEval_EvalCode + 0x86 (0x5647a147a366 in /usr/bin/python3)
frame #33: + 0x265108 (0x5647a14a7108 in /usr/bin/python3)
frame #34: + 0x25df5b (0x5647a149ff5b in /usr/bin/python3)
frame #35: + 0x264e55 (0x5647a14a6e55 in /usr/bin/python3)
frame #36: _PyRun_SimpleFileObject + 0x1a8 (0x5647a14a6338 in /usr/bin/python3)
frame #37: _PyRun_AnyFileObject + 0x43 (0x5647a14a6033 in /usr/bin/python3)
frame #38: Py_RunMain + 0x2be (0x5647a14972de in /usr/bin/python3)
frame #39: Py_BytesMain + 0x2d (0x5647a146d32d in /usr/bin/python3)
frame #40: + 0x29d90 (0x7fa367fcfd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: __libc_start_main + 0x80 (0x7fa367fcfe40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: _start + 0x25 (0x5647a146d225 in /usr/bin/python3)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 11087) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/aoi/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/home/aoi/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/aoi/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/aoi/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/aoi/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/aoi/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2023-05-27_19:56:16
host : SuperMicro-E52690.
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 11088)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 11088

Root Cause (first observed failure):
[0]:
time : 2023-05-27_19:56:16
host : SuperMicro-E52690.
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 11087)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 11087

'''

@cool112624 (Author)

Hi @glenn-jocher, what does the illegal memory access point to?

@glenn-jocher (Member)

@cool112624 hello,

Illegal memory access usually means that the program is attempting to access memory that it is not allowed to access. This can happen for a variety of reasons, such as trying to access a null pointer or trying to access memory that has already been freed.
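As a concrete way to narrow this down in your setup, a minimal two-GPU NCCL sanity check along the following lines can help separate a multi-GPU communication problem under WSL from a problem inside YOLOv5 itself. This is an illustrative sketch, not an official reproduction script; the NCCL_* variables mentioned afterwards are standard NCCL environment settings, not YOLOv5 options.

# ddp_check.py -- hypothetical standalone sanity check for two-GPU NCCL communication.
import os

import torch
import torch.distributed as dist


def main():
    # torch.distributed.run sets RANK, LOCAL_RANK, WORLD_SIZE and MASTER_ADDR/PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # A single all_reduce exercises cross-GPU communication end to end.
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Launched with python -m torch.distributed.run --nproc_per_node 2 ddp_check.py (ideally with NCCL_DEBUG=INFO set for verbose NCCL logging), a failure here with the same error would point at GPU-to-GPU communication under WSL rather than at the training code; in that case, setting NCCL_P2P_DISABLE=1 is a common workaround when peer-to-peer access between the GPUs is unavailable. If the check succeeds, the problem is more likely inside the training pipeline.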

I hope this helps! Let us know if you have any further questions.

Thank you.

@cool112624 (Author)

@glenn-jocher
My memory is available.
What other reasons could there be? The PyTorch and CUDA versions work fine with a single GPU, so I guess they may not be the problem.

@glenn-jocher (Member)

@cool112624 hello,

Thank you for reaching out to us. While memory availability is definitely a factor that can impact YOLOv5's performance, there could be other factors at play.

You mentioned that your PyTorch and CUDA versions are up to date and running fine with a single GPU, and that's a good starting point. However, other factors, such as the size and complexity of your dataset, the batch size you're running, and even the specific hardware you're using can also play a role in determining performance.

I would recommend checking your batch size and dataset to see if reducing the former or simplifying the latter improves performance. Additionally, if possible, testing on different hardware can also provide valuable insights.

Let us know if you have any further questions or concerns.

Thank you.

@cool112624 (Author)

@glenn-jocher Hi, thank you for your reply.
I have tested with reduced batch size and image size. The number of training images is the same as when running single-GPU training, which is 13,000 images.

The batch sizes I tried were 2, 4, 8, 16 and 32, and the image sizes were 448, 544 and 640.

@cool112624 (Author)

Hi @glenn-jocher, could you share an environment in which you successfully ran this DDP multi-GPU training, so that I can replicate it and try it out? Thank you in advance.

github-actions bot (Contributor) commented Jul 6, 2023

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions bot added the Stale label on Jul 6, 2023
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 16, 2023
@glenn-jocher (Member)

@cool112624 Hello,

I appreciate your thorough testing. Unfortunately, I cannot provide an environment where I've personally tested DDP multi-GPU training, as our testing and development environments are diverse and may not be reproducible in a generic setting. However, I encourage you to refer to the official YOLOv5 documentation and community forums for successful case studies and potential environment configurations. Collaborating with the YOLO community or reaching out to fellow users who have experience in DDP multi-GPU training could also be beneficial.

Should you have any further questions or concerns, please don't hesitate to ask.

Thank you.
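For reference, when comparing environments it can help to share a short summary of the versions that matter for DDP. The snippet below is an illustrative sketch using only standard PyTorch APIs; the built-in python -m torch.utils.collect_env produces a more complete report.

# Hypothetical snippet to summarise the environment relevant to DDP debugging.
import platform

import torch

print("python :", platform.python_version())
print("torch  :", torch.__version__)
print("cuda   :", torch.version.cuda)
print("cudnn  :", torch.backends.cudnn.version())
print("nccl   :", torch.cuda.nccl.version())
for i in range(torch.cuda.device_count()):
    print(f"gpu {i}  :", torch.cuda.get_device_name(i))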
