Image sizes 576 train, 576 val
Using 8 dataloader workers
Logging results to runs/train/exp5
Starting training for 60 epochs...
Epoch gpu_mem box obj cls labels img_size
0/59 10.6G 0.1132 0.02993 0 29 576: 3%|██▎ | 1/39 [00:10<06:48, 10.75s/it]
Reducer buckets have been rebuilt in this iteration.
0/59 10.6G 0.09944 0.03163 0 19 576: 100%|████████████████████████████████████████████████████████████████████████████████████████| 39/39 [01:38<00:00, 2.54s/it]
Class Images Labels P R mAP@.5 mAP@.5:.95: 3%|█▊ | 1/39 [00:05<03:43, 5.89s/it]
Traceback (most recent call last):
File "train.py", line 620, in <module>
main(opt)
File "train.py", line 518, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 312, in train
pred = model(imgs) # forward
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/yolo.py", line 123, in forward
return self.forward_once(x, profile, visualize) # single-scale inference, train
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/yolo.py", line 155, in forward_once
x = m(x) # run
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/common.py", line 137, in forward
return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/common.py", line 103, in forward
return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/common.py", line 45, in forward
return self.act(self.bn(self.conv(x)))
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 113, in forward
self.num_batches_tracked = self.num_batches_tracked + 1 # type: ignore
RuntimeError: CUDA error: the launch timed out and was terminated
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa2a2d962f2 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fa2a2d9367b in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7fa2a2fee1f9 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fa2a2d7e3a4 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7fa316ec0ac9 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x7fa316eb5a8a in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fa316edcd22 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fa316818df6 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa2201f (0x7fa316ee001f in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x369f00 (0x7fa316827f00 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36b16e (0x7fa31682916e in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0xfa96c (0x560dc73e296c in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #12: <unknown function> + 0x18f2f5 (0x560dc74772f5 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #13: <unknown function> + 0xfaef8 (0x560dc73e2ef8 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #14: <unknown function> + 0xfd538 (0x560dc73e5538 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #15: <unknown function> + 0xfd5d9 (0x560dc73e55d9 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #16: <unknown function> + 0xfd5d9 (0x560dc73e55d9 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #17: PyDict_SetItemString + 0x401 (0x560dc74893d1 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #18: PyImport_Cleanup + 0xa4 (0x560dc75574e4 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #19: Py_FinalizeEx + 0x7a (0x560dc7557a9a in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #20: Py_RunMain + 0x1b8 (0x560dc755c5c8 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #21: Py_BytesMain + 0x39 (0x560dc755c939 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #22: __libc_start_main + 0xf3 (0x7fa31e2ce0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #23: <unknown function> + 0x1e8f39 (0x560dc74d0f39 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
Killing subprocess 160871
Killing subprocess 160872
Traceback (most recent call last):
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3', '-u', 'train.py', '--local_rank=1', '--batch', '32', '--data', 'coco.yaml', '--weights', 'yolov5x.pt', '--device', '0,1', '--imgsz', '560', '--cfg', 'yolov5x.yaml']' died with <Signals.SIGABRT: 6>.
How can I solve this problem?
@coallar your command seems fine, though --cfg yolov5x.yaml is redundant given your --weights yolov5x.pt. For best multi-GPU performance, we always recommend training with DDP inside our Docker image.
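To make the remark about the redundant flag concrete, here is a minimal Python sketch that rebuilds the reported launch command without --cfg. The helper name drop_redundant_cfg is hypothetical and not part of YOLOv5; the argument list is copied from the command in this issue. This only tidies the command line and is not claimed to resolve the CUDA launch timeout.

```python
import shlex


def drop_redundant_cfg(args):
    """Return a copy of args without `--cfg <file>` when `--weights` is also given.

    Hypothetical helper for illustration only; it is not a YOLOv5 function.
    """
    if "--weights" not in args or "--cfg" not in args:
        return list(args)
    cleaned = []
    skip_next = False
    for arg in args:
        if skip_next:  # this is the value that followed --cfg
            skip_next = False
            continue
        if arg == "--cfg":
            skip_next = True
            continue
        cleaned.append(arg)
    return cleaned


# Launch command as reported in the issue.
original = [
    "python3", "-m", "torch.distributed.launch", "--nproc_per_node", "2",
    "train.py", "--batch", "32", "--data", "coco.yaml",
    "--weights", "yolov5x.pt", "--device", "0,1", "--imgsz", "560",
    "--cfg", "yolov5x.yaml",
]

command = drop_redundant_cfg(original)
print(shlex.join(command))
```

The printed line can be pasted into a shell as-is; every other flag is left exactly as the reporter supplied it.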
Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.
System: Ubuntu 20.04
Driver info:
CUDA:0 (NVIDIA GeForce GTX TITAN X, 12204.4375MB)
CUDA:1 (NVIDIA GeForce GTX TITAN X, 12212.875MB)
torch & CUDA info:
torch.__version__ ====> '1.8.0+cu111'
Command: python3 -m torch.distributed.launch --nproc_per_node 2 train.py --batch 32 --data coco.yaml --weights yolov5x.pt --device 0,1 --imgsz 560 --cfg yolov5x.yaml