RuntimeError: DataLoader worker (pid(s) 296430) exited unexpectedly #1675

Closed
xuqinghan opened this issue Dec 12, 2020 · 4 comments
Labels
bug Something isn't working Stale Stale and schedule for closing soon

Comments

@xuqinghan

🐛 Bug

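PyTorch 1.7.1

Training with 10 dataloader workers. The first training epoch completes, then a DataLoader worker exits unexpectedly during the validation pass that follows.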

To Reproduce (REQUIRED)

Input:

python train.py --batch 16 --epochs 200 --workers 10 --weights yolov5s.pt --data warship.yaml

Output:

Using CUDA device0 _CudaDeviceProperties(name='GeForce RTX 3080', total_memory=10015MB)

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='', data='./data/warship.yaml', device='', epochs=200, evolve=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], local_rank=-1, logdir='runs/', multi_scale=False, name='', noautoanchor=False, nosave=False, notest=False, rect=False, resume=False, single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov5s.pt', workers=10, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Overriding model.yaml nc=80 with nc=8

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]                    
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     19904  models.common.BottleneckCSP             [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  1    161152  models.common.BottleneckCSP             [128, 128, 3]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  1    641792  models.common.BottleneckCSP             [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]        
  9                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]          
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    378624  models.common.BottleneckCSP             [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     95104  models.common.BottleneckCSP             [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    313088  models.common.BottleneckCSP             [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]          
 24      [17, 20, 23]  1     35061  models.yolo.Detect                      [8, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 191 layers, 7.27397e+06 parameters, 7.27397e+06 gradients

Transferred 364/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Scanning labels ../warship/labels.cache (21200 found, 0 missing, 75 empty, 0 duplicate, for 21275 images): 21275it [00:01, 18689.53it/s]
Scanning labels ../warship/labels.cache (21200 found, 0 missing, 75 empty, 0 duplicate, for 21275 images): 21275it [00:01, 18503.85it/s]
Inconsistency detected by ld.so: dl-open.c: 560: dl_open_worker: Assertion `_dl_debug_initialize (0, args->nsid)->r_state == RT_CONSISTENT' failed!

Analyzing anchors... anchors/target = 4.07, Best Possible Recall (BPR) = 1.0000
Image sizes 640 train, 640 test
Using 10 dataloader workers
Logging results to runs/exp9
Starting training for 200 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
     0/199     5.15G   0.05539   0.02971   0.02252    0.1076        49       640: 100%|███████████████████████████████████████████████████████████| 1330/1330 [03:20<00:00,  6.64it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95:   0%|                                                  | 3/1330 [00:06<44:56,  2.03s/it]
Traceback (most recent call last):
  File "/home/xcsc/anaconda3/envs/yolo_v5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/xcsc/anaconda3/envs/yolo_v5/lib/python3.8/queue.py", line 178, in get
    raise Empty
_queue.Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 460, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 313, in train
    results, maps, times = test.test(opt.data,
  File "/home/xcsc/dev/yolo/yolov5-3.1/test.py", line 95, in test
    for batch_i, (img, targets, paths, shapes) in enumerate(tqdm(dataloader, desc=s)):
  File "/home/xcsc/anaconda3/envs/yolo_v5/lib/python3.8/site-packages/tqdm/std.py", line 1167, in __iter__
    for obj in iterable:
  File "/home/xcsc/dev/yolo/yolov5-3.1/utils/datasets.py", line 91, in __iter__
    yield next(self.iterator)
  File "/home/xcsc/anaconda3/envs/yolo_v5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/xcsc/anaconda3/envs/yolo_v5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
    idx, data = self._get_data()
  File "/home/xcsc/anaconda3/envs/yolo_v5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1024, in _get_data
    success, data = self._try_get_data()
  File "/home/xcsc/anaconda3/envs/yolo_v5/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 885, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 296430) exited unexpectedly

Expected behavior

A clear and concise description of what you expected to happen.

Environment

If applicable, add screenshots to help explain your problem.

  • OS: Ubuntu 20.04
  • GPU: GeForce RTX 3080
  • CPU: Intel i9-10900 (10 cores, 20 threads)

Additional context

Add any other context about the problem here.

@xuqinghan xuqinghan added the bug Something isn't working label Dec 12, 2020
@github-actions
Contributor

github-actions bot commented Dec 12, 2020

Hello @xuqinghan, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

@xuqinghan this error indicates you are attempting to use too many workers.
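
One way to act on this advice is to cap the requested worker count by what the machine and the batch size can support. The sketch below is only an illustration, not YOLOv5's actual code: `cap_workers` is a hypothetical helper, and its rule (no more workers than CPU threads or samples per batch) is an assumption. It does not look at RAM or shared memory, which may be the real limit when a worker process is killed.

```python
import os
from typing import Optional


def cap_workers(requested: int, batch_size: int, cpu_count: Optional[int] = None) -> int:
    """Return a worker count no larger than the CPU count or the batch size.

    Hypothetical helper for illustration; not part of YOLOv5 or PyTorch.
    """
    if cpu_count is None:
        # os.cpu_count() can return None on unusual platforms; fall back to 1.
        cpu_count = os.cpu_count() or 1
    if requested <= 0 or batch_size <= 1:
        # 0 workers means batches are loaded in the main process.
        return 0
    return min(requested, batch_size, cpu_count)


if __name__ == "__main__":
    # Values from this issue: --workers 10, --batch 16, i9-10900 with 20 threads.
    print(cap_workers(requested=10, batch_size=16, cpu_count=20))
```

With the values from this issue the cap leaves the 10 workers unchanged, so a CPU-count rule alone would not have prevented the crash. Lowering `--workers` by hand and rerunning is the more direct test.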

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the Stale Stale and schedule for closing soon label Jan 12, 2021
@Albertwindows

Albertwindows commented Jul 19, 2021

Set --workers 0 to disable multiprocess data loading.
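
With 0 workers, PyTorch's DataLoader loads batches in the main process, so there is no worker subprocess left to crash, at the cost of slower loading. A rough sketch of automating that fallback follows. `make_loader` and `run_epoch` are hypothetical callables standing in for whatever builds the dataloader and consumes it, and the default worker counts are arbitrary. Only the `RuntimeError` type and its message text come from the traceback in this issue.

```python
from typing import Callable, Iterable, Sequence


def run_with_worker_fallback(
    make_loader: Callable[[int], Iterable],
    run_epoch: Callable[[Iterable], None],
    worker_counts: Sequence[int] = (10, 4, 0),
) -> int:
    """Try each worker count in turn; return the first one that completes.

    Hypothetical helper for illustration; not part of YOLOv5 or PyTorch.
    """
    last_error = None
    for n in worker_counts:
        try:
            run_epoch(make_loader(n))
            return n
        except RuntimeError as err:
            # Retry only on the worker-crash message seen in the traceback;
            # any other RuntimeError is re-raised unchanged.
            if "exited unexpectedly" not in str(err):
                raise
            last_error = err
    raise RuntimeError("all worker counts failed") from last_error
```

Each retry reruns the whole pass from the start, so this suits a short smoke test better than a 200-epoch run.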


3 participants