
Run tutorial: RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM #1469

Closed

nguyen14ck opened this issue Nov 22, 2020 · 20 comments

Labels
bug (Something isn't working), Stale (Stale and scheduled for closing soon)

Comments

nguyen14ck commented Nov 22, 2020

Issue #185 was closed, so I am opening a new one.

🐛 Bug

Training on COCO128 crashes immediately with RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM. The full traceback is included under Output in the To Reproduce section below.

To Reproduce (REQUIRED)

# Train YOLOv5s on COCO128 for 3 epochs
!python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --nosave --cache

Output:

Model Summary: 283 layers, 7468157 parameters, 7468157 gradients

Transferred 370/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Scanning labels data/coco128/labels/train2017.cache (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 128it [00:00, 9818.95it/s]
Caching images (0.1GB): 100%|███████████████| 128/128 [00:00<00:00, 1223.42it/s]
Scanning labels data/coco128/labels/train2017.cache (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 128it [00:00, 9018.49it/s]
Caching images (0.1GB): 100%|████████████████| 128/128 [00:00<00:00, 562.80it/s]

Analyzing anchors... anchors/target = 4.26, Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 test
Using 8 dataloader workers
Logging results to runs/train/exp3
Starting training for 3 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
  0%|                                                     | 0/8 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 490, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 292, in train
    scaler.scale(loss).backward()
  File "/home/npnguyen/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/npnguyen/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
Exception raised from operator() at /opt/conda/conda-bld/pytorch_1595629416375/work/aten/src/ATen/native/cudnn/Conv.cpp:1141 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f9ff06da77d in /home/npnguyen/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xcadca2 (0x7f9f79915ca2 in /home/npnguyen/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xcafe05 (0x7f9f79917e05 in /home/npnguyen/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xcb06ce (0x7f9f799186ce in /home/npnguyen/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xcb0d90 (0x7f9f79918d90 in /home/npnguyen/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution_backward_weight(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) + 0x49 (0x7f9f79918fe9 in 

Expected behavior

Fusing layers...
Model Summary: 484 layers, 88922205 parameters, 0 gradients
Scanning labels ../coco/labels/val2017.cache (4952 found, 0 missing, 48 empty, 0 duplicate, for 5000 images): 5000it [00:00, 14785.71it/s]
Class Images Targets P R mAP@.5 mAP@.5:.95: 100% 157/157 [01:30<00:00, 1.74it/s]
all 5e+03 3.63e+04 0.409 0.754 0.672 0.484
Speed: 5.9/2.1/7.9 ms inference/NMS/total per 640x640 image at batch-size 32

Evaluating pycocotools mAP... saving runs/test/exp/yolov5x_predictions.json...
loading annotations into memory...
Done (t=0.43s)

Environment

  • OS: CentOS 7
  • GPU: Quadro RTX 5000

Additional Information

%pip install -qr requirements.txt  # install dependencies

import torch
from IPython.display import Image, clear_output  # to display images

clear_output()
print('Setup complete. Using torch %s %s' % (torch.__version__, torch.cuda.get_device_properties(0) if torch.cuda.is_available() else 'CPU'))

Setup complete. Using torch 1.6.0 _CudaDeviceProperties(name='Quadro RTX 5000', major=7, minor=5, total_memory=16117MB, multi_processor_count=48)
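
For completeness, the check above can be extended to also print the CUDA/cuDNN versions that PyTorch was built against and the compute capability of every visible GPU. This is a minimal diagnostic sketch using only standard torch calls, not anything YOLOv5-specific:

import torch

print('torch:', torch.__version__)
print('CUDA build:', torch.version.cuda)
print('cuDNN:', torch.backends.cudnn.version())
for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    # name, compute capability and memory for each visible device
    print('cuda:%d %s, capability %d.%d, %dMB' % (i, p.name, p.major, p.minor, p.total_memory // 2**20))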

@nguyen14ck nguyen14ck added the bug Something isn't working label Nov 22, 2020
github-actions bot (Contributor) commented Nov 22, 2020

Hello @nguyen14ck, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher (Member) commented

@nguyen14ck install Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7.

nguyen14ck (Author) commented

Thanks, @glenn-jocher.
I installed Python 3.8.5, PyTorch 1.7, and the requirements.
But the problem still exists:

Epoch 1/1:   0%|        | 0/2699 [00:04<?, ?img/s]
Traceback (most recent call last):
  File "/home/centos_user/Documents/WD/DEEP_LEARNING/Notebooks/work2/yolov4/train_wheat.py", line 700, in <module>
    train(model=model,
  File "/home/centos_user/Documents/WD/DEEP_LEARNING/Notebooks/work2/yolov4/train_wheat.py", line 420, in train
    bboxes_pred = model(images)
  File "/home/centos_user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/centos_user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/centos_user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/centos_user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/centos_user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/centos_user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/centos_user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/centos_user/Documents/WD/DEEP_LEARNING/Notebooks/work2/yolov4/input/pytorch-YOLOv4/tool/darknet2pytorch.py", line 172, in forward
    x = self.models[ind](x)
  File "/home/centos_user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/centos_user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/centos_user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/centos_user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/home/centos_user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 419, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
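
A quick way to check whether the failure is tied to one particular GPU rather than to the model code is to run a tiny convolution forward/backward on each device in isolation. A minimal diagnostic sketch (not part of the repo):

import torch
import torch.nn as nn

for i in range(torch.cuda.device_count()):
    dev = 'cuda:%d' % i
    try:
        conv = nn.Conv2d(3, 16, kernel_size=3, padding=1).to(dev)
        x = torch.randn(2, 3, 64, 64, device=dev)
        conv(x).sum().backward()   # exercises cuDNN forward and backward-weight kernels
        print(dev, 'OK')
    except RuntimeError as e:
        print(dev, 'FAILED:', e)

If only one of the devices fails, the problem is likely the driver/cuDNN pairing for that card rather than the training script.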

glenn-jocher (Member) commented

@nguyen14ck I'm not sure exactly what the problem may be. We've had some problems with Anaconda in the past, so one thing I would recommend is for you to simply create a new virtual Python 3.8 environment (venv), clone the latest repo (code changes daily), and pip install -r requirements.txt again.

Other than that it may be an issue with your drivers.

You can always try the docker container as well, as it should completely remove all environment problems.
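
Once a fresh environment is built, a quick sanity check that the interpreter and torch actually meet the stated minimums can save a round trip. A minimal sketch, nothing YOLOv5-specific:

import sys
import torch

assert sys.version_info >= (3, 8), 'Python 3.8+ required, found %s' % sys.version.split()[0]
major, minor = (int(v) for v in torch.__version__.split('+')[0].split('.')[:2])
assert (major, minor) >= (1, 7), 'torch>=1.7 required, found %s' % torch.__version__
print('Environment OK:', sys.version.split()[0], torch.__version__, 'CUDA:', torch.cuda.is_available())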

nguyen14ck (Author) commented

Thanks, @glenn-jocher.
That is a new Conda env with Python 3.8, PyTorch 1.7, CUDA 11.1, and cuDNN 8.0.5:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
$ ./mnistCUDNN
--
Executing:   mnistCUDNN
cudnnGetVersion()   : 8005 , CUDNN_VERSION from cudnn.h : 8005 (8.0.5)
Host   compiler version : GCC 4.8.5
$ python -c "import torch;from torch.utils.cpp_extension import CUDA_HOME;print(CUDA_HOME);print(torch.cuda.is_available())"
/usr/local/cuda/
True

glenn-jocher (Member) commented Nov 23, 2020

@nguyen14ck sure. We don't have the resources to help people with their local environments, which is why we offer the four validated environments. I would recommend you start from one of these:

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.

github-actions bot (Contributor) commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the Stale Stale and schedule for closing soon label Dec 24, 2020
blakeliu commented Jan 10, 2021

I hit the same bug when I use two NVIDIA graphics cards (RTX 2070 and GTX 1070 Ti):

$ python train.py --device 0,1 --img 640 --batch 16 --epochs 5 --data coco128.yaml --weights yolov5s.pt
Using torch 1.7.1+cu110 CUDA:0 (GeForce RTX 2070, 7982MB)
CUDA:1 (GeForce GTX 1070 Ti, 8118MB)

Traceback (most recent call last):
  File "/home/blake/cv/yolo/yolov5/train.py", line 490, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "/home/blake/cv/yolo/yolov5/train.py", line 286, in train
    pred = model(imgs)  # forward
  File "/home/blake/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blake/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/blake/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/blake/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/blake/anaconda3/envs/torch/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/blake/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/blake/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blake/cv/yolo/yolov5/models/yolo.py", line 121, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "/home/blake/cv/yolo/yolov5/models/yolo.py", line 137, in forward_once
    x = m(x)  # run
  File "/home/blake/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blake/cv/yolo/yolov5/models/common.py", line 70, in forward
    y2 = self.cv2(x)
  File "/home/blake/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blake/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/home/blake/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM

If I use only the RTX 2070 or only the GTX 1070 Ti, the program runs normally!

My Env:

OS: Ubuntu 18.04.5 LTS

Driver Version: 460.32.03

(torch) blake@workstation:~/cv/yolo/yolov5$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

(torch) blake@workstation:~/cv/yolo/yolov5$ conda list python
# packages in environment at /home/blake/anaconda3/envs/torch:
python                    3.7.9                h7579374_0    defaults
python-dateutil           2.8.1                    pypi_0    pypi


(torch) blake@workstation:~/cv/yolo/yolov5$ conda list torch
torch                     1.7.1+cu110              pypi_0    pypi
torchaudio                0.7.2                    pypi_0    pypi
torchvision               0.8.2+cu110              pypi_0    pypi

(torch) blake@workstation:~/cv/yolo/yolov5$ nvidia-smi 
Sun Jan 10 11:35:12 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   37C    P8    13W / 180W |    523MiB /  8118MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2070    Off  | 00000000:02:00.0 Off |                  N/A |
| 34%   14C    P8    17W / 175W |     10MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

glenn-jocher (Member) commented

@blakeliu best practice is to run Multi-GPU only with identical cards.
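
As an illustration of that advice, one might guard a DataParallel wrapper so it is only used when all visible GPUs report the same compute capability. A hedged sketch (wrap_multi_gpu is a hypothetical helper; this is not how YOLOv5 itself selects devices):

import torch
import torch.nn as nn

def wrap_multi_gpu(model):
    # Only enable DataParallel when every visible GPU has the same compute capability.
    caps = {torch.cuda.get_device_capability(i) for i in range(torch.cuda.device_count())}
    if torch.cuda.device_count() > 1 and len(caps) == 1:
        return nn.DataParallel(model.cuda())   # identical cards: multi-GPU is safe
    return model.cuda()                        # mixed cards or a single card: stay on one device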

blakeliu commented

@glenn-jocher Thank you for the advice.

tetsu-kikuchi commented Jun 2, 2021

For your information:
In my case, this error happened when there were multiple GPUs in my machine.
When I added --device 0 to python train.py, the error did not happen and the code worked correctly.

It seems the problem was using different types of GPU. In my case, I used two GPUs:
GeForce GTX 1070 Ti
GeForce RTX 2080 Ti
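
For reference, the effect of --device 0 can also be reproduced at the process level by hiding the other GPUs before torch initializes CUDA. This is a generic CUDA environment-variable trick, not YOLOv5-specific:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # must be set before torch creates a CUDA context

import torch
print(torch.cuda.device_count())           # now reports 1; cuda:0 maps to the selected card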

glenn-jocher (Member) commented

@tetsu-kikuchi interesting, thanks for the feedback! I've been thinking we should default to --device 0 rather than use all devices by default. Do you think this is a good idea?

tetsu-kikuchi commented Jun 3, 2021

@glenn-jocher Thank you for your response. Using multiple GPUs sometimes causes unexpected errors, and GPU-related error messages often make it hard to find the root cause. So I think setting --device 0 as the default would be convenient, especially for beginners (including me).

glenn-jocher (Member) commented

TODO: default to device 0 rather than all available devices.

@glenn-jocher glenn-jocher added the TODO High priority items label Jun 3, 2021
@glenn-jocher glenn-jocher reopened this Jun 3, 2021
@github-actions github-actions bot removed the Stale Stale and schedule for closing soon label Jun 4, 2021
tetsu-kikuchi commented Jun 7, 2021

Additional information:
Something strange, the opposite of the previous case, happened. On another machine with two GPUs (and the same YOLOv5 code), a cuDNN error occurred whether I set --device 0 or --device 1. The error did not occur only when I left --device at its default (i.e., using multiple GPUs).

Below is the error message when I set --device 0 or --device 1. I slightly customized the YOLOv5 code for my purposes, but only for miscellaneous things, mainly in utils/dataset.py.

Traceback (most recent call last):
  File "train.py", line 657, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 408, in train
    scaler.scale(loss).backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

The GPU information:

YOLOv5 🚀 v5.0-54-gf55730e torch 1.8.0 CUDA:0 (GeForce GTX 1080 Ti, 11178.5MB)
                                      CUDA:1 (GeForce GTX 1080 Ti, 11175.375MB)
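
One generic debugging step for cuDNN errors like CUDNN_STATUS_NOT_INITIALIZED (a way to isolate the problem, not a fix) is to disable cuDNN and check whether the native CUDA kernels run. If training then proceeds, the issue lies in the cuDNN installation or version pairing rather than the model code:

import torch

torch.backends.cudnn.enabled = False   # fall back to native (non-cuDNN) convolution kernels
# ...then rerun the failing training step: slower, but it isolates cuDNN as the culprit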

glenn-jocher (Member) commented

@tetsu-kikuchi since this error originates in torch, you should probably raise your issue in the PyTorch repository.

glenn-jocher (Member) commented Jun 7, 2021

@tetsu-kikuchi also, your YOLOv5 code is very out of date. To update:

  • Git – git pull from within your yolov5/ directory, or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – Force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – View updated notebooks (Open In Colab, Open In Kaggle)
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

tetsu-kikuchi commented

Thanks for your guidance.

github-actions bot (Contributor) commented Jul 8, 2021

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Access additional Ultralytics ⚡ resources:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

@github-actions github-actions bot added the Stale Stale and schedule for closing soon label Jul 8, 2021
@glenn-jocher glenn-jocher removed the TODO High priority items label Sep 26, 2021
glenn-jocher (Member) commented Sep 26, 2021

TODO removed as the original issue is now resolved. YOLOv5 training defaults to device 0 if CUDA is available, with multi-GPU or CPU training selectable via the --device argument:

python train.py --device 0,1,2,3
python train.py --device cpu
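
For illustration, a simplified sketch of how such a default could be implemented (pick_device is a hypothetical helper shown only for clarity; the actual YOLOv5 device-selection code, e.g. select_device in its utils, may differ):

import os
import torch

def pick_device(device=''):
    # device: '' (auto), '0', '0,1,2,3' or 'cpu'; a sketch, not the real YOLOv5 helper
    if device.lower() == 'cpu':
        return torch.device('cpu')
    if device:
        os.environ['CUDA_VISIBLE_DEVICES'] = device   # restrict to the requested GPUs; must run before CUDA init
    if torch.cuda.is_available():
        return torch.device('cuda:0')                 # default to the first visible GPU
    return torch.device('cpu')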
