
CUDA backend for the DNN module #14827

Merged: 129 commits, Oct 21, 2019

Conversation

@YashasSamaga (Contributor) commented Jun 18, 2019

More up-to-date info available here (unofficial)


How to build and use the CUDA backend?

How to use multiple GPUs?

There are many ways to make use of multiple GPUs. Here is the one I think is the safest and the least complex: it exploits the fact that the CUDA runtime library maintains a separate CUDA context for each CPU thread.

Suppose you have N devices.

1. Create N threads.
2. Assign a CUDA device to each thread by calling cudaSetDevice or cv::cuda::setDevice in that thread. Each thread is now associated with a device.
3. Create any number of cv::dnn::Net objects in any of those threads; each network will use the device associated with the thread it was created in for memory and computation (see the sketch below).
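A minimal sketch of this pattern (the model files and dummy input are placeholders, not part of this PR):

```cpp
#include <opencv2/dnn.hpp>
#include <opencv2/core/cuda.hpp>
#include <thread>
#include <vector>

static void run_on_device(int device_id)
{
    // Bind this thread's CUDA context to one device.
    cv::cuda::setDevice(device_id);

    // Nets created in this thread use the device selected above.
    cv::dnn::Net net = cv::dnn::readNet("model.weights", "model.cfg");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    cv::Mat input = cv::Mat::zeros(416, 416, CV_8UC3); // stand-in for real frames
    net.setInput(cv::dnn::blobFromImage(input, 1 / 255.0));
    cv::Mat out = net.forward(); // computed on this thread's device
}

int main()
{
    const int n = cv::cuda::getCudaEnabledDeviceCount();
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i)
        workers.emplace_back(run_on_device, i);
    for (auto& t : workers)
        t.join();
}
```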

Benchmarks

Demo Video: https://www.youtube.com/watch?v=ljCfluWYymM

Project summary/benchmarks: https://gist.github.com/YashasSamaga/a84cf2826ab2dc755005321fe17cd15d

Current Support Matrix for this PR (not updated)

| Blip | Meaning |
| --- | --- |
| ✔️ | supports all the configurations that are supported by all the existing backends (and might support more than what's currently supported) |
| 🔵 | partially supported (falls back to CPU for unsupported configurations) |
| (blank) | not supported (falls back to CPU) |

| Layer | Status | Constraints | Notes |
| --- | --- | --- | --- |
| Activations | ✔️ | | |
| Batch Normalization | ✔️ | | |
| Blank Layer | ✔️ | | |
| Concat Layer | ✔️ | | |
| Const Layer | ✔️ | | |
| Convolution 2d | ✔️ | | asymmetric padding is disabled in the layer constructor but the backend supports it |
| Convolution 3d | ✔️ | | asymmetric padding is disabled in the layer constructor but the backend supports it |
| Crop and resize | | | |
| Crop Layer | ✔️ | | forwarded to Slice Layer |
| Detection Output Layer | | | |
| Deconvolution 2d | 🔵 | padding configuration should not lead to extra uneven padding | |
| Deconvolution 3d | 🔵 | padding configuration should not lead to extra uneven padding | |
| Elementwise Layers | ✔️ | | |
| Eltwise Layer | ✔️ | | |
| Flatten Layer | ✔️ | | |
| Fully Connected Layer | ✔️ | | |
| Input Layer | | | |
| Interp Layer | ✔️ | | |
| Local Response Normalization | ✔️ | | |
| Max Unpooling 2d | ✔️ | | |
| Max Unpooling 3d | ✔️ | | |
| MVN Layer | | | |
| Normalize Layer | 🔵 | only L1 and L2 norm supported | |
| Padding Layer | ✔️ | | |
| Permute Layer | ✔️ | | |
| Pooling 2d | 🔵 | only max and average pooling supported | supports asymmetric padding |
| Pooling 3d | 🔵 | only max and average pooling supported | supports asymmetric padding |
| Prior Box Layer | ✔️ | | |
| Proposal Layer | | | |
| Region Layer | ✔️ | | NMS performed using CPU |
| Reorg Layer | ✔️ | | |
| Reshape Layer | ✔️ | | |
| Resize Layer | ✔️ | | |
| Scale Layer | ✔️ | | |
| Shift Layer | ✔️ | | forwarded to Scale Layer |
| Shuffle Channel Layer | ✔️ | | |
| Slice Layer | ✔️ | | |
| Softmax Layer | ✔️ | | |
| Split Layer | ✔️ | | |
| LSTM Layer | | | |

Known issues:

  1. Tests for some of the SSD-based networks fail on Jetson Nano

References: #14585

Results:

force_builders_only=Custom,linux,docs
buildworker:Custom=linux-4
docker_image:Custom=ubuntu-cuda:18.04

@YashasSamaga force-pushed the cuda4dnn-csl-low branch 2 times, most recently from 5717c7f to 359bf93 on June 18, 2019 13:29
@alalek (Member) left a comment

Good progress!

Please note that we usually do not merge large code contributions without corresponding tests.
Also, we prefer to merge completed tasks rather than helper parts.

So, consider working on this GSoC task in a single PR (unless you have another agreement with your mentor).

Some build-related comments are below.

Resolved review threads:

- modules/dnn/src/cuda4dnn/csl/cudnn.cpp
- modules/dnn/src/cuda4dnn/csl/stream.cpp
- modules/dnn/include/opencv2/dnn/csl/cublas.hpp
@YashasSamaga changed the title from "add low-level CSL components for cuda4dnnn" to "[WIP] CUDA backend for the DNN module" on Jun 21, 2019
@YashasSamaga (Contributor, Author) commented Jun 21, 2019

Do I have to use CV_OVERRIDE and CV_FINAL? I presume they were added for portability, but now that both final and override are keywords in C++11, should they still be used?

Can I use std::shared_ptr instead of cv::Ptr? There isn't a make_shared equivalent, and makePtr doesn't do what std::make_shared does.

Is it fine to force push occasionally when there isn't any dependent stuff like reviews in between?

@alalek (Member) commented Jun 21, 2019

> CV_OVERRIDE and CV_FINAL

They are used to avoid excessive merge issues from the 3.4 branch.
Since your code is in the master branch only, this problem does not apply, so you can use the C++ keywords/modifiers.

> use std::shared_ptr instead of cv::Ptr

Feel free to use std::shared_ptr (but it is not supported by the bindings generator, so be careful with the public API).

> makePtr doesn't do what std::make_shared does.

In the master branch it is just a wrapper, so it should do the same thing.

> Is it fine to force push

It is OK.
Also, rebasing is preferred over "merge" commits (it is easy to do with one squashed commit: squash first, then rebase).

@YashasSamaga force-pushed the cuda4dnn-csl-low branch 3 times, most recently from 79c65f0 to 2941d74 on July 2, 2019 06:29
@davisking commented:

Seems like it would be implementation-defined at worst, rather than UB. Are you sure it's UB? If it's OK in C++17 and works in our case, I think it's fine. I would be surprised if some compilers defined std::iterator_traits<T>::iterator_category for non-iterators in C++11.
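(A sketch of the detection idiom under discussion; this is an illustration, not the PR's actual code. Since C++17, std::iterator_traits is required to be SFINAE-friendly, so instantiating it on a non-iterator yields an empty traits class rather than an error; before C++17 the standard gave no such guarantee.)

```cpp
#include <iterator>
#include <type_traits>

// std::void_t is C++17; a local equivalent keeps the sketch C++11-compatible.
template <class...>
using void_t = void;

template <class T, class = void>
struct is_iterator : std::false_type {};

// Well-defined in C++17 because iterator_traits<T> is SFINAE-friendly there.
template <class T>
struct is_iterator<T, void_t<typename std::iterator_traits<T>::iterator_category>>
    : std::true_type {};

static_assert(is_iterator<int*>::value, "pointers are iterators");
static_assert(!is_iterator<int>::value, "int is not an iterator");
```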

@YashasSamaga force-pushed the cuda4dnn-csl-low branch 2 times, most recently from a818297 to 3584d72 on July 25, 2019 18:45
@YashasSamaga force-pushed the cuda4dnn-csl-low branch 2 times, most recently from 9279922 to becb664 on August 7, 2019 05:50
@MonocleSecurity commented:

@isra60 The NVIDIA Video Codec SDK allows you to decode on the GPU, which you can then use to create a cv::GpuMat.

@YashasSamaga (Contributor, Author) commented:

@molyswu The exception will have a fairly detailed message stating what exactly caused it to be raised. If you still haven't solved it, I think your question would be more suitable for answers.opencv.org.
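(A minimal sketch of surfacing that message; not from this thread. It assumes the backend's exceptions derive from cv::Exception, and the file names are placeholders.)

```cpp
#include <opencv2/dnn.hpp>
#include <iostream>

int main()
{
    try {
        // readNetFromDarknet expects the .cfg first, then the .weights file.
        cv::dnn::Net net = cv::dnn::readNetFromDarknet("yolov3.cfg", "yolov3.weights");
        net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
        net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);
        net.setInput(cv::dnn::blobFromImage(cv::Mat::zeros(416, 416, CV_8UC3)));
        net.forward();
    } catch (const cv::Exception& e) {
        std::cerr << e.what() << std::endl; // prints the detailed reason
    }
}
```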

@MonocleSecurity I have opened an issue for discussion. There are a few problems which need to be sorted out.

@pgirgis commented Jan 28, 2020

Tested on a Jetson TX2 running JetPack 4.2, CUDA 10.0, and cuDNN 7.6.5; it works fine. The above test using YOLOv3 delivers about 5 FPS, a substantial improvement over the CPU, which ran at about 0.33 FPS.

@YashasSamaga (Contributor, Author) commented:

@pgirgis what was the size of the input image you used?

@pgirgis commented Jan 28, 2020

The original image was 872x586. I resized the input image to 416x416.

I tested with CUDA_FP16 and got slightly higher results (6 FPS).
https://miro.medium.com/max/1744/1*EYFejGUjvjPcc4PZTwoufw.jpeg

With only 256 CUDA cores on the TX2, this is roughly what I was expecting. I get 7 FPS when using FP16.
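(For reference, switching between the FP32 and FP16 CUDA targets is a one-line change; a sketch, assuming an OpenCV build with the CUDA backend enabled:)

```cpp
net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA_FP16); // or DNN_TARGET_CUDA for FP32
```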

@isra60 commented Jan 28, 2020

> @isra60 The NVIDIA Video Codec SDK allows you to decode on the GPU, which you can then use to create a cv::GpuMat.

Do you have any tutorial or example?

Also, it would be a good improvement if this feature request were implemented:
#15999

Right now I get really good performance using a GStreamer pipeline with the new DeepStream hardware decoder from NVIDIA.

@pgirgis commented Jan 28, 2020

> > @isra60 The NVIDIA Video Codec SDK allows you to decode on the GPU, which you can then use to create a cv::GpuMat.
>
> Do you have any tutorial or example?
>
> Also, it would be a good improvement if this feature request were implemented:
> #15999
>
> Right now I get really good performance using a GStreamer pipeline with the new DeepStream hardware decoder from NVIDIA.

I was getting 300 FPS on the TX2 using VideoCapture on 4.2.0. I have not looked into it yet (just a side test), but I assume it has been GPU-enabled. This is comparable to the NVIDIA SDK, I believe.

@isra60 commented Jan 28, 2020

> I was getting 300 FPS on the TX2 using VideoCapture on 4.2.0. I have not looked into it yet (just a side test), but I assume it has been GPU-enabled. This is comparable to the NVIDIA SDK, I believe.

But using the standard VideoCapture? Which backend? FFmpeg? AFAIK FFmpeg is not accelerated on the Jetson TX2 (maybe there is an update).

@MohamedAliRashad commented:

Does anyone know of a check to tell whether my build has CUDA acceleration support or not?

I found this, but it's for Windows, and I am using Ubuntu 18.04.

@alalek (Member) commented Feb 18, 2020

You should check getBuildInformation() output.
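(A minimal check along those lines; a sketch, not an official snippet:)

```cpp
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <iostream>

int main()
{
    // The build summary lists whether CUDA/cuDNN support was compiled in.
    std::cout << cv::getBuildInformation() << std::endl;

    // Non-zero means the runtime can also see a usable CUDA device.
    std::cout << "CUDA devices: "
              << cv::cuda::getCudaEnabledDeviceCount() << std::endl;
}
```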

@molyswu commented Feb 19, 2020

Hi, what's the reason for this? VS2017 + OpenCV 4.2.0 + CUDA 10.0 + cuDNN 7.6.5; running a YOLOv3 model I hit this bug:

net.setPreferableBackend(DNN_BACKEND_CUDA);
net.setPreferableTarget(DNN_TARGET_CUDA);

Unhandled exception at 0x00007FFD17B04048 (in test2.exe): Microsoft C++ exception: cv::dnn::cuda4dnn::csl::CUDAException at memory location 0x000000FE94BADDD0.

test2.exe prints the following error:

OpenCV(4.2.0) Error: Parsing error (Failed to parse NetParameter file: yolov3_1_80000.weights) in cv::dnn::dnn4_v20191202::readNetFromDarknet, file E:\tools\opencv-4.2.0\modules\dnn\src\darknet\darknet_importer.cpp, line 214
.\pei010718
[ INFO:0] global E:\tools\opencv-4.2.0\modules\core\src\ocl.cpp (891) cv::ocl::haveOpenCL Initialize OpenCL runtime...
[ INFO:0] global E:\tools\opencv-4.2.0\modules\dnn\src\dnn.cpp (2204) cv::dnn::dnn4_v20191202::Net::Impl::initCUDABackend CUDA backend will fallback to the CPU implementation for the layer "_input" of type NetInputLayer

Thanks!

@andyrey commented Feb 21, 2020

Hello Yashas, I failed to build your code on Windows because it can't find opencv_dnn420.lib. I built OpenCV 4.2.0 with several of the latest contrib modules using CMake and Visual Studio 2015, but it never produced this lib or DLL. Can you tell me how to build it, or just upload the lib?

@YashasSamaga (Contributor, Author) commented:

@andyrey commented Feb 26, 2020

@YashasSamaga Thank you, Yashas! I used your sample code and the opencv_world DLL from your recommended reference (the pre-built case), and achieved 2 ms/frame with the YOLO-tiny configuration and a 320x320 input blob! Before, I had 28-32 ms; it's magic, great work!

@tuteming commented Mar 9, 2020

My config is OpenCV 4.2 with contrib, a 1080 Ti, compiled with CUDA via CMake and VS2015 (CUDA 10.0, cuDNN 7.4.2). Everything builds, but when I run your yolov3_opencv_dnn_cuda.cpp I get:

[ WARN:0] global E:\opencv_cuda_4.2\opencv-4.2.0\modules\dnn\src\dnn.cpp (1363) cv::dnn::dnn4_v20191202::Net::Impl::setUpNet DNN module was not built with CUDA backend; switching to CPU

It runs in CPU mode, not GPU mode. Can you tell me what my problem is?
Thanks.

@andyrey commented Mar 11, 2020

I have encountered a strange phenomenon: when I use cv::imshow("Window_name", frame_show), my processing time is ~9 ms; when I remove the graphical output by commenting out this call, the processing time increases (!) to ~15 ms. I can't understand why; before OpenCV 4.2.0 I used to remove the output to decrease processing time!
I work on Windows 10, with YOLO-tiny and a 320x320 input.

@YashasSamaga (Contributor, Author) commented Mar 11, 2020

@tuteming https://www.pyimagesearch.com/2020/02/03/how-to-use-opencvs-dnn-module-with-nvidia-gpus-cuda-and-cudnn/

@andyrey

  1. Were you using CPU for inference prior to 4.2.0 and GPU since 4.2.0?
  2. Can you share an overall structure of your code around cv::imshow? Is it in a loop?
  3. What do you mean by proc time?

I am going to take a guess here, but I think it might have to do with your CPU inference being treated as a compute-bound process and your GPU inference as an IO-bound process (at least from the scheduler's perspective). OSes generally have different scheduling policies for IO-bound and CPU-bound processes.

@andyrey commented Mar 11, 2020

@YashasSamaga

1. I used Darknet-based YOLO inference with OpenCV 3.3.0 before, but since your current release outperformed it, I reimplemented my code based on this PR. In the former case I always saw the time decrease when I switched off the OpenCV graphical output.

2. My cv::imshow is at the end of the per-frame loop, in main(). I use:

my_draw(frame_show, my_parameters...);
cv::imshow("My window", frame_show);

I use your postprocess(Mat& frame, ...) with this loop removed:

for (size_t i = 0; i < indices.size(); ++i)
{
    ...
    drawPred(classIds[idx], ...);
    ...
}

because now I call your code outside of main() and do my own drawing in main(). But if I remove my drawing and switch yours on, I observe the same effect!

3. Proc time = processing time, sorry for the abbreviation.

@andyrey commented Mar 11, 2020

@YashasSamaga
Yashas, I went back from my possibly flawed code to your original one and repeated the experiment.
Results on a video of 3200 frames:
average time 2.55061 ms/frame when I removed cv::imshow(...) from your code,
average time 2.13507 ms/frame with it.
Again, adding the graphical output makes the program run faster; it is strange.
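(A minimal timing sketch that isolates inference from drawing and display, which may help pin down effects like the above; the file names are placeholders and the model setup is assumed, not taken from this thread:)

```cpp
#include <opencv2/dnn.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/videoio.hpp>
#include <iostream>

int main()
{
    cv::VideoCapture cap("video.mp4"); // placeholder input
    cv::dnn::Net net = cv::dnn::readNet("model.weights", "model.cfg"); // placeholder model
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    cv::TickMeter tm;
    cv::Mat frame;
    while (cap.read(frame))
    {
        net.setInput(cv::dnn::blobFromImage(frame, 1 / 255.0, cv::Size(320, 320)));

        tm.start();
        cv::Mat out = net.forward(); // measured region: inference only
        tm.stop();

        cv::imshow("preview", frame); // display stays outside the measured region
        if (cv::waitKey(1) == 27)
            break;
    }
    std::cout << "avg inference ms/frame: " << tm.getAvgTimeMilli() << std::endl;
}
```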

@GTRwolf commented Mar 13, 2020

@tuteming
Did you solve it? I also face this problem.

@andyrey commented Mar 13, 2020

@tuteming
@GTRwolf
I had the same problem, on Windows 10, VS2015, C++.
It doesn't matter whether you succeeded in building the proper DLL from OpenCV 4.2.0 yourself (I failed). Following Yashas's advice, I took a pre-built one (the VS2019 build works too) from
https://jamesbowley.co.uk/accelerate-opencv-4-2-0-build-with-cuda-and-python-bindings/
Get opencv_world420.dll there (it weighs 647,277 KB), put it in your working directory, and run. If you have CUDA and cuDNN installed on your machine, your program will use your NVIDIA GPU.

@ccl-private commented Mar 25, 2020

> My config is OpenCV 4.2 with contrib, a 1080 Ti, compiled with CUDA via CMake and VS2015 (CUDA 10.0, cuDNN 7.4.2). Everything builds, but when I run your yolov3_opencv_dnn_cuda.cpp I get:
>
> [ WARN:0] global E:\opencv_cuda_4.2\opencv-4.2.0\modules\dnn\src\dnn.cpp (1363) cv::dnn::dnn4_v20191202::Net::Impl::setUpNet DNN module was not built with CUDA backend; switching to CPU
>
> It runs in CPU mode, not GPU mode. Can you tell me what my problem is?

That is because your cuDNN version is unsuitable. Check your CMake log; it will tell you that the cuDNN version should be at least 7.5.

@qlong1505 commented:

Has anyone tested the OpenCV DNN module with the CUDA backend on the Jetson Nano? I used the https://gist.github.com/YashasSamaga/6d37bc403c0934329b078b4bad98c7f2 script and compiled successfully. But when I test it, it shows this error message:

what(): OpenCV(4.3.0) /home/user/opencv/modules/core/src/cuda_info.cpp:62: error: (-217:Gpu API call) unknown error in function 'getCudaEnabledDeviceCount'

@dkurt (Member) commented Apr 14, 2020

@qlong1505, please use the forum for usage questions: https://answers.opencv.org/questions/. This PR already has 236 messages.

@opencv locked as too heated and limited conversation to collaborators on Apr 14, 2020
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
CUDA backend for the DNN module

* stub cuda4dnn design

* minor fixes for tests and doxygen

* add csl public api directory to module headers

* add low-level CSL components

* add high-level CSL components

* integrate csl::Tensor into backbone code

* switch to CPU iff unsupported; otherwise, fail on error

* add fully connected layer

* add softmax layer

* add activation layers

* support arbitrary rank TensorDescriptor

* pass input wrappers to `initCUDA()`

* add 1d/2d/3d-convolution

* add pooling layer

* reorganize and refactor code

* fixes for gcc, clang and doxygen; remove cxx14/17 code

* add blank_layer

* add LRN layer

* add rounding modes for pooling layer

* split tensor.hpp into tensor.hpp and tensor_ops.hpp

* add concat layer

* add scale layer

* add batch normalization layer

* split math.cu into activations.cu and math.hpp

* add eltwise layer

* add flatten layer

* add tensor transform api

* add asymmetric padding support for convolution layer

* add reshape layer

* fix rebase issues

* add permute layer

* add padding support for concat layer

* refactor and reorganize code

* add normalize layer

* optimize bias addition in scale layer

* add prior box layer

* fix and optimize normalize layer

* add asymmetric padding support for pooling layer

* add event API

* improve pooling performance for some padding scenarios

* avoid over-allocation of compute resources to kernels

* improve prior box performance

* enable layer fusion

* add const layer

* add resize layer

* add slice layer

* add padding layer

* add deconvolution layer

* fix channelwise ReLU initialization

* add vector traits

* add vectorized versions of relu, clipped_relu, power

* add vectorized concat kernels

* improve concat_with_offsets performance

* vectorize scale and bias kernels

* add support for multi-billion element tensors

* vectorize prior box kernels

* fix address alignment check

* improve bias addition performance of conv/deconv/fc layers

* restructure code for supporting multiple targets

* add DNN_TARGET_CUDA_FP64

* add DNN_TARGET_FP16

* improve vectorization

* add region layer

* improve tensor API, add dynamic ranks

1. use ManagedPtr instead of a Tensor in backend wrapper
2. add new methods to tensor classes
  - size_range: computes the combined size for a given axis range
  - tensor span/view can be constructed from a raw pointer and shape
3. the tensor classes can change their rank at runtime (previously rank was fixed at compile-time)
4. remove device code from tensor classes (as they are unused)
5. enforce strict conditions on tensor class APIs to improve debugging ability

* fix parametric relu activation

* add squeeze/unsqueeze tensor API

* add reorg layer

* optimize permute and enable 2d permute

* enable 1d and 2d slice

* add split layer

* add shuffle channel layer

* allow tensors of different ranks in reshape primitive

* patch SliceOp to allow Crop Layer

* allow extra shape inputs in reshape layer

* use `std::move_backward` instead of `std::move` for insert in resizable_static_array

* improve workspace management

* add spatial LRN

* add nms (cpu) to region layer

* add max pooling with argmax ( and a fix to limits.hpp)

* add max unpooling layer

* rename DNN_TARGET_CUDA_FP32 to DNN_TARGET_CUDA

* update supportBackend to be more rigorous

* remove stray include from preventing non-cuda build

* include op_cuda.hpp outside condition #if

* refactoring, fixes and many optimizations

* drop DNN_TARGET_CUDA_FP64

* fix gcc errors

* increase max. tensor rank limit to six

* add Interp layer

* drop custom layers; use BackendNode

* vectorize activation kernels

* fixes for gcc

* remove wrong assertion

* fix broken assertion in unpooling primitive

* fix build errors in non-CUDA build

* completely remove workspace from public API

* fix permute layer

* enable accuracy and perf. tests for DNN_TARGET_CUDA

* add asynchronous forward

* vectorize eltwise ops

* vectorize fill kernel

* fixes for gcc

* remove CSL headers from public API

* remove csl header source group from cmake

* update min. cudnn version in cmake

* add numerically stable FP32 log1pexp

* refactor code

* add FP16 specialization to cudnn based tensor addition

* vectorize scale1 and bias1 + minor refactoring

* fix doxygen build

* fix invalid alignment assertion

* clear backend wrappers before allocateLayers

* ignore memory lock failures

* do not allocate internal blobs

* integrate NVTX

* add numerically stable half precision log1pexp

* fix indentation, following coding style, improve docs

* remove accidental modification of IE code

* Revert "add asynchronous forward"

This reverts commit 1154b9d.

* [cmake] throw error for unsupported CC versions

* fix rebase issues

* add more docs, refactor code, fix bugs

* minor refactoring and fixes

* resolve warnings/errors from clang

* remove haveCUDA() checks from supportBackend()

* remove NVTX integration

* changes based on review comments

* avoid exception when no CUDA device is present

* add color code for CUDA in Net::dump