CUDA backend for the DNN module #14827
Conversation
Force-pushed from 5717c7f to 359bf93
Good progress!
Please note that we usually do not merge large code parts without corresponding tests.
Also, we prefer to merge completed tasks rather than partial helper pieces.
So consider working on this GSoC task in a single PR (unless you have another agreement with your mentor).
Some build-related comments are below.
Force-pushed from 46db2b1 to fbd05d3
Do I have to use …? Can I use …? Is it fine to force push occasionally when there isn't any dependent stuff like reviews in between?
It is used to avoid excessive merge issues from ….
Feel free to use …. In the master branch it is just a wrapper, so it should do the same things.
It is OK.
Force-pushed from c8fd75b to 30b294e
Force-pushed from 79c65f0 to 2941d74
Force-pushed from 39837c8 to b89d7e0
Seems like it would be implementation-defined at worst, rather than UB. Are you sure it's UB? If it's OK in C++17 and works in our case, I think it's fine. I would be surprised if some compilers defined …
Force-pushed from a818297 to 3584d72
Force-pushed from 9279922 to becb664
@isra60 The NVIDIA Video Codec SDK allows you to decode on the GPU, which you can then use to create a cv::GpuMat.
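A minimal sketch of that decode path, assuming a build with OpenCV's cudacodec module enabled (it wraps the NVIDIA Video Codec SDK; the file name is a placeholder):

```cpp
// Decode on the GPU via the cudacodec module; frames stay in device memory.
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudacodec.hpp>

int main()
{
    cv::Ptr<cv::cudacodec::VideoReader> reader =
        cv::cudacodec::createVideoReader(std::string("video.mp4")); // placeholder path

    cv::cuda::GpuMat frame;
    while (reader->nextFrame(frame))
    {
        // `frame` is already a GpuMat resident on the device;
        // pass it to further CUDA processing without a download.
    }
    return 0;
}
```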
@molyswu The exception will have a fairly detailed message stating what exactly caused the exception to be raised. If you still haven't solved it, I think your question will be more suitable for answers.opencv.org. @MonocleSecurity I have opened an issue for discussion. There are a few problems which need to be sorted out.
Tested on a Jetson TX2 running JetPack 4.2 (CUDA 10.0, cuDNN 7.6.5) and it works fine. The above test using YOLOv3 delivers about 5 FPS, a substantial improvement over the CPU, which was running at about 0.33 FPS.
@pgirgis what was the size of the input image you used?
The original image was 872x586. I resized the input image to 416x416. I tested with CUDA_FP16 and got slightly higher results (6 FPS). With the TX2 having only 256 CUDA cores, this is roughly what I was expecting. I get 7 FPS when using FP16.
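For context, such FPS figures are typically obtained by timing `net.forward()` alone. A hedged sketch of one way to do this (the model paths are placeholders, and the first run is excluded as warm-up):

```cpp
// Measure DNN inference throughput with the CUDA FP16 target.
#include <opencv2/dnn.hpp>
#include <opencv2/core.hpp>
#include <opencv2/core/utility.hpp>
#include <iostream>

int main()
{
    cv::dnn::Net net = cv::dnn::readNetFromDarknet("yolov3.cfg", "yolov3.weights");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA_FP16); // or DNN_TARGET_CUDA for FP32

    cv::Mat img(416, 416, CV_8UC3, cv::Scalar::all(127));   // dummy 416x416 input
    cv::Mat blob = cv::dnn::blobFromImage(img, 1 / 255.0, cv::Size(416, 416));

    net.setInput(blob);
    net.forward(); // warm-up run: triggers backend initialization

    cv::TickMeter tm;
    for (int i = 0; i < 100; i++)
    {
        net.setInput(blob);
        tm.start();
        net.forward();
        tm.stop();
    }
    std::cout << "FPS: " << 1000.0 / tm.getAvgTimeMilli() << std::endl;
    return 0;
}
```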
Do you have any tutorial or example? Also, it could be a good improvement if this feature request were implemented. For now I get really good performance by using a GStreamer pipeline with the new DeepStream hardware decoder from NVIDIA.
I was getting 300 FPS on the TX2 using VideoCapture on 4.2.0. I have not looked into it yet (just a side test) but assume it has been GPU-enabled. This is comparable to the NVIDIA SDK, I believe.
But using the standard VideoCapture? Which backend? FFmpeg? AFAIK FFmpeg is not accelerated on a Jetson TX2 (maybe there is an update).
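For illustration, a hedged sketch of the hardware-decoded GStreamer pipeline mentioned above, driven through VideoCapture's GStreamer backend (which bypasses FFmpeg entirely). The element names are the JetPack 4.x ones and the file name is a placeholder:

```cpp
// H.264 file decoded by the Jetson's hardware decoder via GStreamer.
// Assumes OpenCV was built with GStreamer support.
#include <opencv2/videoio.hpp>
#include <string>

int main()
{
    std::string pipeline =
        "filesrc location=video.mp4 ! qtdemux ! h264parse ! "
        "nvv4l2decoder ! nvvidconv ! video/x-raw,format=BGRx ! "
        "videoconvert ! video/x-raw,format=BGR ! appsink";

    cv::VideoCapture cap(pipeline, cv::CAP_GSTREAMER);
    cv::Mat frame;
    while (cap.read(frame))
    {
        // frames arrive already decoded in hardware
    }
    return 0;
}
```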
Does anyone know how to check whether my build has CUDA acceleration support? I found this, but it's for Windows and I am using Ubuntu 18.04.
You should check …
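A minimal sketch of one way to verify this from code on any OS (both calls are part of the public OpenCV API):

```cpp
// Verify CUDA support in the current OpenCV build.
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <iostream>

int main()
{
    // Dumps the full build configuration; look for the CUDA/cuDNN section.
    std::cout << cv::getBuildInformation() << std::endl;

    // Returns 0 if OpenCV was built without CUDA or no device is visible.
    std::cout << "CUDA devices: "
              << cv::cuda::getCudaEnabledDeviceCount() << std::endl;
    return 0;
}
```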
Hi, what's the reason for this? net.setPreferableBackend(DNN_BACKEND_CUDA); prints the following error in test2.exe. Thanks!
Hello Yashas, I failed to build your code on Windows because opencv_dnn420.lib can't be found. I have built OpenCV 4.2.0 with several of the latest contribs using CMake and Visual Studio 2015, but it never yielded this lib nor the DLL. Can you tell me how to build it, or just upload this lib?
@YashasSamaga Thank you, Yashas! I used your sample code and opencv_world.dll from your recommended reference (pre-built case), and achieved 2 ms/frame with the YOLO-tiny configuration and a 320x320 input blob! Before I had 28-32 ms; it is magic, great work!
My config is OpenCV 4.2 with contrib, a 1080 Ti, compiled with CUDA via CMake and VS2015.
I have encountered a strange phenomenon: when I use cv::imshow("Window_name", frame_show), my processing time is ~9 ms; when I remove the graphical output by commenting out this function, the processing time increases (!) to ~15 ms. I can't understand why; before using OpenCV 4.2.0 I removed the output to decrease processing time!
@tuteming https://www.pyimagesearch.com/2020/02/03/how-to-use-opencvs-dnn-module-with-nvidia-gpus-cuda-and-cudnn/
I am going to take a guess here, but I think it might have to do with your CPU inference being treated as a compute-bound process and GPU inference being treated as an IO-bound process (at least from the scheduler's perspective). OSes generally have different scheduling policies for IO-bound and CPU-bound processes.
@YashasSamaga I use your postprocess(Mat& frame, ...) having removed the loop …
That is because your cuDNN version is unsuitable. Check your CMake log, and it will tell you that the cuDNN version should be at least 7.5.
Has anyone tested the OpenCV DNN CUDA backend on the Jetson Nano? I used the https://gist.github.com/YashasSamaga/6d37bc403c0934329b078b4bad98c7f2 script and it compiled successfully, but when I tested it, it showed an error message.
@qlong1505, please use a forum for usage questions: https://answers.opencv.org/questions/. This PR already has 236 messages.
CUDA backend for the DNN module

* stub cuda4dnn design
* minor fixes for tests and doxygen
* add csl public api directory to module headers
* add low-level CSL components
* add high-level CSL components
* integrate csl::Tensor into backbone code
* switch to CPU iff unsupported; otherwise, fail on error
* add fully connected layer
* add softmax layer
* add activation layers
* support arbitrary rank TensorDescriptor
* pass input wrappers to `initCUDA()`
* add 1d/2d/3d-convolution
* add pooling layer
* reorganize and refactor code
* fixes for gcc, clang and doxygen; remove cxx14/17 code
* add blank_layer
* add LRN layer
* add rounding modes for pooling layer
* split tensor.hpp into tensor.hpp and tensor_ops.hpp
* add concat layer
* add scale layer
* add batch normalization layer
* split math.cu into activations.cu and math.hpp
* add eltwise layer
* add flatten layer
* add tensor transform api
* add asymmetric padding support for convolution layer
* add reshape layer
* fix rebase issues
* add permute layer
* add padding support for concat layer
* refactor and reorganize code
* add normalize layer
* optimize bias addition in scale layer
* add prior box layer
* fix and optimize normalize layer
* add asymmetric padding support for pooling layer
* add event API
* improve pooling performance for some padding scenarios
* avoid over-allocation of compute resources to kernels
* improve prior box performance
* enable layer fusion
* add const layer
* add resize layer
* add slice layer
* add padding layer
* add deconvolution layer
* fix channelwise ReLU initialization
* add vector traits
* add vectorized versions of relu, clipped_relu, power
* add vectorized concat kernels
* improve concat_with_offsets performance
* vectorize scale and bias kernels
* add support for multi-billion element tensors
* vectorize prior box kernels
* fix address alignment check
* improve bias addition performance of conv/deconv/fc layers
* restructure code for supporting multiple targets
* add DNN_TARGET_CUDA_FP64
* add DNN_TARGET_FP16
* improve vectorization
* add region layer
* improve tensor API, add dynamic ranks
  1. use ManagedPtr instead of a Tensor in backend wrapper
  2. add new methods to tensor classes
     - size_range: computes the combined size for a given axis range
     - tensor span/view can be constructed from a raw pointer and shape
  3. the tensor classes can change their rank at runtime (previously rank was fixed at compile-time)
  4. remove device code from tensor classes (as they are unused)
  5. enforce strict conditions on tensor class APIs to improve debugging ability
* fix parametric relu activation
* add squeeze/unsqueeze tensor API
* add reorg layer
* optimize permute and enable 2d permute
* enable 1d and 2d slice
* add split layer
* add shuffle channel layer
* allow tensors of different ranks in reshape primitive
* patch SliceOp to allow Crop Layer
* allow extra shape inputs in reshape layer
* use `std::move_backward` instead of `std::move` for insert in resizable_static_array
* improve workspace management
* add spatial LRN
* add nms (cpu) to region layer
* add max pooling with argmax (and a fix to limits.hpp)
* add max unpooling layer
* rename DNN_TARGET_CUDA_FP32 to DNN_TARGET_CUDA
* update supportBackend to be more rigorous
* remove stray include from preventing non-cuda build
* include op_cuda.hpp outside condition #if
* refactoring, fixes and many optimizations
* drop DNN_TARGET_CUDA_FP64
* fix gcc errors
* increase max. tensor rank limit to six
* add Interp layer
* drop custom layers; use BackendNode
* vectorize activation kernels
* fixes for gcc
* remove wrong assertion
* fix broken assertion in unpooling primitive
* fix build errors in non-CUDA build
* completely remove workspace from public API
* fix permute layer
* enable accuracy and perf. tests for DNN_TARGET_CUDA
* add asynchronous forward
* vectorize eltwise ops
* vectorize fill kernel
* fixes for gcc
* remove CSL headers from public API
* remove csl header source group from cmake
* update min. cudnn version in cmake
* add numerically stable FP32 log1pexp
* refactor code
* add FP16 specialization to cudnn based tensor addition
* vectorize scale1 and bias1 + minor refactoring
* fix doxygen build
* fix invalid alignment assertion
* clear backend wrappers before allocateLayers
* ignore memory lock failures
* do not allocate internal blobs
* integrate NVTX
* add numerically stable half precision log1pexp
* fix indentation, following coding style, improve docs
* remove accidental modification of IE code
* Revert "add asynchronous forward" (this reverts commit 1154b9d)
* [cmake] throw error for unsupported CC versions
* fix rebase issues
* add more docs, refactor code, fix bugs
* minor refactoring and fixes
* resolve warnings/errors from clang
* remove haveCUDA() checks from supportBackend()
* remove NVTX integration
* changes based on review comments
* avoid exception when no CUDA device is present
* add color code for CUDA in Net::dump
More up-to-date info available here (unofficial)
How to build and use the CUDA backend?
How to use multiple GPUs? (a sketch follows below)
Benchmarks
Demo Video: https://www.youtube.com/watch?v=ljCfluWYymM
Project summary/benchmarks: https://gist.github.com/YashasSamaga/a84cf2826ab2dc755005321fe17cd15d
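On the multi-GPU question above: a sketch of one common pattern, under the assumption that the CUDA backend binds to whichever device is current when the Net initializes: one thread and one Net per GPU, selected with cv::cuda::setDevice ("model.onnx" is a placeholder).

```cpp
// One thread and one Net per GPU; each thread pins itself to a device.
#include <opencv2/dnn.hpp>
#include <opencv2/core/cuda.hpp>
#include <thread>
#include <vector>

static void worker(int device_id)
{
    cv::cuda::setDevice(device_id); // bind this thread's work to one GPU
    cv::dnn::Net net = cv::dnn::readNet("model.onnx"); // placeholder model
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);
    // ... feed this net its own share of the workload
}

int main()
{
    const int n = cv::cuda::getCudaEnabledDeviceCount();
    std::vector<std::thread> workers;
    for (int i = 0; i < n; i++)
        workers.emplace_back(worker, i);
    for (auto& t : workers)
        t.join();
    return 0;
}
```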
Support Matrix for this PR
## Current Support Matrix (not updated)
Known issues:
References: #14585
Results: