
cudaErrorInvalidConfiguration in FusedBatchNormV3 #7316

Closed · bdnkth opened this issue Apr 12, 2021 · 14 comments · Fixed by #7329

Labels: ep:CUDA (issues related to the CUDA execution provider)

bdnkth commented Apr 12, 2021

Describe the bug
I custom-trained the EfficientDetD0 model from the TensorFlow model zoo for object detection and exported it to ONNX with tf2onnx, using opset 11 and a fixed input shape of [1,512,512,3].
Using that ONNX model in ONNX Runtime from C++, I run into a cudaErrorInvalidConfiguration in FusedBatchNormV3__528 of the EfficientDet.
The full error message is:
2021-04-12 14:58:25.6606830 [E:onnxruntime:, sequential_executor.cc:339 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running Transpose node. Name:'StatefulPartitionedCall/EfficientDet-D0/bifpn/node_03/1_dn_lvl_5/input_0_up_lvl_5/1x1_pre_sample/batchnorm/FusedBatchNormV3__528' Status Message: CUDA error cudaErrorInvalidConfiguration:invalid configuration argument

Our old ONNX models, which were exported from TF 1.15, run through the same code without an error; models from TF 2.4.0 do not.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • ONNX Runtime installed from (source or binary): binary
  • ONNX Runtime version: 1.7.0
  • Python version: 2.4.1
  • Visual Studio version (if applicable): 16.9.3
  • GCC/Compiler version (if compiling from source): v142
  • CUDA/cuDNN version: 11.0
  • GPU model and memory: RTX 3090, 24 GB

To Reproduce

  • The EfficientDetD0 ONNX model is attached: model.zip

Expected behavior
There should be no cudaErrorInvalidConfiguration.

@hariharans29
Member

I didn't find a Transpose ending with __528. Instead, I found one ending with __527, so I am assuming that is the one.

It seems like that Transpose works with 5D tensors, so I am guessing it enters the non-cuBLAS/non-3D/non-4D implementation of ours (the "generic" path, at TArray<int64_t> input_strides(new_rank);). I wonder if it is triggering something there.
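
For context, a minimal standalone CUDA sketch (my illustration, not ORT's actual kernel code) of how an out-of-range launch configuration produces exactly this error:

```cuda
// Sketch only (not ORT's kernel): a launch whose block dimensions
// violate the device limits fails with cudaErrorInvalidConfiguration
// before the kernel ever runs.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void Noop() {}

int main() {
  dim3 block(1, 1, 128);  // invalid: the block z-dimension is capped at 64
  Noop<<<dim3(1), block>>>();
  cudaError_t err = cudaGetLastError();
  printf("%s\n", cudaGetErrorString(err));  // "invalid configuration argument"
  return 0;
}
```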

@hariharans29 added the ep:CUDA and type:bug labels Apr 12, 2021
@hariharans29
Member

Btw - can you run the model with just the CPU execution provider and check whether the results are okay? That will rule out any model issues. Also, to prioritize accordingly: is there a timeline you are looking at to have this resolved?
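
For illustration, a minimal C++ sketch of such a CPU-only check (the model path is a placeholder and error handling is omitted) - not appending the CUDA execution provider keeps the whole run on the default CPU provider:

```cpp
// Minimal sketch of a CPU-only run: simply do not append the CUDA
// execution provider. "model.onnx" is a placeholder path.
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "cpu_check");
  Ort::SessionOptions opts;
  // The GPU variant would additionally call, before creating the session:
  //   OrtSessionOptionsAppendExecutionProvider_CUDA(opts, 0);
  Ort::Session session(env, L"model.onnx", opts);  // wide-char path on Windows
  // ... build the [1,512,512,3] input tensor and call session.Run(...) ...
  return 0;
}
```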

@hariharans29 self-assigned this Apr 13, 2021
@bdnkth
Author

bdnkth commented Apr 13, 2021

Hi,
thank you for the fast reply. Indeed, it is __527; I mixed it up with __528, which comes from a model with a different input layer that I used in some experiments to rule out the input size as the root of this problem.

The model runs on the CPU without an error; only the CUDA execution fails with the cudaErrorInvalidConfiguration.
Regarding the timeline: at the moment the CPU version is sufficient to test our pipeline. However, runtime is crucial for us, so we have to use the CUDA version, as the CPU execution time is too high. We will run runtime benchmarks in the near future to see whether our pipeline and our models stay below our maximum execution time. So it's not super urgent for our immediate timeline, but it is important for the near future.

@hariharans29
Member

I figured out the issue - it triggers a corner-case bug in our Transpose CUDA implementation.

I'll need to think about the fix and I'll send it out shortly.

@hariharans29
Member

Can you build from source and make sure the bug fix works with all your models?

@bdnkth
Author

bdnkth commented Apr 14, 2021

The bug fix works and the models run without an error. However, if the ONNX model is executed on CPU and GPU, the results are not the same for the same input; they differ by up to 10%. Furthermore, GPU execution has some deviation in its results: they differ by up to 5% across runs on the same input. The CPU version produces the same results as the original TensorFlow model executed in TensorFlow on the same input.

I'm currently investigating whether that problem is caused by our code or by something else.

I'm providing some output results to show the problem:
Expected output (CPU and TensorFlow output): 0.0774431; 0.242312; 1; 0.891305

  1. GPU run: 0; 0.236901; 0.919738; 1
  2. GPU run: 0; 0.25514; 0.865623; 1
  3. GPU run: 0; 0.241076; 0.908444; 1
  4. GPU run: 0.0508458; 0.164875; 0.937259; 1

I did some further experiments to find the root cause. The coordinates of the bounding boxes are different because the scores for each bounding box are completely different.

CPU: 1; 0.003711917; 0.0343419; 0.0322074
GPU: 0.727154; 0.408029; 0.333677; 0.311356

So the CPU execution is super confident about its prediction of the bounding boxes, while the GPU version is not. If the model is executed in TensorFlow on the GPU, it has the same results as the ONNX CPU run.
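
For reference, a tiny standalone C++ sketch of the element-wise comparison I'm doing, seeded with the numbers quoted above as placeholder data:

```cpp
// Standalone sketch: largest element-wise deviation between two
// flattened outputs, seeded with the CPU/GPU numbers quoted above.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

float MaxAbsDiff(const std::vector<float>& a, const std::vector<float>& b) {
  float max_diff = 0.0f;
  for (size_t i = 0; i < std::min(a.size(), b.size()); ++i)
    max_diff = std::max(max_diff, std::fabs(a[i] - b[i]));
  return max_diff;
}

int main() {
  std::vector<float> cpu = {0.0774431f, 0.242312f, 1.0f, 0.891305f};
  std::vector<float> gpu = {0.0f, 0.236901f, 0.919738f, 1.0f};
  printf("max abs diff = %f\n", MaxAbsDiff(cpu, gpu));  // ~0.109
  return 0;
}
```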

Further addition:
I've done some checks with a different model, which was written from scratch in Keras and TensorFlow 1.15. That model produces the same output on CPU, CUDA, and TensorRT.

@hariharans29
Member

Thanks for taking the time to verify the change.

It could be that my fix has a bug - I'll get back to you on this.

@bdnkth
Author

bdnkth commented Apr 15, 2021

Could this be caused by the NonMaxSuppression? The net should only detect one bounding box per image (there is only one object per image in our dataset at the moment). Since the net detects multiple objects when running on the GPU, our guess is that this might be the problem.

Unfortunately, I wanted to do some experiments with a different self-trained object detection model from TensorFlow's model zoo, but I ran into a bug in tf2onnx.

@bdnkth
Author

bdnkth commented Apr 22, 2021

Do you have any updates regarding this problem? Or is there any additional information you need from me?

@hariharans29
Member

Thanks for the reminder. I think I have all that I need from you. I need to refine my change a bit, but I have been busy with some high-priority work items recently. I will make sure a fix makes it into our next release.

@hariharans29
Member

Hi @bdnkth - Can you try re-building the fix branch and testing your models again? Thanks.

@bdnkth
Author

bdnkth commented May 27, 2021

The EfficientDet now runs as expected on the GPU. I get the same results as with the CPU execution, and the execution time is fine. I had some runtime problems with CUDA yesterday, but they are fixed now. Some DLL was not built properly; I don't know whether that was a problem with my build pipeline or whether your last commits fixed it.
Thank you for providing a working fix for the problem.

@hariharans29
Member

Thanks for testing the fix.

@zx-lhb

zx-lhb commented May 18, 2023

Hi, I met the same problem when using an ONNX model converted by tf2onnx for inference. How can I fix it? Can you help me?

2023-05-18 14:21:33.2750436 [E:onnxruntime:, sequential_executor.cc:346 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running Transpose node. Name:'StatefulPartitionedCall/attnGateVnet3d/conv3d_48/Conv3D__890' Status Message: CUDA error cudaErrorInvalidConfiguration:invalid configuration argument
