
cudaErrorInvalidConfiguration in FusedBatchNormV3 #7316

Closed · bdnkth opened this issue Apr 12, 2021 · 14 comments · Fixed by #7329

Labels: ep:CUDA (issues related to the CUDA execution provider)

bdnkth commented Apr 12, 2021

Describe the bug
I custom-trained the EfficientDetD0 model from the TensorFlow model zoo for object detection and exported it to ONNX with tf2onnx, using opset 11 and a fixed input shape of [1,512,512,3].
Using that ONNX model in ONNX Runtime from C++, I run into a cudaErrorInvalidConfiguration in FusedBatchNormV3__528 of the EfficientDet.
The full error message is:
2021-04-12 14:58:25.6606830 [E:onnxruntime:, sequential_executor.cc:339 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running Transpose node. Name:'StatefulPartitionedCall/EfficientDet-D0/bifpn/node_03/1_dn_lvl_5/input_0_up_lvl_5/1x1_pre_sample/batchnorm/FusedBatchNormV3__528' Status Message: CUDA error cudaErrorInvalidConfiguration:invalid configuration argument

Our old ONNX models, which were exported from TF 1.15, run through the same code without an error; models from TF 2.4.0 do not.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • ONNX Runtime installed from (source or binary): binary
  • ONNX Runtime version: 1.7.0
  • Python version: 2.4.1
  • Visual Studio version (if applicable): 16.9.3
  • GCC/Compiler version (if compiling from source): v142
  • CUDA/cuDNN version: 11.0
  • GPU model and memory: RTX 3090, 24 GB

To Reproduce

  • The EfficientDetD0 ONNX model is attached: model.zip

Expected behavior
There should be no cudaErrorInvalidConfiguration.

@hariharans29
Member

I didn't find a Transpose ending with __528. Instead, I found one ending with __527, so I am assuming that is the one.

It seems like that Transpose works with 5D tensors, so I am guessing it enters the non-cuBLAS/non-3D/non-4D implementation of ours (the "generic" path, at TArray<int64_t> input_strides(new_rank);). I wonder if it is triggering something there.
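
For context, a minimal standalone CUDA sketch (my illustration, not ORT's actual kernel code) of how an out-of-range launch configuration produces exactly this error:

```cuda
// Sketch only (not ORT's kernel): a launch whose block dimensions
// violate the device limits fails with cudaErrorInvalidConfiguration
// before the kernel ever runs.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void Noop() {}

int main() {
  dim3 block(1, 1, 128);  // invalid: the block z-dimension is capped at 64
  Noop<<<dim3(1), block>>>();
  cudaError_t err = cudaGetLastError();
  printf("%s\n", cudaGetErrorString(err));  // "invalid configuration argument"
  return 0;
}
```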

@hariharans29 added the ep:CUDA and type:bug labels Apr 12, 2021
@hariharans29
Member

Btw - can you run the model with just the CPU execution provider and check whether the results are okay? That will rule out any model issues. Also, to prioritize accordingly: is there a timeline you are looking at to have this resolved?
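
For illustration, a minimal C++ sketch of such a CPU-only check (the model path is a placeholder and error handling is omitted) - not appending the CUDA execution provider keeps the whole run on the default CPU provider:

```cpp
// Minimal sketch of a CPU-only run: simply do not append the CUDA
// execution provider. "model.onnx" is a placeholder path.
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "cpu_check");
  Ort::SessionOptions opts;
  // The GPU variant would additionally call, before creating the session:
  //   OrtSessionOptionsAppendExecutionProvider_CUDA(opts, 0);
  Ort::Session session(env, L"model.onnx", opts);  // wide-char path on Windows
  // ... build the [1,512,512,3] input tensor and call session.Run(...) ...
  return 0;
}
```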

@hariharans29 self-assigned this Apr 13, 2021
@bdnkth
Author

bdnkth commented Apr 13, 2021

Hi,
thank you for the fast reply. Indeed, it is __527; I mixed it up with __528, which comes from a model with a different input layer that I used in some experiments to rule out the input size as the root of this problem.

The model runs on the CPU without an error; only the CUDA execution fails with the cudaErrorInvalidConfiguration.
Regarding the timeline: at the moment the CPU version is sufficient to test our pipeline. However, runtime is crucial for us, so we have to use the CUDA version, as the CPU execution time is too high. We will run runtime benchmarks in the near future to see whether our pipeline and our models stay below our maximum execution time. So it's not super urgent for our immediate timeline, but it is important for the near future.

@hariharans29
Member

I figured out the issue - it triggers a corner-case bug in our Transpose CUDA implementation.

I'll need to think about the fix and I'll send it out shortly.

@hariharans29
Member

Can you build from source and make sure the bug fix works with all your models?

@bdnkth
Author

bdnkth commented Apr 14, 2021

The bug fix works and the models run without an error. However, if the ONNX model is executed on CPU and GPU, the results are not the same for the same input; they differ by up to 10%. Furthermore, GPU execution has some deviation in its results: they differ by up to 5% across runs on the same input. The CPU version produces the same results as the original TensorFlow model executed in TensorFlow on the same input.

I'm currently investigating whether that problem is caused by our code or by something else.

I'm providing some output results to show the problem:
Expected output (CPU and TensorFlow output): 0.0774431; 0.242312; 1; 0.891305

  1. GPU run: 0; 0.236901; 0.919738; 1
  2. GPU run: 0; 0.25514; 0.865623; 1
  3. GPU run: 0; 0.241076; 0.908444; 1
  4. GPU run: 0.0508458; 0.164875; 0.937259; 1

I did some further experiments to find the root cause. The coordinates of the bounding boxes are different because the scores for each bounding box are completely different.

CPU: 1; 0.003711917; 0.0343419; 0.0322074
GPU: 0.727154; 0.408029; 0.333677; 0.311356

So the CPU execution is super confident about its prediction of the bounding boxes, while the GPU version is not. If the model is executed in TensorFlow on the GPU, it has the same results as the ONNX CPU run.
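
For reference, a tiny standalone C++ sketch of the element-wise comparison I'm doing, seeded with the numbers quoted above as placeholder data:

```cpp
// Standalone sketch: largest element-wise deviation between two
// flattened outputs, seeded with the CPU/GPU numbers quoted above.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

float MaxAbsDiff(const std::vector<float>& a, const std::vector<float>& b) {
  float max_diff = 0.0f;
  for (size_t i = 0; i < std::min(a.size(), b.size()); ++i)
    max_diff = std::max(max_diff, std::fabs(a[i] - b[i]));
  return max_diff;
}

int main() {
  std::vector<float> cpu = {0.0774431f, 0.242312f, 1.0f, 0.891305f};
  std::vector<float> gpu = {0.0f, 0.236901f, 0.919738f, 1.0f};
  printf("max abs diff = %f\n", MaxAbsDiff(cpu, gpu));  // ~0.109
  return 0;
}
```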

Further addition:
I've done some checks with a different model, which was written from scratch in Keras and TensorFlow 1.15. That model produces the same output on CPU, CUDA, and TensorRT.

@hariharans29
Member

Thanks for taking the time to verify the change.

It could be that my fix has a bug - I'll get back to you on this.

@bdnkth
Author

bdnkth commented Apr 15, 2021

Could this be caused by the NonMaxSuppression? The net should only detect one bounding box per image (there is only one object per image in our dataset at the moment). Since the net detects multiple objects when running on the GPU, our guess is that this might be the problem.

Unfortunately, I wanted to do some experiments with a different self-trained object detection model from TensorFlow's model zoo, but I ran into a bug in tf2onnx.

@bdnkth
Author

bdnkth commented Apr 22, 2021

Do you have any updates regarding this problem? Or is there any additional information you need from me?

@hariharans29
Member

Thanks for the reminder. I think I have all that I need from you. I need to refine my change a bit, but I have been busy with some high-priority work items recently. I will make sure a fix makes it into our next release.

@hariharans29
Member

Hi @bdnkth - Can you try re-building the fix branch and testing your models again? Thanks.

@bdnkth
Author

bdnkth commented May 27, 2021

The EfficientDet now runs as expected on the GPU. I get the same results as with the CPU execution, and the execution time is fine. I had some runtime problems with CUDA yesterday, but they are fixed now. Some DLL was not built properly; I don't know whether that was a problem with my build pipeline or whether your last commits fixed it.
Thank you for providing a working fix for the problem.

@hariharans29
Member

Thanks for testing the fix.

@zx-lhb

zx-lhb commented May 18, 2023

Hi, I met the same problem when using an ONNX model converted by tf2onnx for inference. How can I fix it? Can you help me?

2023-05-18 14:21:33.2750436 [E:onnxruntime:, sequential_executor.cc:346 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running Transpose node. Name:'StatefulPartitionedCall/attnGateVnet3d/conv3d_48/Conv3D__890' Status Message: CUDA error cudaErrorInvalidConfiguration:invalid configuration argument
