cudaErrorInvalidConfiguration in FusedBatchNormV3 #7316
I didn't find a Transpose that ended with __528. Instead I found one ending with __527, so I am assuming that is the one. It seems the Transpose works with 5D tensors, so I am guessing it enters the non-cuBLAS, non-3D/non-4D implementation of ours (the "generic" path).
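For context, tf2onnx inserts layout-permuting Transpose nodes around ops like FusedBatchNormV3, and with 3D data these act on 5D tensors, which is what routes execution into the generic (non-3D/non-4D) CUDA path mentioned above. A minimal numpy sketch of the kind of permutation involved (the shape is illustrative, not taken from the model):

```python
import numpy as np

# NDHWC -> NCDHW, the usual TF-to-ONNX layout permutation for 3D data.
# The case in this issue is analogous: a 5D tensor hits the "generic"
# transpose kernel instead of the specialized 3D/4D paths.
x = np.random.rand(1, 8, 16, 16, 32).astype(np.float32)  # N, D, H, W, C
perm = (0, 4, 1, 2, 3)
y = np.transpose(x, perm)
print(y.shape)  # channels move to axis 1
```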
By the way, can you run the model with just the CPU and check whether the results are okay? That will rule out any model issues and help us prioritize accordingly. Is there a timeline you are looking at to have these resolved?
Hi, the model runs on the CPU without an error; only the CUDA execution fails with the cudaErrorInvalidConfiguration.
I figured out the issue: it triggers a corner-case bug in our Transpose CUDA implementation. I'll need to think about the fix, and I'll send it out shortly.
Can you build from source and make sure the bug fix works with all your models?
The bug fix works and the models run without an error. However, if the ONNX model is executed on CPU and GPU, the results are not the same for the same input; they differ by up to 10%. Furthermore, GPU execution has some deviations in its results: they differ by up to 5% across different runs on the same input. The CPU version produces the same results as the original TensorFlow model executed in TensorFlow on the same input. I'm currently investigating whether the problem is caused by our code or by something else. I'm providing some output results to show the problem:
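One way to quantify the CPU/GPU divergence described above is to run the same input through a CPU and a CUDA session and compare the outputs element-wise. The helper below is pure numpy; the session setup is only sketched in comments, since it needs the model file and a CUDA build (`model.onnx` and `image` are placeholders):

```python
import numpy as np

def max_rel_diff(a: np.ndarray, b: np.ndarray, eps: float = 1e-7) -> float:
    """Largest element-wise relative difference between two result tensors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.max(np.abs(a - b) / (np.maximum(np.abs(a), np.abs(b)) + eps)))

# Sketch of the comparison (requires onnxruntime-gpu and the model file):
# import onnxruntime as ort
# cpu = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
# gpu = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
# feed = {cpu.get_inputs()[0].name: image}
# print(max_rel_diff(cpu.run(None, feed)[0], gpu.run(None, feed)[0]))
```

A value around 0.1 would correspond to the 10% divergence reported above.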
Did some further experiments to find the root cause. The coordinates for the bounding boxes are different because the scores for each bounding box are completely different. CPU: 1; 0.003711917; 0.0343419; 0.0322074. So the CPU execution is very confident about its prediction of the bounding boxes, while the GPU version is not. If the model is executed in TensorFlow on the GPU, it has the same results as ONNX on the CPU. Further addition:
Thanks for taking the time to verify the change. It could be that my fix has a bug; I'll get back to you on this.
Could this be caused by the NonMaxSuppression? The net should only detect one bounding box per image (there is only one object per image in our dataset at the moment). Since the net detects multiple objects when running on the GPU, we suspect that this might be the problem. Unfortunately, I wanted to do some experiments with a different self-trained object detection model from TensorFlow's model zoo, but I ran into a bug in tf2onnx.
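To illustrate the NonMaxSuppression hypothesis: if the GPU produces different scores than the CPU, a score threshold can keep boxes that the CPU run would have discarded, so the detection counts diverge even when the box coordinates are similar. A toy greedy-NMS sketch (not ONNX Runtime's implementation; thresholds and boxes are made up):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Greedy NMS: drop low-scoring boxes, then suppress overlaps."""
    keep = []
    for i in np.argsort(scores)[::-1]:
        if scores[i] < score_thresh:
            continue
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
# Confident, CPU-like scores keep one box; flatter, GPU-like scores let
# an extra, non-overlapping box past the score threshold.
print(nms(boxes, np.array([0.99, 0.01, 0.01])))  # single detection
print(nms(boxes, np.array([0.60, 0.55, 0.58])))  # extra detection appears
```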
Do you have any updates regarding this problem? Or is there additional information you need from me?
Thanks for the reminder. I think I have all that I need from you. I need to refine my change a bit, but I have been busy with some high-priority work items recently. I will make sure a fix makes it into our next release.
Hi @bdnkth, can you try re-building the fix branch and testing your models again? Thanks.
The EfficientDet now runs as expected on the GPU. I get the same results as with the CPU execution, and the execution time is fine. I had some runtime problems with CUDA yesterday, but they are fixed now. Some DLL was not built properly; I don't know whether that was a problem with my build pipeline or whether your last commits fixed it.
Thanks for testing the fix. |
Hi, I ran into the same problem when using an ONNX model converted by tf2onnx for inference. How can I fix it? Can you help me? 2023-05-18 14:21:33.2750436 [E:onnxruntime:, sequential_executor.cc:346 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running Transpose node. Name:'StatefulPartitionedCall/attnGateVnet3d/conv3d_48/Conv3D__890' Status Message: CUDA error cudaErrorInvalidConfiguration:invalid configuration argument
Describe the bug
I have custom-trained the EfficientDet-D0 model from the TensorFlow model zoo for object detection and exported it to ONNX with tf2onnx, using opset 11 and a fixed input shape of [1,512,512,3].
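The export step described above can be sketched with the tf2onnx CLI. The paths and the input tensor name below are hypothetical placeholders; adjust them to your own SavedModel export:

```shell
# Convert the TF2 SavedModel to ONNX with opset 11 and a fixed NHWC
# input shape, matching the setup described in this report.
# "input_tensor:0" and the paths are assumptions, not taken from the issue.
python -m tf2onnx.convert \
  --saved-model ./efficientdet_d0/saved_model \
  --output efficientdet_d0.onnx \
  --opset 11 \
  --inputs input_tensor:0[1,512,512,3]
```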
Using that ONNX model in ONNX Runtime for C++, I run into a cudaErrorInvalidConfiguration in FusedBatchNormV3_528 of the EfficientDet.
The full error message is:
2021-04-12 14:58:25.6606830 [E:onnxruntime:, sequential_executor.cc:339 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running Transpose node. Name:'StatefulPartitionedCall/EfficientDet-D0/bifpn/node_03/1_dn_lvl_5/input_0_up_lvl_5/1x1_pre_sample/batchnorm/FusedBatchNormV3__528' Status Message: CUDA error cudaErrorInvalidConfiguration:invalid configuration argument
Our old ONNX models, which were made with TF 1.15, run through the code without an error. TF 2.4.0 models do not.
System information
To Reproduce
Expected behavior
There should be no cudaErrorInvalidConfiguration