
Build onnxruntime on arm64 linux with CUDA EP #16263

Closed
GilbertPan97 opened this issue Jun 7, 2023 · 16 comments
Labels
build (build issues; typically submitted using template), ep:CUDA (issues related to the CUDA execution provider), feature request (request for unsupported feature or enhancement)

Comments

@GilbertPan97

Describe the issue

I am trying to perform model inference on the arm64 Linux platform; however, I can't find a pre-built version suitable for running on the GPU (v1.12.1). Is there any other solution, or what do I need to pay attention to if I want to compile the GPU version of onnxruntime to run on arm64 Linux?

To reproduce

This is a question about model inference with the GPU on the arm64 Linux platform. I would really appreciate it if you could answer it.

Urgency

No response

Platform

Linux

OS Version

20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.12.1

ONNX Runtime API

C++

Architecture

ARM64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.4

@github-actions github-actions bot added the ep:CUDA label Jun 7, 2023
@B-LechCode

I would suggest cloning this repository and checking out the release suitable for you, then building it yourself according to: https://onnxruntime.ai/docs/build/inferencing.html

Take care to install CUDA and cuDNN properly:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
Be sure to pick the right CUDA version for your GPU driver and ORT; the same applies to cuDNN.

If you install the NVIDIA stack correctly, compiling should be painless, even though your CPU is probably slow.

I'd suggest building CPU-only first, then building the CUDA EP, as sketched below:
https://onnxruntime.ai/docs/build/eps.html#cuda
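
A minimal sketch of that flow (the release tag and the cuDNN location are assumptions; adjust them to your setup):

# Clone and check out the release you need (v1.12.1 as an example)
git clone --recursive https://github.com/microsoft/onnxruntime.git
cd onnxruntime
git checkout v1.12.1

# 1) CPU-only build first, to validate the toolchain
./build.sh --build_shared_lib --config Release --parallel

# 2) Then rebuild with the CUDA EP (assuming CUDA under /usr/local/cuda
#    and cuDNN installed into the same prefix)
./build.sh --build_shared_lib --config Release --parallel \
  --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda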

@snnn
Member

snnn commented Jun 7, 2023

We don't have such prebuilt packages. CUDA on Linux ARM64 has two variants:

  1. SBSA
  2. Jetson

Though we may add support for SBSA, we cannot test it: all our build servers are in Azure, and Azure doesn't have such SKUs.

So I would suggest building it from source, and let us know if there are any build errors.
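
For anyone unsure which of the two variants they have, a quick check (a sketch; /etc/nv_tegra_release is created by L4T, so it exists only on Jetson devices):

# Present on Jetson (L4T/JetPack), absent on SBSA servers:
cat /etc/nv_tegra_release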

@snnn snnn added the feature request and build labels Jun 7, 2023
@snnn snnn changed the title from "Build onnxruntime on arm64 linux" to "Build onnxruntime on arm64 linux with CUDA EP" Jun 7, 2023
@GilbertPan97
Author

GilbertPan97 commented Jun 8, 2023

We don't have such prebuilt packages. CUDA on Linux ARM64 has two variants:

1. SBSA

2. Jetson

Though we may add support for SBSA, we cannot test it: all our build servers are in Azure, and Azure doesn't have such SKUs.

So I would suggest building it from source, and let us know if there are any build errors.

Thank you for your reply. I am trying to compile onnxruntime-gpu on the arm64 Linux platform with:

./build.sh --build_shared_lib --config Release --use_cuda --cudnn_home "/usr/local/cuda-11.4" --cuda_home "/usr/local/cuda-11.4"

No errors are reported during compilation. However, when I run the test programs afterwards, three of the six tests fail. Here is the test log: TestLog.txt

Then I tried to use the compiled library to run MaskRCNN model inference; it executes correctly on the CPU, but returns the same error as in the test log when using CUDA:

INFO: Model input name-[0] is: in_imgs  
INFO: Model output name-[0] is: out_boxes  
INFO: Model output name-[1] is: out_classes  
INFO: Model output name-[2] is: out_scores  
INFO: Model output name-[3] is: out_masks  
INFO: Succeed loading model  
INFO: All inference images: 10  
INFO: inference at img: 7_color.png  
2023-06-08 11:18:17.064007337 [E:onnxruntime:, sequential_executor.cc:368 Execute] Non-zero status code returned while running Sub node. Name:'Sub_3' Status Message: CUDA error cudaErrorNoKernelImageForDevice:no kernel image is available for execution on the device  
Error: Non-zero status code returned while running Sub node. Name:'Sub_3' Status Message: CUDA error cudaErrorNoKernelImageForDevice:no kernel image is available for execution on the device  
Gtk-Message: 11:18:44.239: Failed to load module "canberra-gtk-module"  
INFO: inference at img: 6_color.png  
2023-06-08 11:18:52.159243110 [E:onnxruntime:, sequential_executor.cc:368 Execute] Non-zero status code returned while running Sub node. Name:'Sub_3' Status Message: CUDA error cudaErrorNoKernelImageForDevice:no kernel image is available for execution on the device  
Error: Non-zero status code returned while running Sub node. Name:'Sub_3' Status Message: CUDA error cudaErrorNoKernelImageForDevice:no kernel image is available for execution on the device

Moreover, when running inference on the GPU, it takes far longer to load the model than on the CPU, which is not normal. Do you know what the problem is?

The CUDA environment is as follows:

nvidia@nvidianvidia:~$ ls -l /usr/local/ | grep cuda
lrwxrwxrwx  1 root root   22 Aug 19  2022 cuda -> /etc/alternatives/cuda
lrwxrwxrwx  1 root root   25 Aug 19  2022 cuda-11 -> /etc/alternatives/cuda-11
drwxr-xr-x 11 root root 4096 Feb 28 18:45 cuda-11.4

nvidia@nvidianvidia:~$ cat /usr/local/cuda-11.4/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 4

nvidia@nvidianvidia:~$ echo $PATH
/usr/local/cmake-3.26.4-linux-aarch64/bin:/usr/local/cuda-11.4/bin:/home/nvidia/.local/bin:/home/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

nvidia@nvidianvidia:~$ echo $LD_LIBRARY_PATH
/usr/local/cuda-11.4/lib64:

@snnn
Member

snnn commented Jun 8, 2023

Which GPU do you have on the device?

@GilbertPan97
Author

Which GPU do you have on the device?

It has 1792 NVIDIA CUDA cores and 56 Tensor Cores (Ampere). The CUDA samples run correctly, as follows:

[./alignedTypes] - Starting...
GPU Device 0: "Ampere" with compute capability 8.7

[Orin] has 14 MP(s) x 128 (Cores/MP) = 1792 (Cores)
> Compute scaling value = 1.00
> Memory Size = 49999872
Allocating memory...
Generating host input data array...
Uploading input data to GPU memory...
Testing misaligned types...
uint8...
Avg. time: 3.335375 ms / Copy throughput: 13.961251 GB/s.
	TEST OK
uint16...
Avg. time: 1.649063 ms / Copy throughput: 28.237868 GB/s.
	TEST OK
RGBA8_misaligned...
Avg. time: 1.212250 ms / Copy throughput: 38.412877 GB/s.
	TEST OK
LA32_misaligned...
Avg. time: 0.965500 ms / Copy throughput: 48.229943 GB/s.
	TEST OK
RGB32_misaligned...
Avg. time: 1.101094 ms / Copy throughput: 42.290685 GB/s.
	TEST OK
RGBA32_misaligned...
Avg. time: 1.020781 ms / Copy throughput: 45.618009 GB/s.
	TEST OK
Testing aligned types...
RGBA8...
Avg. time: 1.264625 ms / Copy throughput: 36.821992 GB/s.
	TEST OK
I32...
Avg. time: 1.231625 ms / Copy throughput: 37.808595 GB/s.
	TEST OK
LA32...
Avg. time: 1.111063 ms / Copy throughput: 41.911241 GB/s.
	TEST OK
RGB32...
Avg. time: 1.096000 ms / Copy throughput: 42.487237 GB/s.
	TEST OK
RGBA32...
Avg. time: 1.041906 ms / Copy throughput: 44.693090 GB/s.
	TEST OK
RGBA32_2...
Avg. time: 1.101938 ms / Copy throughput: 42.258302 GB/s.
	TEST OK

[alignedTypes] -> Test Results: 0 Failures
Shutting down...
Test passed

@GilbertPan97
Author

I rebuilt onnxruntime_providers_cuda; although no errors are reported, many warnings are thrown, such as:

[ 52%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/home/nvidia/Documents/onnxruntime/onnxruntime/core/providers/cuda/activation/activations_impl.cu.o
/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

[ 52%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/home/nvidia/Documents/onnxruntime/onnxruntime/core/providers/cuda/cuda_utils.cu.o
/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

[ 52%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/home/nvidia/Documents/onnxruntime/onnxruntime/core/providers/cuda/fpgeneric.cu.o
/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

[ 53%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/home/nvidia/Documents/onnxruntime/onnxruntime/core/providers/cuda/generator/random_impl.cu.o
/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

[ 53%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/home/nvidia/Documents/onnxruntime/onnxruntime/core/providers/cuda/generator/range_impl.cu.o
/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

[ 53%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/home/nvidia/Documents/onnxruntime/onnxruntime/core/providers/cuda/math/binary_elementwise_ops_impl.cu.o
/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

/home/nvidia/Documents/onnxruntime/onnxruntime/gsl/gsl-lite.hpp(1959): warning: calling a __host__ function from a __host__ __device__ function is not allowed

@snnn
Member

snnn commented Jun 8, 2023

They are warnings, not errors. The latest code doesn't have gsl-lite.hpp anymore.

@snnn
Member

snnn commented Jun 8, 2023

Since your GPU has compute capability 8.7, you can add --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=87 to your build command; nvcc will then generate device code for your GPU.
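
For example, applied to the build command above (a sketch; the cuobjdump check assumes the CUDA toolkit binaries are on PATH):

./build.sh --build_shared_lib --config Release --use_cuda \
  --cuda_home /usr/local/cuda-11.4 --cudnn_home /usr/local/cuda-11.4 \
  --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=87

# Verify that sm_87 device code was actually embedded:
cuobjdump --list-elf build/Linux/Release/libonnxruntime_providers_cuda.so | grep sm_87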

@GilbertPan97
Author

Since your GPU has compute capability 8.7, you can add --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=87 to your build command; nvcc will then generate device code for your GPU.

I followed your comment and added CMAKE_CUDA_ARCHITECTURES=87 to my CMake project, but it still reported the same error when executing inference:

[E:onnxruntime:, sequential_executor.cc:368 Execute] Non-zero status code returned while running Sub node. Name:'Sub_3' Status Message: CUDA error cudaErrorNoKernelImageForDevice:no kernel image is available for execution on the device  
Error: Non-zero status code returned while running Sub node. Name:'Sub_3' Status Message: CUDA error cudaErrorNoKernelImageForDevice:no kernel image is available for execution on the device  

I then tried to execute with TensorRT, so I compiled onnxruntime with the TensorRT EP:

./build.sh --build_shared_lib --config Release --parallel --use_cuda --cudnn_home "/home/nvidia/Downloads/cudnn-11.4-linux-aarch64sbsa-v8.2.4.15/cuda" --cuda_home "/usr/local/cuda-11.4" --tensorrt_home "/home/nvidia/Downloads/TensorRT-8.2.5.1.Ubuntu-20.04.aarch64-gnu.cuda-11.4.cudnn8.2/TensorRT-8.2.5.1"

I got the following compilation error:

[ 63%] Built target onnxruntime_providers_cuda
make: *** [Makefile:166: all] Error 2
Traceback (most recent call last):
  File "/home/nvidia/Documents/onnxruntime/tools/ci_build/build.py", line 2744, in <module>
    sys.exit(main())
  File "/home/nvidia/Documents/onnxruntime/tools/ci_build/build.py", line 2663, in main
    build_targets(args, cmake_path, build_dir, configs, num_parallel_jobs, args.target)
  File "/home/nvidia/Documents/onnxruntime/tools/ci_build/build.py", line 1301, in build_targets
    run_subprocess(cmd_args, env=env)
  File "/home/nvidia/Documents/onnxruntime/tools/ci_build/build.py", line 714, in run_subprocess
    return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
  File "/home/nvidia/Documents/onnxruntime/tools/python/util/run.py", line 49, in run
    completed_process = subprocess.run(
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/local/cmake-3.26.4-linux-aarch64/bin/cmake', '--build', '/home/nvidia/Documents/onnxruntime/build/Linux/Release', '--config', 'Release', '--', '-j8']' returned non-zero exit status 2.
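
The parallel (-j8) output above hides the first real compiler message; re-running the failing CMake build single-threaded surfaces it (a sketch reusing the paths from the traceback):

/usr/local/cmake-3.26.4-linux-aarch64/bin/cmake --build \
  /home/nvidia/Documents/onnxruntime/build/Linux/Release --config Release -- -j1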

@snnn
Member

snnn commented Jun 12, 2023

May I know what device you have? Is it an ARM server with NVIDIA GPUs, or a Jetson?

@GilbertPan97
Author

GilbertPan97 commented Jun 13, 2023

May I know what device you have? Is it an ARM server with NVIDIA GPUs, or a Jetson?

A Jetson: the NVIDIA Jetson AGX Orin.
I made a mistake, since I was using cuDNN-sbsa 8.2.4.

I found that the NVIDIA cuDNN archive does not provide cuDNN for Jetson separately; it has to be installed through JetPack, and there is no JetPack version corresponding to CUDA 11.4 and cuDNN 8.2.4 (most versions of onnxruntime are compatible with CUDA 11.4 and cuDNN 8.2.4, as shown in the CUDA Execution Provider requirements). Is there any other solution for this environment problem?

Jetpack Archive: https://developer.nvidia.com/embedded/jetpack-archive

Currently, the CUDA environment is as follows:

- NVIDIA Jetson AGX Orin (Module Version)
 * Jetpack 5.0.2 GA [L4T 35.1.0]
 * NV Power Mode: MAXN - Type: 0
 * jetson_stats.service: active
- Libraries:
 * CUDA: 11.4.239
 * cuDNN: 8.4.1
 * TensorRT: 8.4.1.5
 * Visionworks: NOT_INSTALLED
 * OpenCV: 4.5.4 compiled CUDA: NO
 * VPI: ii libnvvpi2 2.1.6 arm64 NVIDIA Vision Programming Interface library
 * Vulkan: 1.3.203

@snnn
Member

snnn commented Jun 13, 2023

Sorry, I don't have experience with Jetson. I searched around and found someone who had a similar issue: https://forums.developer.nvidia.com/t/issue-using-onnxruntime-with-cudaexecutionprovider-on-orin/219457/5 Would you try it?
And #16000 might also be related.

@GilbertPan97
Author

Sorry, I don't have experience with Jetson. I searched around and found someone who had a similar issue: https://forums.developer.nvidia.com/t/issue-using-onnxruntime-with-cudaexecutionprovider-on-orin/219457/5 Would you try it? And #16000 might also be related.

I tried the link you mentioned, and it did work on the NVIDIA Jetson AGX Orin. Thanks again for your help.

@snnn
Member

snnn commented Jun 14, 2023

Would you mind elaborating on what you changed? It seems we have SM87 in our cmake file:

https://github.com/microsoft/onnxruntime/blob/main/cmake/CMakeLists.txt#L1312

But why did it not work, and how did you make it work?

@GilbertPan97
Author

Would you mind elaborating on what you changed? It seems we have SM87 in our cmake file:

https://github.com/microsoft/onnxruntime/blob/main/cmake/CMakeLists.txt#L1312

But why did it not work, and how did you make it work?

The problem I encountered was caused by an unsuitable software setup on the NVIDIA developer kit (Jetson AGX Orin). The best way is to install the appropriate version of JetPack rather than installing CUDA, cuDNN, etc. separately, because cuDNN builds for Jetson are not provided on the archive page. After installing JetPack, just recompile onnxruntime (make sure SM87 is in the cmake file; in the earlier version 1.12 it needs to be added manually, e.g. via the build flag sketched below).
Jetpack archive link: https://developer.nvidia.com/embedded/jetpack-archive
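
A minimal sketch of that rebuild on a JetPack-flashed Orin (the cuDNN paths are assumptions; JetPack typically installs headers under /usr/include and libraries under /usr/lib/aarch64-linux-gnu):

# Rebuild ONNX Runtime against the JetPack-provided CUDA/cuDNN; passing
# CMAKE_CUDA_ARCHITECTURES=87 on the command line avoids hand-editing
# v1.12's CMakeLists.txt
./build.sh --build_shared_lib --config Release --parallel \
  --use_cuda --cuda_home /usr/local/cuda \
  --cudnn_home /usr/lib/aarch64-linux-gnu \
  --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=87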

@snnn
Member

snnn commented Jun 15, 2023

Thanks for the detailed explanation.

@snnn snnn closed this as completed Jun 15, 2023