Add support for Jetson Orin embedded cards (i.e. add "8.7" to TORCH_CUDA_ARCH_LIST in linux-aarch64) ? #303

Open
traversaro opened this issue Dec 11, 2024 · 15 comments
Labels
bug Something isn't working

Comments

@traversaro

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

Yesterday at work I tried to use the latest CUDA-enabled pytorch conda-forge package on a linux-aarch64 platform, the Jetson Orin (https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/, https://developer.nvidia.com/embedded/jetson-modules), which is a really popular board in robotics (at least for R&D), as it is a relatively compact embedded board with a CUDA-enabled GPU.

However, running this minimal snippet:

import torch

torch.cuda.is_available() # will return True

t = torch.rand(10) # create a random tensor of 10 elements
t.to("cuda") # try to move the tensor to the GPU; this should trigger the error

resulted in the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ergocub/miniforge3/envs/testenv/lib/python3.12/site-packages/torch/_tensor.py", line 523, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ergocub/miniforge3/envs/testenv/lib/python3.12/site-packages/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ergocub/miniforge3/envs/testenv/lib/python3.12/site-packages/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ergocub/miniforge3/envs/testenv/lib/python3.12/site-packages/torch/_tensor_str.py", line 357, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ergocub/miniforge3/envs/testenv/lib/python3.12/site-packages/torch/_tensor_str.py", line 146, in __init__
    tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The reason for the error is quite clear: the compute architecture used by the Jetson Orin is 8.7 (see https://developer.nvidia.com/cuda-gpus), and this architecture is not part of TORCH_CUDA_ARCH_LIST for linux-aarch64.
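For reference, a minimal sketch of how to confirm the mismatch from Python on the device itself, using the public torch.cuda APIs (the comments describe the values one would expect on an Orin, not verified output):

import torch

# Compute capability of the installed GPU: (8, 7) on a Jetson Orin
major, minor = torch.cuda.get_device_capability()
print(f"device: sm_{major}{minor}")

# Architectures this torch binary was actually compiled for
# (mirrors TORCH_CUDA_ARCH_LIST at build time)
print("compiled for:", torch.cuda.get_arch_list())
# If no entry covering sm_87 (an exact sm_87 cubin, or PTX the driver
# can JIT) appears in that list, CUDA kernels cannot run on this device.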

Indeed, I tried to quickly generate packages in which the architecture is enabled (code: https://github.com/traversaro/pytorch-cpu-traversaro-fork/tree/patch-1, binary packages: https://github.com/traversaro/pytorch-cpu-traversaro-fork/releases/tag/test-builds-on-aarch64-with-sm87), and that snippet works fine.

Given how widespread the use of Jetson Orin is (the whole Jetson series had sold ~1 million units as of 2023, see https://blogs.nvidia.com/blog/million-jetson-developers-gtc/), I would argue that it could make sense to add 8.7 to TORCH_CUDA_ARCH_LIST only on linux-aarch64, as 8.7 is a compute architecture not used by the discrete GPUs found on linux-64 systems (at least as far as I know).

While I understand that the set of enabled compute architectures is always a tradeoff between maintenance effort, CI time, and package size, I suspect the 8.7 architecture used in Jetson is more widespread on linux-aarch64 than (for example) 5.0: I doubt there are many GTX 750 GPUs installed in systems with an ARM CPU, even if that card is indeed supported by linux-aarch64 drivers (see "Supported Products" in https://www.nvidia.com/en-us/drivers/details/237590/).

Installed packages

_openmp_mutex      conda-forge/linux-aarch64::_openmp_mutex-4.5-2_gnu 
  attr               conda-forge/linux-aarch64::attr-2.5.1-h4e544f5_1 
  bzip2              conda-forge/linux-aarch64::bzip2-1.0.8-h68df207_7 
  ca-certificates    conda-forge/linux-aarch64::ca-certificates-2024.8.30-hcefe29a_0 
  cpython            conda-forge/noarch::cpython-3.13.1-py313hd8ed1ab_102 
  cuda-cudart        conda-forge/linux-aarch64::cuda-cudart-12.6.77-h3ae8b8a_0 
  cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-aarch64-12.6.77-h3ae8b8a_0 
  cuda-nvrtc         conda-forge/linux-aarch64::cuda-nvrtc-12.6.85-h5101a13_0 
  cuda-nvtx          conda-forge/linux-aarch64::cuda-nvtx-12.6.77-h5101a13_0 
  cuda-version       conda-forge/noarch::cuda-version-12.6-h7480c83_3 
  cudnn              conda-forge/linux-aarch64::cudnn-9.3.0.75-h125f736_1 
  filelock           conda-forge/noarch::filelock-3.16.1-pyhd8ed1ab_1 
  fsspec             conda-forge/noarch::fsspec-2024.10.0-pyhd8ed1ab_1 
  gmp                conda-forge/linux-aarch64::gmp-6.3.0-h0a1ffab_2 
  gmpy2              conda-forge/linux-aarch64::gmpy2-2.1.5-py313h0c041f1_3 
  jinja2             conda-forge/noarch::jinja2-3.1.4-pyhd8ed1ab_1 
  ld_impl_linux-aar~ conda-forge/linux-aarch64::ld_impl_linux-aarch64-2.43-h80caac9_2 
  libabseil          conda-forge/linux-aarch64::libabseil-20240722.0-cxx17_h5ad3122_1 
  libblas            conda-forge/linux-aarch64::libblas-3.9.0-25_linuxaarch64_openblas 
  libcap             conda-forge/linux-aarch64::libcap-2.71-h51d75a7_0 
  libcblas           conda-forge/linux-aarch64::libcblas-3.9.0-25_linuxaarch64_openblas 
  libcublas          conda-forge/linux-aarch64::libcublas-12.6.4.1-h5101a13_0 
  libcufft           conda-forge/linux-aarch64::libcufft-11.3.0.4-h5101a13_0 
  libcufile          conda-forge/linux-aarch64::libcufile-1.11.1.6-hd706c01_3 
  libcurand          conda-forge/linux-aarch64::libcurand-10.3.7.77-h5101a13_0 
  libcusolver        conda-forge/linux-aarch64::libcusolver-11.7.1.2-h5101a13_0 
  libcusparse        conda-forge/linux-aarch64::libcusparse-12.5.4.2-h5101a13_0 
  libexpat           conda-forge/linux-aarch64::libexpat-2.6.4-h5ad3122_0 
  libffi             conda-forge/linux-aarch64::libffi-3.4.2-h3557bc0_5 
  libgcc             conda-forge/linux-aarch64::libgcc-14.2.0-he277a41_1 
  libgcc-ng          conda-forge/linux-aarch64::libgcc-ng-14.2.0-he9431aa_1 
  libgcrypt-lib      conda-forge/linux-aarch64::libgcrypt-lib-1.11.0-h86ecc28_2 
  libgfortran        conda-forge/linux-aarch64::libgfortran-14.2.0-he9431aa_1 
  libgfortran5       conda-forge/linux-aarch64::libgfortran5-14.2.0-hb6113d0_1 
  libgomp            conda-forge/linux-aarch64::libgomp-14.2.0-he277a41_1 
  libgpg-error       conda-forge/linux-aarch64::libgpg-error-1.51-h05609ea_1 
  liblapack          conda-forge/linux-aarch64::liblapack-3.9.0-25_linuxaarch64_openblas 
  liblzma            conda-forge/linux-aarch64::liblzma-5.6.3-h86ecc28_1 
  libmagma           conda-forge/linux-aarch64::libmagma-2.8.0-hfc09aae_0 
  libmagma_sparse    conda-forge/linux-aarch64::libmagma_sparse-2.8.0-hfc09aae_0 
  libmpdec           conda-forge/linux-aarch64::libmpdec-4.0.0-h68df207_0 
  libnl              conda-forge/linux-aarch64::libnl-3.11.0-h86ecc28_0 
  libnvjitlink       conda-forge/linux-aarch64::libnvjitlink-12.6.85-h5101a13_0 
  libopenblas        conda-forge/linux-aarch64::libopenblas-0.3.28-pthreads_h9d3fd7e_1 
  libprotobuf        conda-forge/linux-aarch64::libprotobuf-5.28.2-h029595c_0 
  libsqlite          conda-forge/linux-aarch64::libsqlite-3.47.2-h5eb1b54_0 
  libstdcxx          conda-forge/linux-aarch64::libstdcxx-14.2.0-h3f4de04_1 
  libstdcxx-ng       conda-forge/linux-aarch64::libstdcxx-ng-14.2.0-hf1166c9_1 
  libsystemd0        conda-forge/linux-aarch64::libsystemd0-256.9-ha536d29_2 
  libtorch           conda-forge/linux-aarch64::libtorch-2.5.1-cuda126_hdf69377_206 
  libudev1           conda-forge/linux-aarch64::libudev1-256.9-h1187dce_2 
  libuuid            conda-forge/linux-aarch64::libuuid-2.38.1-hb4cce97_0 
  libuv              conda-forge/linux-aarch64::libuv-1.49.2-h86ecc28_0 
  libzlib            conda-forge/linux-aarch64::libzlib-1.3.1-h86ecc28_2 
  lz4-c              conda-forge/linux-aarch64::lz4-c-1.10.0-h5ad3122_1 
  markupsafe         conda-forge/linux-aarch64::markupsafe-3.0.2-py313h7815b11_1 
  mpc                conda-forge/linux-aarch64::mpc-1.3.1-h783934e_1 
  mpfr               conda-forge/linux-aarch64::mpfr-4.2.1-h2305555_3 
  mpmath             conda-forge/noarch::mpmath-1.3.0-pyhd8ed1ab_1 
  nccl               conda-forge/linux-aarch64::nccl-2.23.4.1-h35f4307_3 
  ncurses            conda-forge/linux-aarch64::ncurses-6.5-hcccb83c_1 
  networkx           conda-forge/noarch::networkx-3.4.2-pyh267e887_2 
  nomkl              conda-forge/noarch::nomkl-1.0-h5ca1d4c_0 
  numpy              conda-forge/linux-aarch64::numpy-2.2.0-py313haaed576_0 
  openssl            conda-forge/linux-aarch64::openssl-3.4.0-h86ecc28_0 
  pip                conda-forge/noarch::pip-24.3.1-pyh145f28c_0 
  python             conda-forge/linux-aarch64::python-3.13.1-h3e021d1_102_cp313 
  python_abi         conda-forge/linux-aarch64::python_abi-3.13-5_cp313 
  pytorch            conda-forge/linux-aarch64::pytorch-2.5.1-cuda126_py313h4ef7031_206 
  rdma-core          conda-forge/linux-aarch64::rdma-core-54.0-h1d056c8_1 
  readline           conda-forge/linux-aarch64::readline-8.2-h8fc344f_1 
  setuptools         conda-forge/noarch::setuptools-75.6.0-pyhff2d567_1 
  sleef              conda-forge/linux-aarch64::sleef-3.7-h8fb0607_2 
  sympy              conda-forge/noarch::sympy-1.13.3-pyh2585a3b_104 
  tk                 conda-forge/linux-aarch64::tk-8.6.13-h194ca79_0 
  typing_extensions  conda-forge/noarch::typing_extensions-4.12.2-pyha770c72_1 
  tzdata             conda-forge/noarch::tzdata-2024b-hc8b5060_0 
  zstd               conda-forge/linux-aarch64::zstd-1.5.6-h02f22dd_0

Environment info

active environment : xbg
    active env location : /home/ergocub/miniforge3/envs/xbg
            shell level : 1
       user config file : /home/ergocub/.condarc
populated config files : /home/ergocub/miniforge3/.condarc
                          /home/ergocub/.condarc
          conda version : 24.9.2
    conda-build version : not installed
         python version : 3.12.7.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=cortex_a72
                          __conda=24.9.2=0
                          __cuda=12.6=0
                          __glibc=2.35=0
                          __linux=5.15.148=0
                          __unix=0=0
       base environment : /home/ergocub/miniforge3  (writable)
      conda av data dir : /home/ergocub/miniforge3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-aarch64
https://conda.anaconda.org/conda-forge/noarch
          package cache : /home/ergocub/miniforge3/pkgs
                          /home/ergocub/.conda/pkgs
       envs directories : /home/ergocub/miniforge3/envs
                          /home/ergocub/.conda/envs
               platform : linux-aarch64
             user-agent : conda/24.9.2 requests/2.32.3 CPython/3.12.7 Linux/5.15.148-tegra ubuntu/22.04.5 glibc/2.35 solver/libmamba conda-libmamba-solver/24.9.0 libmambapy/1.5.9
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False
 
@traversaro traversaro added the bug Something isn't working label Dec 11, 2024
@traversaro
Author

fyi @giotherobot @S-Dafarra @carloscp3009

@traversaro
Author

The change I am suggesting to make is traversaro@bbc22e0. I am not opening the PR, to avoid overloading the CI unless necessary.

@traversaro
Author

The problem is that you also need cudnn and cublas compiled with sm87, and those are not available. So we need to first sort out that.

@Tobias-Fischer
Contributor

+1 for this request; I can strongly support the robotics need.

@traversaro
Author

The problem is that you also need cudnn and cublas compiled with sm87, and those are not available. So we need to first sort out that.

Just an update: by installing at the system level consistent versions of cudnn and cublas with sm_87 enabled, and forcing PyTorch to use those, the PyTorch package with sm_87 enabled from https://github.com/traversaro/pytorch-cpu-traversaro-fork/releases/tag/test-builds-on-aarch64-with-sm87 works fine.
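For anyone reproducing this workaround, a small sketch (assuming Linux, where /proc/self/maps lists the shared objects mapped into the process) to double-check which libcudnn/libcublas PyTorch actually picked up:

import torch

# Touch the GPU so the CUDA libraries get (lazily) loaded
t = torch.rand(10).to("cuda")
print("cudnn seen by torch:", torch.backends.cudnn.version())

# Inspect which library files are really mapped into this process;
# lazily-loaded libraries only show up after their first use
with open("/proc/self/maps") as f:
    libs = {line.split()[-1] for line in f
            if "libcudnn" in line or "libcublas" in line}
for path in sorted(libs):
    print(path)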

@hmaarrfk
Contributor

+1 from me

@traversaro
Author

Related: conda-forge/arm-variant-feedstock#3 .

@traversaro
Author

traversaro commented Dec 11, 2024

I discovered some interesting things about support in conda-forge for Jetson-like boards (probably obvious to NVIDIA/CUDA people, but I'll write them down for non-experts like me): the CUDA packages for linux-aarch64 are divided into sbsa (basically for NVIDIA ARM servers) and tegra (NVIDIA ARM embedded boards). The arm-variant meta-package can be used to select one or the other, and indeed if we try to install pytorch-gpu with arm-variant tegra, the installation fails:

CONDA_OVERRIDE_CUDA=12.6 CONDA_SUBDIR=linux-aarch64 conda create -n pytorchgputegra "arm-variant=*=*tegra*" pytorch-gpu
Could not solve for environment specs
The following packages are incompatible
├─ arm-variant * *tegra* is requested and can be installed;
└─ pytorch-gpu is not installable because it requires
   └─ pytorch [2.1.2 cuda120_py310h13b6a1d_203|2.1.2 cuda120_py38hca0abb7_204|...|2.5.1 cuda126_py310h90e4772_201], which requires
      └─ libcusparse [>=12.0.0.76,<13.0a0 |>=12.5.4.2,<13.0a0 ], which requires
         └─ arm-variant * sbsa, which conflicts with any installable versions previously reported.

and this is due to the fact that libcusparse (like all the CUDA packages in conda-forge) is only available for sbsa, not tegra.

Based on conda-forge/cusparselt-feedstock#48 (comment), at least for cudnn the compute architectures enabled differ between the two cases: taken together, the two variants cover sm_50, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86, sm_87 and sm_90, with the Jetson-specific capabilities (sm_53, sm_62, sm_72 and sm_87) only covered on the tegra (arm-variant=*=*tegra*) side; see the linked comment for the full per-variant table.

The differences are not that big. From the point of view of conda-forge, I think we can safely ignore sm_53, sm_62 and sm_72, as those are the architectures of the Jetson Nano, Jetson TX2 and Jetson Xavier, which, if I am not wrong, are not compatible with CUDA 12. So the only compute capability still relevant for CUDA 12 is sm_87; hence, even if the CUDA libraries need two different variants for sbsa and tegra, at the pytorch level I think it is just easier to also enable sm_87 as part of the existing linux-aarch64 build.
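For context, the board-to-capability mapping behind this reasoning, with the JetPack/CUDA pairings as I understand them (a sketch, not authoritative):

# Jetson modules, their compute capability, and the newest CUDA
# available for them through JetPack (to the best of my knowledge)
jetson = {
    "Jetson Nano":   ("sm_53", "CUDA 10.2, JetPack 4"),
    "Jetson TX2":    ("sm_62", "CUDA 10.2, JetPack 4"),
    "Jetson Xavier": ("sm_72", "CUDA 11.x, JetPack 5"),
    "Jetson Orin":   ("sm_87", "CUDA 12.x, JetPack 6"),
}
# Only the Orin reaches CUDA 12, so sm_87 is the only tegra-specific
# capability that CUDA 12-only conda-forge needs to care about.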

@traversaro
Author

Ok, I think at this point I have a clearer idea of the situation:

  • As conda-forge now only supports CUDA 12, and the only Jetson that supports CUDA 12 is the Jetson Orin with JetPack 6, I think we only need to care about sm_87 (the architecture of the Jetson Orin).
  • conda-forge packages that only build CUDA code and do not rely on any other cudatoolkit library (except the CUDA runtime) work fine on Jetson Orin if they also build for sm_87. This is for example the case of librealsense.
  • For a working PyTorch+CUDA installation on Jetson Orin, you need to compile PyTorch with 8.7 added to TORCH_CUDA_ARCH_LIST and have cudnn and cublas builds with sm_87 support. For the latter in conda-forge I opened "Add cuda packages with 8.7 compute capability enabled" cuda-feedstock#57, but that will probably take a while.
  • Even if cudnn and cublas with sm_87 support do not land in conda-forge in the very short term, I think the workaround for that is relatively easy (if brittle): you just need to make sure that PyTorch loads cudnn and cublas builds with sm_87 support, which can easily be downloaded from NVIDIA's redist archives. On the other hand, building PyTorch with 8.7 added to TORCH_CUDA_ARCH_LIST is far from easy, fast or trivial.

So, given all of this, my proposal (which also works as a TL;DR if you do not want to read the bullet points) is: while we wait for cudnn and cublas with sm_87 support in conda-forge, could we just add 8.7 to the linux-aarch64 builds? The benefit of this is twofold:

  • As soon as cudnn and cublas with sm_87 support are available in conda-forge, we will get a whole range of PyTorch+CUDA versions supported out of the box.
  • Starting now, we permit power users to use PyTorch+CUDA on Jetson Orin by just manually swapping the cudnn and cublas builds for sm_87-enabled ones, without forcing them to rebuild PyTorch.

Given that we are not particularly in a hurry, I think it does not make a lot of sense to open a PR that would use a lot of CI resources and users' disk space just to rebuild for the traversaro@bbc22e0 change. Instead, could it make sense to fold this change into one of the upcoming PRs? Thanks a lot in advance!

@hmaarrfk
Contributor

We can support 11. But we need a champion for it

@traversaro
Author

We can support 11. But we need a champion for it

Sure, but until (and unless) that happens, if we only build for CUDA >= 12, then at the Jetson Orin level we just need to enable the sm_87 architecture.

@hmaarrfk
Contributor

while we wait for cudnn and cublas with sm_87 support in conda-forge, could we just add 8.7 to linux-aarch64 builds?

so it will compile it, and users will have to do gymnastics to get it to work?

I'm ok with that if you can help document the gymnastics in a short summary.

@isuruf
Member

isuruf commented Dec 23, 2024

could we just add 8.7 to linux-aarch64 builds?

I'm not sure I understand the need here. CUDA binaries created for compute capability X.Y are compatible with X.Z where Z >= Y. Since we build for 8.6, it should work for 8.7. Can you check that the existing pytorch package in conda-forge works with the system cudnn, cublas, etc.?
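A sketch of that compatibility rule, checked against what a given torch binary actually ships (the helper below is hypothetical, assuming the usual "sm_XY" / "compute_XY" strings returned by torch.cuda.get_arch_list(); as the next comments note, the guarantee reportedly does not extend to Tegra):

import torch

def covered(arch_list, major, minor):
    """Rule from above: a cubin built for X.Y runs on X.Z with Z >= Y
    (same major); 'compute_XY' entries ship PTX the driver can JIT."""
    for entry in arch_list:
        kind, num = entry.split("_")               # e.g. ("sm", "86")
        e_major, e_minor = int(num[:-1]), int(num[-1])
        if kind == "sm" and e_major == major and e_minor <= minor:
            return True
        if kind == "compute" and (e_major, e_minor) <= (major, minor):
            return True
    return False

major, minor = torch.cuda.get_device_capability()
print(covered(torch.cuda.get_arch_list(), major, minor))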

@traversaro
Author

could we just add 8.7 to linux-aarch64 builds?

I'm not sure I understand the need here. CUDA binaries created for compute capability X.Y are compatible with X.Z where Z >= Y. Since we build for 8.6, it should work for 8.7. Can you check that the existing pytorch package in conda-forge works with the system cudnn, cublas, etc.?

Good point. To be honest, I am always a bit confused about how this works. Indeed, I will try (once I again have access to an Orin) using a stock conda-forge pytorch with system cudnn and cublas with sm_87 support, and report the result.

@carterbox
Member

As @isuruf mentioned, normally we have a binary compatibility guarantee that means packages built for any compute capability >=8.0,<=8.7 should be compatible with a device of compute capability 8.7. However, the documentation has the following admonition:

Binary compatibility is supported only for the desktop. It is not supported for Tegra. Also, the binary compatibility between desktop and Tegra is not supported.

I'm not sure whether this is due to incompatibility in host code, device code, or both. Thus, the solution is probably to target mutually exclusive capabilities for sbsa and tegra, rather than adding 8.7 to the sbsa build, since there it would never be used.
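If that is the direction taken, a purely illustrative sketch of what mutually exclusive arch lists could look like (the values below are my guesses, not the feedstock's actual lists):

# Hypothetical per-variant TORCH_CUDA_ARCH_LIST split: tegra devices never
# load the sbsa binaries (and vice versa), so neither list needs the
# other's capabilities. Values are illustrative only.
TORCH_CUDA_ARCH_LIST = {
    "sbsa":  "5.0;6.0;6.1;7.0;7.5;8.0;8.6;9.0+PTX",  # discrete/server GPUs
    "tegra": "8.7",                                   # Jetson Orin (CUDA 12)
}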
