Add context around CUDA driver vs kernel versions #2
Comments
Perhaps
explains the full extent of this though. If CUDA drivers are mostly backwards compatible, then a Docker image that is meant to work on most machines with CUDA installed should perhaps target slightly older releases(?). Using my local machine as an example:
$ nvidia-smi | head -n 4
Thu Dec 1 18:35:01 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
I'm able to run
$ docker run --rm -ti --gpus all pyhf/cuda:0.7.0-jax-cuda-11.6.0-cudnn8
root@415cb7459135:/home/data# nvidia-smi
Fri Dec 2 04:09:45 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 49C P0 15W / N/A | 4MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@415cb7459135:/home/data# echo "${CUDA_VERSION}"
11.6.0
root@415cb7459135:/home/data# python /docker/jax_detect_GPU.py
XLA backend type: gpu
Number of GPUs found on system: 1
Active GPU index: 0
Active GPU name: NVIDIA GeForce RTX 3050 Ti Laptop GPU
root@415cb7459135:/home/data#
and the older
$ docker run --rm -ti --gpus all pyhf/cuda:0.6.3-jax-cuda-11.1
root@1986c33106f6:/home/data# nvidia-smi
Fri Dec 2 04:11:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 47C P0 15W / N/A | 4MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@1986c33106f6:/home/data# echo "${CUDA_VERSION}"
11.1.1
root@1986c33106f6:/home/data# curl -sL https://raw.githubusercontent.com/matthewfeickert/nvidia-gpu-ml-library-test/main/jax_detect_GPU.py | python
XLA backend type: gpu
Number of GPUs found on system: 1
Active GPU index: 0
Active GPU name: NVIDIA GeForce RTX 3050 Ti Laptop GPU
root@1986c33106f6:/home/data#
but the
$ docker run --rm -ti --gpus all pyhf/cuda:0.7.0-jax-cuda-11.8.0-cudnn8
root@8815d82e76c5:/home/data# nvidia-smi
Fri Dec 2 04:15:15 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 50C P0 15W / N/A | 4MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@8815d82e76c5:/home/data# echo "${CUDA_VERSION}"
11.8.0
root@8815d82e76c5:/home/data# python -c 'from jax.lib import xla_bridge; xla_bridge.get_backend()'
2022-12-02 04:15:40.881759: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2022-12-02 04:15:40.881850: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 510.85.2 does not match DSO version 520.61.5 -- cannot find working devices in this configuration
WARNING:jax._src.lib.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
root@8815d82e76c5:/home/data#
There's still the question, I guess, of if/how Python wheels compiled for newer CUDA architectures work with older CUDA versions. That is, will does
in them and then seeing if more modern versions of
@kratsg given all this, what is the
$ docker run --rm -ti --gpus all pyhf/cuda:0.7.0-jax-cuda-11.2.2
root@54ea11142a60:/home/data# nvidia-smi
Fri Dec 2 04:40:31 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 47C P0 15W / N/A | 4MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@54ea11142a60:/home/data# echo "${CUDA_VERSION}"
11.2.2
root@54ea11142a60:/home/data# python /docker/jax_detect_GPU.py
XLA backend type: gpu
Number of GPUs found on system: 1
Active GPU index: 0
Active GPU name: NVIDIA GeForce RTX 3050 Ti Laptop GPU
root@54ea11142a60:/home/data#
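The by-hand comparison in the sessions above (the CUDA version in the host's nvidia-smi header against each image's CUDA_VERSION) can be sketched in Python. This is only a heuristic that agrees with the four runs shown here (11.1.1, 11.2.2, and 11.6.0 work against a host reporting 11.6, while 11.8.0 does not); it is not NVIDIA's full compatibility rule. The function names are made up for illustration, and the header should be read on the host, since nvidia-smi inside the 11.8.0 container reports 11.8.

```python
import re


def parse_nvidia_smi_header(text):
    """Return (driver_version, cuda_version) strings from nvidia-smi output."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text
    )
    if match is None:
        raise ValueError("no 'Driver Version ... CUDA Version' header found")
    return match.group(1), match.group(2)


def major_minor(version):
    """Reduce a version string such as '11.6.0' to the tuple (11, 6)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def image_within_host_cuda(host_cuda, image_cuda):
    # Assumption (heuristic, not NVIDIA's rule): an image is expected to work
    # when its CUDA major.minor does not exceed what the host driver reports.
    return major_minor(image_cuda) <= major_minor(host_cuda)


# Header line copied from the host output earlier in this thread.
header = "| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |"
driver, host_cuda = parse_nvidia_smi_header(header)

for image_cuda in ("11.1.1", "11.2.2", "11.6.0", "11.8.0"):
    verdict = "ok" if image_within_host_cuda(host_cuda, image_cuda) else "too new"
    print(f"driver {driver}, host CUDA {host_cuda}, image CUDA {image_cuda}: {verdict}")
```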
Not to abuse your time, but @bdice, as you're an expert's expert when it comes to CUDA, do you have general recommendations (or resources to look at) for building CUDA-enabled Docker images for software that is meant to run on GPUs on remote clusters, and for making all the CUDA versions and binaries play nicely together? My current (weak) understanding is that you:
@matthewfeickert Hi! Sorry, I've been working through an email backlog and just saw this. I think your understanding is approximately correct.
This webpage is the definitive resource for CUDA compatibility: https://docs.nvidia.com/deploy/cuda-compatibility/index.html
It is admittedly complex, and I won't guarantee that my answers here are 100% correct. I have worked through a handful of exceptional cases for CUDA compatibility and I'm still learning a lot about the minutiae of this topic.
There are multiple kinds of compatibility described in that document. I will attempt to summarize some of the pieces I think are most important to know:
I haven't dealt with compatibility questions including
Conclusion: Leveraging CUDA compatibility is great if it works for your use case. If you're not sure about your application requirements (or your dependencies' requirements) or if things just aren't working, you can always build multiple containers for each version of CUDA you need to support and things should be fine.
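If several containers are built, one per CUDA version, picking which one to run on a given host could look roughly like the sketch below. The tags are the four pyhf/cuda images run earlier in this thread; the selection rule (newest image whose CUDA major.minor does not exceed the host's) is an assumption that matches those runs, not an official policy, and the helper names are hypothetical.

```python
import re

# Image tags taken from the docker run commands earlier in this thread.
IMAGES = [
    "pyhf/cuda:0.6.3-jax-cuda-11.1",
    "pyhf/cuda:0.7.0-jax-cuda-11.2.2",
    "pyhf/cuda:0.7.0-jax-cuda-11.6.0-cudnn8",
    "pyhf/cuda:0.7.0-jax-cuda-11.8.0-cudnn8",
]


def cuda_version_of(tag):
    """Extract (major, minor) from the 'cuda-X.Y' part of an image tag."""
    match = re.search(r"cuda-(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"no CUDA version found in tag: {tag}")
    return int(match.group(1)), int(match.group(2))


def newest_supported(images, host_cuda):
    """Return the newest image not exceeding the host CUDA version, or None."""
    host = tuple(int(part) for part in host_cuda.split(".")[:2])
    # Assumption: images built for a CUDA version newer than the host
    # driver reports are skipped rather than relying on forward compatibility.
    candidates = [tag for tag in images if cuda_version_of(tag) <= host]
    if not candidates:
        return None
    return max(candidates, key=cuda_version_of)


# Host CUDA version as reported by nvidia-smi on the local machine above.
print(newest_supported(IMAGES, "11.6"))
```

Skipping newer images outright is conservative; on hardware where forward compatibility is supported, a newer image might still run.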
I don't fully understand the subtleties of trying to match CUDA drivers on Ubuntu (https://github.com/matthewfeickert/nvidia-gpu-ml-library-test is basically just me recording what commands I typed that worked) and getting those to match with the kernel versions that different wheels with cudnn were built against.
In the https://github.com/CHTC/templates-GPUs examples they mention
which is why in PR #1 I set
htcondor-examples/noxfile.py
Line 18 in 0252615
and
htcondor-examples/htcondor_templates/chtc_hello_gpu/chtc_hello_gpu.sub
Lines 20 to 21 in 0252615
as I could get the CUDA 11.6.0 image to run on my local machine for interactive testing
and still assumed that CHTC would have a machine that supports it. I had originally tried with
nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
but that failed, which led me down a bit of a rabbit hole.
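For the 11.8.0 failure, the XLA log quoted in the comments reports the mismatch as "kernel version 510.85.2 does not match DSO version 520.61.5". A small sketch for pulling the two versions out of such a line is below. My reading, which is an assumption, is that "kernel version" is the host's driver kernel module and "DSO version" is the user-space driver library visible inside the container. The function name is hypothetical.

```python
import re

# Log line copied from the failed 11.8.0 run quoted in this thread.
LOG_LINE = (
    "kernel version 510.85.2 does not match DSO version 520.61.5 "
    "-- cannot find working devices in this configuration"
)


def parse_version_mismatch(line):
    """Return (kernel_version, dso_version) as integer tuples, or None."""
    match = re.search(
        r"kernel version ([\d.]+) does not match DSO version ([\d.]+)", line
    )
    if match is None:
        return None
    kernel, dso = (
        tuple(int(part) for part in version.split("."))
        for version in match.groups()
    )
    return kernel, dso


versions = parse_version_mismatch(LOG_LINE)
if versions is not None:
    kernel, dso = versions
    # Tuple comparison orders versions component by component.
    print("container library is newer than host driver:", dso > kernel)
```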
@kratsg I think you have a much better handle on how to try to match drivers and conditions. So maybe we could try to document an approach here or on https://github.com/pyhf/cuda-images about how to go about finding the right match of CUDA driver, nvidia/cuda Docker image, and software releases for the problems someone might want to solve. We could additionally think about building a battery of images against common CUDA versions if that might be helpful for running on more sites.