GPU becomes unavailable after some time in Docker container #1469
Comments
Is your issue possibly related to this: […]? If you call […], the container can lose access to its GPUs. We are in the process of rearchitecting the container stack to avoid problems like these in the future, but that work is still a few months out.
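For anyone trying to reproduce this deliberately: the specific call mentioned in the comment above did not survive in this capture, so the sketch below is our own hedged guess that the trigger is a systemd daemon reload on the host (as reported elsewhere for the same symptom); adjust the image and container command to your setup.

```bash
# Hypothetical reproduction sketch: the exact command referenced in the comment above was
# lost in this capture, so this assumes the trigger is a systemd daemon reload on the host.

# 1. Start a long-running GPU container that reports the moment NVML access breaks:
docker run -d --rm --gpus all --name nvml-test nvidia/cuda:11.2.1-devel-ubuntu20.04 \
  bash -c 'while true; do nvidia-smi > /dev/null 2>&1 || echo "NVML failure at $(date)"; sleep 1; done'

# 2. On the host, reload systemd; on affected setups this re-applies device-cgroup rules
#    and can revoke the container's access to /dev/nvidia*:
sudo systemctl daemon-reload

# 3. Watch the container's log for the failure message:
docker logs -f nvml-test
```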
Hey, thanks a lot for the fast answer @klueska. I'll try to see whether the bug can be reproduced by this.
No worries. The underlying issue is summarized here: […]. Whether it's a call to […] or something else that triggers it, it's a fundamental flaw in the way […] injects GPUs into containers.
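To observe that flaw on an affected machine, here is a hedged sketch (our addition; it assumes cgroup v1 with Docker's cgroupfs layout, and uses 195, the standard character-device major for the NVIDIA devices) that checks whether the container's device allow-list has lost the NVIDIA entries and temporarily re-grants them:

```bash
# Inside an affected container (cgroup v1): the allow-list should contain the NVIDIA
# character devices, e.g. "c 195:* rw" (unrestricted containers show "a *:* rwm" instead).
cat /sys/fs/cgroup/devices/devices.list

# On the host: temporarily re-grant access to a running container (cgroupfs layout assumed;
# nvidia-uvm has a separate, dynamically assigned major and may need the same treatment).
CID=$(docker ps -q | head -n 1)   # adjust to the affected container's ID
echo 'c 195:* rwm' | sudo tee /sys/fs/cgroup/devices/docker/"$CID"*/devices.allow
```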
In the end we found a working configuration by downgrading the machines to Ubuntu 18.04, which gave us the combination of the old, working versions of the nvidia container libraries we used under 16.04 and up-to-date driver packages.
Thanks again for pointing us in the direction of the nvidia container libraries @klueska.
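If a downgrade like this is the chosen route, here is a small sketch (our addition; package names as used on Ubuntu/Debian, and they differ slightly between stack versions) for freezing the working versions so routine upgrades do not pull the stack forward again:

```bash
# Hold the currently installed, known-good container-stack packages at their versions.
sudo apt-mark hold nvidia-docker2 nvidia-container-toolkit libnvidia-container1 libnvidia-container-tools
apt-mark showhold   # verify which packages are now held back
```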
PS: I was able to reproduce the […].
@tobigue Just an update: I believe the underlying issue you are experiencing is related to this: […]. I have proposed the following patch to upstream K8s to help work around this and will backport it to 1.19, 1.20, and 1.21 once it is merged: kubernetes/kubernetes#101771. It is not a fix of the root cause (for that you will need to update to a newer […]).
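Outside of Kubernetes, one interim mitigation that has been discussed for this class of failure (hedged here, since whether it helps depends on the cgroup driver and stack versions in use) is to pass the NVIDIA device nodes to `docker run` explicitly, so they are part of the container spec that the low-level runtime manages rather than being injected behind its back:

```bash
# Hand the NVIDIA device nodes to Docker explicitly in addition to --gpus.
# The exact set of /dev/nvidia* nodes varies per machine; list them with: ls /dev/nvidia*
docker run --rm -it --gpus all \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia-uvm-tools \
  --device /dev/nvidia0 \
  nvidia/cuda:11.2.1-devel-ubuntu20.04 bash
```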
@klueska
I still have this problem on a machine based on nvcr.io/nvidia/pytorch:23.08-py3 (and on several other machines). Has this been addressed, and do I maybe have to update the host in some way? Or do I still have to wait? The host is Ubuntu 22.04; should I upgrade it to solve this?
If it's helpful: the systems I have with version 530.41.03 of the NVIDIA driver are fine; the ones recently upgraded to 535.129.03 are having the issue.
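To check where a given machine falls relative to the driver versions mentioned above (standard `nvidia-smi` query options; the same driver is visible inside containers because the toolkit bind-mounts it from the host):

```bash
# Print the installed NVIDIA driver version on the host (or inside a container).
nvidia-smi --query-gpu=driver_version --format=csv,noheader
cat /proc/driver/nvidia/version
```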
1. Issue or feature description
Hello,
after updating the software on some of our workstations, we have the problem that GPUs become unavailable inside a running Docker container.
We first noticed this when PyTorch experiments failed on the second script called in the container with a `RuntimeError: No CUDA GPUs are available`.
While trying to debug this, we noticed that also just starting the container with `nvidia-docker run --rm -it nvidia/cuda:11.2.1-devel-ubuntu20.04 bash` and running a `watch -n 1 nvidia-smi` inside the container does not work as expected. At first the output is as expected, but after some time (which varies between a few seconds and several hours) the output changes to `Failed to initialize NVML: Unknown Error`.
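To pin down exactly when the failure happens, a small polling loop (our own sketch, not part of the original report) can stand in for `watch` inside the container and record the first failure:

```bash
#!/usr/bin/env bash
# Poll nvidia-smi once per second and log the moment it first starts failing
# (e.g. with "Failed to initialize NVML: Unknown Error").
while true; do
  if ! nvidia-smi > /dev/null 2>&1; then
    echo "nvidia-smi first failed at $(date --iso-8601=seconds)"
    nvidia-smi 2>&1 | head -n 1   # capture the actual error message
    break
  fi
  sleep 1
done
```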
We could reproduce the error with different Docker images, such as `nvidia/cuda:11.2.1-devel-ubuntu20.04`, and with images based on `nvcr.io/nvidia/pytorch:20.12-py3` and `pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime`.
We have reproduced this bug on different workstations with completely different hardware and GPUs (GTX 1080 Ti and RTX 3090).
Setups that do NOT work (GTX 1080 Ti and RTX 3090 workstations) are:
Ubuntu 20.04 (nvidia-docker2 2.5.0-1):
A setup that DOES work (on the same GTX 1080 Ti machine) is:
Ubuntu 16.04 (nvidia-docker2 2.0.3+docker18.09.2-1):
So we suspect that the problem lies in newer versions of the kernel, the driver, or nvidia-docker on the host machine.
We are looking for advice on how to debug this further and fix the problem.
What are things we could try to run on the host and inside the container, while a container is in the erroneous state, to find out what exactly the problem is?
Thanks for any help!
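One concrete set of things to compare before and after the error shows up (our suggestion, not part of the original report; all commands are standard tools, and the host-side cgroup path assumes Docker's cgroupfs layout on cgroup v1):

```bash
# Inside the container, while nvidia-smi still works and again after it fails:
ls -l /dev/nvidia*                  # device nodes injected by libnvidia-container
cat /proc/driver/nvidia/version     # is the driver still visible through /proc?
nvidia-smi; echo "exit code: $?"    # record the exact error message and exit code

# On the host, while the container is in the erroneous state:
dmesg | grep -i -E 'nvidia|nvrm' | tail -n 30            # kernel-side complaints, if any
CID=$(docker ps -q | head -n 1)                          # adjust to the affected container
cat /sys/fs/cgroup/devices/docker/"$CID"*/devices.list   # device-cgroup permissions (cgroup v1)
```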
2. Steps to reproduce the issue
E.g. run `nvidia-docker run --rm -it nvidia/cuda:11.2.1-devel-ubuntu20.04 bash` on a system with Ubuntu 20.04 and run `watch -n 1 nvidia-smi` inside the container (might take minutes to several hours).
3. Information to attach (optional if deemed irrelevant)
- `nvidia-container-cli -k -d /dev/tty info`
- `uname -a`
  `Linux ws-3090-enterprise 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux`
- `dmesg`
- `nvidia-smi -a`
- `docker version`
- `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`
- `nvidia-container-cli -V`
- `/var/log/nvidia-container-runtime.log`
- `/var/log/nvidia-container-toolkit.log`
`nvidia-docker run --rm -it nvidia/cuda:11.2.1-devel-ubuntu20.04 bash` -> `watch nvidia-smi`
or
`nvidia-docker run --rm -it nvcr.io/nvidia/pytorch:20.12-py3 bash` -> `watch nvidia-smi`
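As a convenience (our addition, not part of the issue template), the outputs requested under point 3 can be collected into a single report file on the host:

```bash
# Collect the requested diagnostics into one report file (run on the host in an
# interactive shell; drop or adjust lines that do not apply to your distribution).
{
  echo '== nvidia-container-cli info =='; nvidia-container-cli -k -d /dev/tty info
  echo '== uname -a ==';                  uname -a
  echo '== nvidia-smi -a ==';             nvidia-smi -a
  echo '== docker version ==';            docker version
  echo '== nvidia packages ==';           dpkg -l '*nvidia*' 2>/dev/null || rpm -qa '*nvidia*'
  echo '== nvidia-container-cli -V ==';   nvidia-container-cli -V
  echo '== dmesg (tail) ==';              dmesg | tail -n 200
} > nvidia-debug-report.txt 2>&1
```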