Containerd pods logging "Failed to initialize NVML: Unknown Error" #932
Comments
Strange, so even after enabling CDI, I am getting the same behaviour. Before the pod is restarted: […] After: […]

The devices which are missing in the initially run pod all have a creation timestamp about the same as the present ones. Strange.
Hi, after switching over to the systemd cgroup driver (from cgroupfs) as part of migrating to cgroup v2 for both containerd and the kubelet, we noticed that pods which call `nvidia-smi` began failing with "Failed to initialize NVML: Unknown Error". We are not using the GPU Operator; instead we rely on the container toolkit only.
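For context, the cgroup driver switch referred to above is roughly the following pair of settings (a minimal sketch; the exact file paths and runtime entries depend on how containerd and the kubelet were configured):

```toml
# /etc/containerd/config.toml (containerd 1.7, CRI plugin):
# use the systemd cgroup driver for runc; the same option applies to an
# "nvidia" runtime entry if one was added via `nvidia-ctk runtime configure`.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
```

```yaml
# kubelet configuration (KubeletConfiguration): must match containerd's cgroup driver
cgroupDriver: systemd
```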
I found this issue #48 and spent some time reworking the suggestions from there:

- a `udev` rule, `72-nvidia-dev-char.rules`, named so that it executes after `71-nvidia.rules`: the two were racing, with `nvidia-ctk ...` trying to generate the symlinks before all of the driver modules were loaded and the corresponding `/dev/*` nodes existed (I think `nvidia-uvm` in particular)
- a systemd service ordered to run before `containerd`, as I was seeing instances where the symlinks were only created just after containerd and the pods had started up (a sketch of both pieces is below)

This all works as expected (the symlinks are created before containerd starts), but the initial pod that runs `nvidia-smi` still fails with "Failed to initialize NVML: Unknown Error". If I kill and recreate the pod, it starts working.
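For reference, a minimal sketch of those two pieces, assuming `nvidia-ctk` is installed at `/usr/bin/nvidia-ctk`; the rule and unit names are illustrative rather than the exact files from this setup:

```
# /etc/udev/rules.d/72-nvidia-dev-char.rules
# Recreate the /dev/char/<major>:<minor> symlinks whenever the nvidia driver binds
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
```

```ini
# /etc/systemd/system/nvidia-dev-char-symlinks.service (hypothetical unit name)
[Unit]
Description=Create /dev/char symlinks for NVIDIA device nodes
Before=containerd.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all

[Install]
WantedBy=multi-user.target
```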
I looked into the `/dev` filesystem of the bad and the good pod and noticed that `/dev/nvidia-uvm` is not present in the bad pod (sorry for the formatting, I had to `stat` the devices; after / before):

I also ran `crictl inspectp` / `crictl inspect` against both pods and compared the pod and container specs; there were no notable differences between the two.
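For anyone reproducing this, the comparison can be done along these lines (a sketch; the pod and container IDs are placeholders):

```bash
# Dump the sandbox and container specs for the failing and the working pod, then diff them
crictl inspectp <bad-pod-id>  > bad-pod.json
crictl inspectp <good-pod-id> > good-pod.json
crictl inspect  <bad-ctr-id>  > bad-ctr.json
crictl inspect  <good-ctr-id> > good-ctr.json
diff <(jq -S . bad-pod.json) <(jq -S . good-pod.json)
diff <(jq -S . bad-ctr.json) <(jq -S . good-ctr.json)
```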
I also ran bpftrace on file opens inside the container and saw that it was getting an error on `openat('/dev/nvidiactl')`, even though the char device is mounted into the container; not sure what to make of that.
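A bpftrace script for this check looks roughly like the following (a sketch, not the exact script used here): it watches `openat()` calls from `nvidia-smi` and prints the return value, where a negative result such as `-1` (`EPERM`) would point at a device-cgroup/permission problem rather than a missing device node.

```bash
# Trace openat() calls made by nvidia-smi inside the container and print the return value
bpftrace -e '
tracepoint:syscalls:sys_enter_openat /comm == "nvidia-smi"/ {
  @path[tid] = str(args->filename);
  @inflight[tid] = 1;
}
tracepoint:syscalls:sys_exit_openat /@inflight[tid]/ {
  printf("%s -> ret=%d\n", @path[tid], args->ret);
  delete(@path[tid]);
  delete(@inflight[tid]);
}'
```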
I am starting to think that CDI is going to be the better approach, as the existing solution seems a bit brittle. Does CDI support mounting `nvidia-smi` into GPU-enabled pods?
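For what it's worth, the CDI setup I have in mind is roughly the following; the spec generated by `nvidia-ctk` normally includes the driver utilities such as `nvidia-smi` in its container edits, but treat the paths and config keys below as a sketch against containerd 1.7 rather than a verified recipe:

```bash
# Generate a CDI spec describing the GPUs, driver libraries and utilities,
# then list the devices it exposes as a sanity check
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk cdi list
```

```toml
# /etc/containerd/config.toml — enable CDI in the CRI plugin (containerd 1.7+)
[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi = true
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
```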
Versions and stuff

- containerd 1.7.25
- Ubuntu 22.04.5 LTS