Containerd pods logging "Failed to initialize NVML: Unknown Error" #932

Open
alam0rt opened this issue Feb 19, 2025 · 3 comments

Comments

alam0rt (Contributor) commented Feb 19, 2025

Hi, after migrating to cgroup v2 and switching both containerd and the kubelet from the cgroupfs driver to the systemd cgroup driver, we noticed that pods calling nvidia-smi began failing with "Failed to initialize NVML: Unknown Error".

We are not using the GPU Operator; we rely on the container toolkit only.

I found issue #48 and spent some time adapting its suggestions:

  • I created the following udev rule, 72-nvidia-dev-char.rules. I named it so that it executes after 71-nvidia.rules, because there was a race condition where nvidia-ctk ... tried to generate the symlinks before all of the drivers had loaded and some /dev/* nodes (nvidia-uvm, I think) were still unavailable:
# This will create /dev/char symlinks to all device nodes
# This is required or else the nvidia device nodes will not be present
# under /dev/char and thus unavailable to containers run using the systemd driver
# This must run after the drivers have been loaded (after `71-nvidia.rules`)
# See https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", TAG+="systemd", ENV{SYSTEMD_WANTS}+="nvidia-dev-char.service"
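
For anyone following along: after dropping the rule file in, udev also needs to reload it. A minimal sketch, using standard udevadm invocations (nothing NVIDIA-specific):

# pick up the new rule file
udevadm control --reload-rules
# optionally replay device events so the rule fires for already-present hardware
udevadm trigger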

The service is ordered before containerd, as I was seeing instances where the symlinks were created just after containerd and the pods had already started up:

[Unit]
Description=NVIDIA /dev/char symlink creation
Wants=syslog.target
Before=containerd.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all

[Install]
WantedBy=multi-user.target
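
For reference, enabling the unit looks roughly like this (assuming the file is installed as /etc/systemd/system/nvidia-dev-char.service; the last line is just a quick sanity check):

systemctl daemon-reload
systemctl enable --now nvidia-dev-char.service
# the /dev/char symlinks should now exist before containerd comes up
ls -l /dev/char | grep nvidia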

This all works as expected (the symlinks are created before containerd starts), but the first pod that runs nvidia-smi still fails with "Failed to initialize NVML: Unknown Error". If I kill and recreate the pod, it starts working.

I looked at the /dev filesystem of the bad and good pods and noticed that /dev/nvidia-uvm is not present in the bad pod (sorry for the formatting; I had to stat the devices).

[screenshot: stat output of the /dev device nodes in the pod after recreation vs. before]
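
Roughly, the comparison was along these lines (the pod names here are illustrative, not the real ones):

# bad pod: /dev/nvidia-uvm missing
kubectl exec bad-pod -- sh -c 'stat -c "%n %t:%T %y" /dev/nvidia*'
# recreated pod: all device nodes present
kubectl exec good-pod -- sh -c 'stat -c "%n %t:%T %y" /dev/nvidia*'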

I also ran crictl inspect and crictl inspectp and compared the pod and container specs; there were no notable differences between the two.
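
Something along these lines (the IDs are placeholders):

crictl inspectp <bad-pod-id>  > bad-pod.json
crictl inspectp <good-pod-id> > good-pod.json
diff bad-pod.json good-pod.json

crictl inspect <bad-container-id>  > bad-ctr.json
crictl inspect <good-container-id> > good-ctr.json
diff bad-ctr.json good-ctr.json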

I also ran bpftrace against file opens inside the container and saw openat() returning an error for /dev/nvidiactl, even though that char device is mounted into the container; I'm not sure why.
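
Approximately what I ran (a sketch; filtering on the comm name is a simplification of the actual filter I used to scope it to the container):

bpftrace -e '
  tracepoint:syscalls:sys_enter_openat /comm == "nvidia-smi"/ { @fname[tid] = str(args->filename); }
  tracepoint:syscalls:sys_exit_openat  /comm == "nvidia-smi"/ { printf("openat(%s) = %d\n", @fname[tid], args->ret); delete(@fname[tid]); }
'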

I am starting to think that CDI is going to be the better approach, as the existing solution seems a bit brittle. Does CDI support mounting nvidia-smi into GPU-enabled pods?

Versions and stuff

# apt show nvidia-compute-utils-555
Package: nvidia-compute-utils-555
Version: 555.42.06-0ubuntu1

containerd 1.7.25
Ubuntu 22.04.5 LTS

alam0rt commented Feb 19, 2025

I additionally tested with nerdctl:

# nerdctl run -it --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu20.04 nvidia-smi
docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04:                                    resolved       |++++++++++++++++++++++++++++++++++++++| 
index-sha256:0654b44e2515f03b811496d0e2d67e9e2b81ca1f6ed225361bb3e3bb67d22e18:    done           |++++++++++++++++++++++++++++++++++++++| 
manifest-sha256:7d8fdd2a5e96ec57bc511cda1fc749f63a70e207614b3485197fd734359937e7: done           |++++++++++++++++++++++++++++++++++++++| 
config-sha256:d13839a3c4fbd332f324c135a279e14c432e90c8a03a9cedc43ddf3858f882a7:   done           |++++++++++++++++++++++++++++++++++++++| 
layer-sha256:1698c67699a3eee2a8fc185093664034bb69ab67c545ab6d976399d5500b2f44:    done           |++++++++++++++++++++++++++++++++++++++| 
layer-sha256:c5f2ffd06d8b1667c198d4f9a780b55c86065341328ab4f59d60dc996ccd5817:    done           |++++++++++++++++++++++++++++++++++++++| 
layer-sha256:25ad149ed3cff49ddb57ceb4418377f63c897198de1f9de7a24506397822de3e:    done           |++++++++++++++++++++++++++++++++++++++| 
layer-sha256:520797292d9250932259d95f471bef1f97712030c1d364f3f297260e5fee1de8:    done           |++++++++++++++++++++++++++++++++++++++| 
layer-sha256:ba7b66a9df40b8a1c1a41d58d7c3beaf33a50dc842190cd6a2b66e6f44c3b57b:    done           |++++++++++++++++++++++++++++++++++++++| 
elapsed: 6.0 s                                                                    total:  88.3 M (14.7 MiB/s)                                      
INFO[0006] No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4] 
INFO[0006] IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844] 
Wed Feb 19 05:27:14 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:1E.0 Off |                    0 |
| N/A   21C    P8             10W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Working as expected.

alam0rt commented Feb 19, 2025

Is there a good way to do a nerdctl run --gpus all but exclude mounting /dev/nvidia-uvm? I tried a few variations, but it seems the OCI hook overwrites anything I do to the spec.

alam0rt closed this as completed Feb 19, 2025
alam0rt reopened this Feb 19, 2025
alam0rt commented Feb 19, 2025

Strange: even after enabling CDI, I am getting the same behaviour.
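
For reference, "enabling CDI" here meant roughly the following (per the toolkit documentation; the config keys shown are for the containerd 1.7 CRI plugin, and our exact setup may differ slightly):

# generate the CDI spec on the host
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# containerd /etc/containerd/config.toml:
#   [plugins."io.containerd.grpc.v1.cri"]
#     enable_cdi = true
#     cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]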

Before the pod is restarted

bash-5.1# for f in /dev/*; do echo $f; done
/dev/core
/dev/fd
/dev/full
/dev/kmsg
/dev/mqueue
/dev/null
/dev/nvidia-uvm
/dev/nvidia0
/dev/nvidiactl
/dev/ptmx
/dev/pts
/dev/random
/dev/shm
/dev/stderr
/dev/stdin
/dev/stdout
/dev/termination-log
/dev/tty
/dev/urandom
/dev/zero

After

bash-5.1# for f in /dev/*; do echo $f; done                                                                                                                                                  
/dev/core
/dev/dri
/dev/fd
/dev/full
/dev/kmsg
/dev/mqueue
/dev/null
/dev/nvidia-caps
/dev/nvidia-modeset
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia0
/dev/nvidiactl
/dev/ptmx
/dev/pts
/dev/random
/dev/shm
/dev/stderr
/dev/stdin
/dev/stdout
/dev/termination-log
/dev/tty
/dev/urandom
/dev/zero

The devices that are missing from the initially created pod all have creation timestamps roughly the same as the ones that are present. Strange.
