Containerd pods logging "Failed to initialize NVML: Unknown Error" #932

Open
alam0rt opened this issue Feb 19, 2025 · 3 comments

Comments

alam0rt (Contributor) commented Feb 19, 2025

Hi, after migrating to cgroup v2 and switching both containerd and the kubelet from the cgroupfs driver to the systemd cgroup driver, we noticed that pods calling nvidia-smi began failing with "Failed to initialize NVML: Unknown Error".

We are not using the GPU Operator; we rely on the container toolkit only.

I found issue #48 and spent some time adapting its suggestions:

  • I created the following udev rule, 72-nvidia-dev-char.rules. I named it so that it executes after 71-nvidia.rules, because there was a race condition where nvidia-ctk ... tried to generate the symlinks before all of the drivers had loaded and some /dev/* nodes (nvidia-uvm, I think) were still unavailable:
# This will create /dev/char symlinks to all device nodes
# This is required or else the nvidia device nodes will not be present
# under /dev/char and thus unavailable to containers run using the systemd driver
# This must run after the drivers have been loaded (after `71-nvidia.rules`)
# See https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", TAG+="systemd", ENV{SYSTEMD_WANTS}+="nvidia-dev-char.service"
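
For anyone following along: after dropping the rule file in, udev also needs to reload it. A minimal sketch, using standard udevadm invocations (nothing NVIDIA-specific):

# pick up the new rule file
udevadm control --reload-rules
# optionally replay device events so the rule fires for already-present hardware
udevadm trigger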

The service is ordered before containerd, as I was seeing instances where the symlinks were created just after containerd and the pods had already started up:

[Unit]
Description=NVIDIA /dev/char symlink creation
Wants=syslog.target
Before=containerd.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all

[Install]
WantedBy=multi-user.target
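
For reference, enabling the unit looks roughly like this (assuming the file is installed as /etc/systemd/system/nvidia-dev-char.service; the last line is just a quick sanity check):

systemctl daemon-reload
systemctl enable --now nvidia-dev-char.service
# the /dev/char symlinks should now exist before containerd comes up
ls -l /dev/char | grep nvidia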

This all works as expected (the symlinks are created before containerd starts), but the first pod that runs nvidia-smi still fails with "Failed to initialize NVML: Unknown Error". If I kill and recreate the pod, it starts working.

I looked at the /dev filesystem of the bad and good pods and noticed that /dev/nvidia-uvm is not present in the bad pod (sorry for the formatting; I had to stat the devices).

[screenshot: stat output of the /dev device nodes in the pod after recreation vs. before]
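
Roughly, the comparison was along these lines (the pod names here are illustrative, not the real ones):

# bad pod: /dev/nvidia-uvm missing
kubectl exec bad-pod -- sh -c 'stat -c "%n %t:%T %y" /dev/nvidia*'
# recreated pod: all device nodes present
kubectl exec good-pod -- sh -c 'stat -c "%n %t:%T %y" /dev/nvidia*'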

I also ran crictl inspect and crictl inspectp and compared the pod and container specs; there were no notable differences between the two.
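
Something along these lines (the IDs are placeholders):

crictl inspectp <bad-pod-id>  > bad-pod.json
crictl inspectp <good-pod-id> > good-pod.json
diff bad-pod.json good-pod.json

crictl inspect <bad-container-id>  > bad-ctr.json
crictl inspect <good-container-id> > good-ctr.json
diff bad-ctr.json good-ctr.json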

I also ran bpftrace against file opens inside the container and saw openat() returning an error for /dev/nvidiactl, even though that char device is mounted into the container; I'm not sure why.
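
Approximately what I ran (a sketch; filtering on the comm name is a simplification of the actual filter I used to scope it to the container):

bpftrace -e '
  tracepoint:syscalls:sys_enter_openat /comm == "nvidia-smi"/ { @fname[tid] = str(args->filename); }
  tracepoint:syscalls:sys_exit_openat  /comm == "nvidia-smi"/ { printf("openat(%s) = %d\n", @fname[tid], args->ret); delete(@fname[tid]); }
'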

I am starting to think that CDI is going to be the better approach, as the existing solution seems a bit brittle. Does CDI support mounting nvidia-smi into GPU-enabled pods?

Versions and stuff

# apt show nvidia-compute-utils-555
Package: nvidia-compute-utils-555
Version: 555.42.06-0ubuntu1

containerd 1.7.25
Ubuntu 22.04.5 LTS

alam0rt commented Feb 19, 2025

I additionally tested with nerdctl:

# nerdctl run -it --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu20.04 nvidia-smi
docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04:                                    resolved       |++++++++++++++++++++++++++++++++++++++| 
index-sha256:0654b44e2515f03b811496d0e2d67e9e2b81ca1f6ed225361bb3e3bb67d22e18:    done           |++++++++++++++++++++++++++++++++++++++| 
manifest-sha256:7d8fdd2a5e96ec57bc511cda1fc749f63a70e207614b3485197fd734359937e7: done           |++++++++++++++++++++++++++++++++++++++| 
config-sha256:d13839a3c4fbd332f324c135a279e14c432e90c8a03a9cedc43ddf3858f882a7:   done           |++++++++++++++++++++++++++++++++++++++| 
layer-sha256:1698c67699a3eee2a8fc185093664034bb69ab67c545ab6d976399d5500b2f44:    done           |++++++++++++++++++++++++++++++++++++++| 
layer-sha256:c5f2ffd06d8b1667c198d4f9a780b55c86065341328ab4f59d60dc996ccd5817:    done           |++++++++++++++++++++++++++++++++++++++| 
layer-sha256:25ad149ed3cff49ddb57ceb4418377f63c897198de1f9de7a24506397822de3e:    done           |++++++++++++++++++++++++++++++++++++++| 
layer-sha256:520797292d9250932259d95f471bef1f97712030c1d364f3f297260e5fee1de8:    done           |++++++++++++++++++++++++++++++++++++++| 
layer-sha256:ba7b66a9df40b8a1c1a41d58d7c3beaf33a50dc842190cd6a2b66e6f44c3b57b:    done           |++++++++++++++++++++++++++++++++++++++| 
elapsed: 6.0 s                                                                    total:  88.3 M (14.7 MiB/s)                                      
INFO[0006] No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4] 
INFO[0006] IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844] 
Wed Feb 19 05:27:14 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:1E.0 Off |                    0 |
| N/A   21C    P8             10W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Working as expected.

alam0rt commented Feb 19, 2025

Is there a good way to do a nerdctl run --gpus all but exclude mounting /dev/nvidia-uvm? I tried a few variations, but it seems the OCI hook overwrites anything I do to the spec.

alam0rt closed this as completed Feb 19, 2025
alam0rt reopened this Feb 19, 2025
alam0rt commented Feb 19, 2025

Strange: even after enabling CDI, I am getting the same behaviour.
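
For reference, "enabling CDI" here meant roughly the following (per the toolkit documentation; the config keys shown are for the containerd 1.7 CRI plugin, and our exact setup may differ slightly):

# generate the CDI spec on the host
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# containerd /etc/containerd/config.toml:
#   [plugins."io.containerd.grpc.v1.cri"]
#     enable_cdi = true
#     cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]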

Before the pod is restarted

bash-5.1# for f in /dev/*; do echo $f; done
/dev/core
/dev/fd
/dev/full
/dev/kmsg
/dev/mqueue
/dev/null
/dev/nvidia-uvm
/dev/nvidia0
/dev/nvidiactl
/dev/ptmx
/dev/pts
/dev/random
/dev/shm
/dev/stderr
/dev/stdin
/dev/stdout
/dev/termination-log
/dev/tty
/dev/urandom
/dev/zero

After

bash-5.1# for f in /dev/*; do echo $f; done                                                                                                                                                  
/dev/core
/dev/dri
/dev/fd
/dev/full
/dev/kmsg
/dev/mqueue
/dev/null
/dev/nvidia-caps
/dev/nvidia-modeset
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia0
/dev/nvidiactl
/dev/ptmx
/dev/pts
/dev/random
/dev/shm
/dev/stderr
/dev/stdin
/dev/stdout
/dev/termination-log
/dev/tty
/dev/urandom
/dev/zero

The devices that are missing from the initially created pod all have creation timestamps roughly the same as the ones that are present. Strange.
