
Failed to initialize NVML: Unknown Error when changing runtime from docker to containerd #322

zvier opened this issue Jul 12, 2022 · 14 comments


zvier commented Jul 12, 2022

1. Issue or feature description

After changing the k8s container runtime from docker to containerd, executing nvidia-smi in a k8s GPU pod returns the error Failed to initialize NVML: Unknown Error, and the pod cannot work properly.

2. Steps to reproduce the issue

I configured containerd following https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#install-nvidia-container-toolkit-nvidia-docker2. The containerd config diff is:

--- config.toml 2020-12-17 19:13:03.242630735 +0000
+++ /etc/containerd/config.toml 2020-12-17 19:27:02.019027793 +0000
@@ -70,7 +70,7 @@
   ignore_image_defined_volumes = false
   [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"
-      default_runtime_name = "runc"
+      default_runtime_name = "nvidia"
      no_pivot = false
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
@@ -94,6 +94,15 @@
         privileged_without_host_devices = false
         base_runtime_spec = ""
         [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
+            SystemdCgroup = true
+       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
+          privileged_without_host_devices = false
+          runtime_engine = ""
+          runtime_root = ""
+          runtime_type = "io.containerd.runc.v1"
+          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
+            BinaryName = "/usr/bin/nvidia-container-runtime"
+            SystemdCgroup = true
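
(For reference, a sketch of applying and verifying the change above; the crictl check is an assumption, not part of the original report:)

systemctl restart containerd
# confirm the CRI plugin now knows about the nvidia runtime
crictl info | grep -i nvidia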

Then I ran the basic test case with the ctr command; it passed and returned the expected output.

ctr image pull docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04  
ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi

When the GPU pod is created from k8s, the pod also runs, but executing nvidia-smi in the pod returns the error Failed to initialize NVML: Unknown Error. The test pod YAML is:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04"
    command:
      - sleep
      - "3600"
    resources:
      limits:
         nvidia.com/gpu: 1
  nodeName: test-node

3. Information to attach (optional if deemed irrelevant)

I think the NVIDIA config on my host is correct. The only change is the container runtime: we use containerd directly instead of docker. When we used docker as the runtime, it worked well.

Common error checking:

  • The k8s-device-plugin container logs
crictl logs 90969408d45c6
2022/07/11 23:39:21 Loading NVML
2022/07/11 23:39:21 Starting FS watcher.
2022/07/11 23:39:21 Starting OS watcher.
2022/07/11 23:39:21 Retreiving plugins.
2022/07/11 23:39:21 Starting GRPC server for 'nvidia.com/gpu'
2022/07/11 23:39:21 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia.sock
2022/07/11 23:39:21 Registered device plugin for 'nvidia.com/gpu' with Kubelet

Additional information that might help better understand your environment and reproduce the bug:

  • Containerd version from containerd -v
    1.6.5
  • Kernel version from uname -a
    4.18.0-2.4.3

elezar commented Jul 12, 2022

Note that the following command doesn't use the same code path for injecting GPUs as what K8s does.

ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi

Would it be possible to test this with nerdctl instead, or to ensure that the runtime is set instead of using the --gpus 0 flag?

Also, could you provide information on the version of the device plugin you are using, the driver version, and the version of the NVIDIA Container Toolkit?


zvier commented Jul 15, 2022

Note that the following command doesn't use the same code path for injecting GPUs as what K8s does.

ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi

Would it be possible to test this with nerdctl instead, or to ensure that the runtime is set instead of using the --gpus 0 flag?

Also, could you provide information on the version of the device plugin you are using, the driver version, and the version of the NVIDIA Container Toolkit?

Two test cases for the above suggestions:

  1. Use nerdctl instead of ctr
nerdctl run --network=host --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

No devices were found
  2. Use --runtime io.containerd.runc.v1 instead of --gpus 0
ctr run --runtime io.containerd.runc.v1 --rm  -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi

ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown

which nvidia-smi
/bin/nvidia-smi

Device plugin:

nvidia-k8s-device-plugin:1.0.0-beta6

NVIDIA packages version:

rpm -qa '*nvidia*'
libnvidia-container-tools-1.3.1-1.x86_64
nvidia-container-runtime-3.4.0-1.x86_64
libnvidia-container1-1.3.1-1.x86_64
nvidia-docker2-2.5.0-1.noarch
nvidia-container-toolkit-1.4.0-2.x86_64

NVIDIA container library version:

nvidia-container-cli -V
version: 1.3.1
build date: 2020-12-14T14:18+0000
build revision: ac02636a318fe7dcc71eaeb3cc55d0c8541c1072
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections


zvier commented Jul 15, 2022

We straced the nvidia-smi process in the container and found that access to the /dev/nvidiactl device is not permitted.

[screenshot]
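
(A sketch of the kind of check run inside the pod; the exact strace invocation below is an assumption, not taken from the screenshot:)

# strace must be available inside the container image
strace -f nvidia-smi 2>&1 | grep -E 'nvidiactl|EPERM'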


elezar commented Jul 15, 2022

@zvier those are very old versions for all the packages and the device plugin.

Would you be able to try with the latest versions:

  • nvidia-container-toolkit, libnvidia-container-tools, and libnvidia-container1: v1.10.0
  • device-plugin: v0.12.0


zvier commented Jul 15, 2022

@zvier those are very old versions for all the packages and the device plugin.

Would you be able to try with the latest versions:

  • nvidia-container-toolkit, libnvidia-container-tools, and libnvidia-container1: v1.10.0
  • device-plugin: v0.12.0

It also does not work well. But it can work if I add a securityContext field to my pod YAML like this:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04"
    command:
      - sleep
      - "36000"
    resources:
      limits:
         nvidia.com/gpu: 1
    securityContext:
      privileged: true
  nodeName: test-node-1


elezar commented Jul 15, 2022

So to summarise: if you update the versions to the latest AND run the test pod as privileged, then you're able to run nvidia-smi in the container.

This is expected since this would mount all of /dev/nv* into the container regardless and would then avoid the permission errors on /dev/nvidiactl.

Could you enable debug output for the nvidia-container-cli by uncommenting the #debug = lines in /etc/nvidia-container-runtime/config.toml and then including the output from /var/log/nvidia-container-toolkit.log here?
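
(A sketch of what the relevant lines look like once uncommented; the file paths match the default config shown later in this thread:)

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"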

You should also be able to use ctr directly in this case by running something like:

sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
    cuda-11.0.3-base-ubuntu20.04 nvidia-smi

(note how the runc-binary is set to the nvidia-container-runtime).


zvier commented Jul 15, 2022

ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
    cuda-11.0.3-base-ubuntu20.04 nvidia-smi

If the pod is tested with privileged, updating the NVIDIA package versions is not needed.

After uncommenting the #debug = lines in /etc/nvidia-container-runtime/config.toml and running the ctr run command, it prints OK. The output of /var/log/nvidia-container-toolkit.log is:

{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/default/cuda-11.0.3-base-ubuntu20.04/config.json","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Applied required modification to OCI specification","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Forwarding command to runtime","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:45+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:45+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:49+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:49+08:00"}

If my container runtime is containerd, the /etc/nvidia-container-runtime/config.toml is:

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"

# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
    "docker-runc",
    "runc",
]

mode = "auto"

    [nvidia-container-runtime.modes.csv]

    mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

If my container runtime is dockerd, the /etc/nvidia-container-runtime/config.toml is:

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"

yangfeiyu20102011 commented

@elezar Hi, I have encountered a similar problem.
The permissions of /dev/nvidia* are 'rw', but nvidia-smi fails.
[screenshot]
I find that the permissions in devices.list are not right.
[screenshot]
As root, I tried echo "c 195:* rwm" > /sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf9413023_9640_4bd8_b76f_b1b629642012.slice/cri-containerd-c33389a1c755d1d6fe2de531890db4bc5e821e41646ac6d2ff7aa83662f00c9e.scope/devices.allow

and /sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf9413023_9640_4bd8_b76f_b1b629642012.slice/cri-containerd-c33389a1c755d1d6fe2de531890db4bc5e821e41646ac6d2ff7aa83662f00c9e.scope/devices.list changed as expected.
[screenshot]

But after a moment, devices.list was restored. Maybe that is the problem.
Kubelet and containerd may update the cgroup devices at regular intervals.
How can this be solved?
Thanks!


klueska commented Aug 15, 2022

Are you running the plugin with the --pass-device-specs option? This flag was designed to avoid this exact issue: https://github.com/NVIDIA/k8s-device-plugin#as-command-line-flags-or-envvars
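
(For reference, a hedged sketch of enabling this on the device-plugin DaemonSet via the documented environment variable; the container name and image tag below are assumptions:)

# excerpt of the device-plugin DaemonSet container spec
containers:
- name: nvidia-device-plugin-ctr
  image: nvcr.io/nvidia/k8s-device-plugin:v0.12.0
  env:
  - name: PASS_DEVICE_SPECS
    value: "true"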

yangfeiyu20102011 commented

Are you running the plugin with the --pass-device-specs option? This flag was designed to avoid this exact issue: https://github.com/NVIDIA/k8s-device-plugin#as-command-line-flags-or-envvars

I find that runc update may also change devices.list.
[screenshot]
setUnitProperties(m.dbus, unitName, properties...) changes devices.list through systemd.
The properties are built by genV1ResourcesProperties; the deviceProperties in properties will include entry.Path = fmt.Sprintf("/dev/char/%d:%d", rule.Major, rule.Minor).
[screenshot]
/dev/nvidiactl cannot be found at /dev/char/195:255, while the right format should be DeviceAllow=/dev/char/195:255 rw.
[screenshot]

I want to make a PR to runc like this:
// "_ n:m _" rules are just a path in /dev/{block,char}/.
switch rule.Type {
case devices.BlockDevice:
    entry.Path = fmt.Sprintf("/dev/block/%d:%d", rule.Major, rule.Minor)
case devices.CharDevice:
    entry.Path = getCharEntryPath(rule)
}

func isNVIDIADevice(rule *devices.Rule) bool {
    // NVIDIA devices have major 195 and 507.
    return rule.Major == 195 || rule.Major == 507
}

func getNVIDIAEntryPath(rule *devices.Rule) string {
    str := "/dev/"
    switch rule.Major {
    case 195:
        switch rule.Minor {
        case 254:
            str = str + "nvidia-modeset"
        case 255:
            str = str + "nvidiactl"
        default:
            str = str + "nvidia" + strconv.Itoa(int(rule.Minor))
        }
    case 507:
        switch rule.Minor {
        case 0:
            str = str + "nvidia-uvm"
        case 1:
            str = str + "nvidia-uvm-tools"
        }
    }
    return str
}

func getCharEntryPath(rule *devices.Rule) string {
    if isNVIDIADevice(rule) {
        return getNVIDIAEntryPath(rule)
    }
    return fmt.Sprintf("/dev/char/%d:%d", rule.Major, rule.Minor)
}

Did you encounter the same problem?
Thank you! @klueska


gwgrisk commented Apr 20, 2023

@klueska Hi, I have encountered the same problem. I used the command cat /var/lib/kubelet/cpu_manager_state and got the following output:

{"policyName":"none","defaultCpuSet":"","checksum":1353318690}

Does this mean that the issue with the cpuset does not exist, and therefore it is not necessary to pass the PASS_DEVICE_SPECS parameter when starting?


zvier commented May 3, 2023

This PR has fixed this problem.


elezar commented May 3, 2023

Thanks for the confirmation @zvier.

@gwgrisk Note that with newer versions of systemd, when using systemd cgroup management, it is also required to specify the PASS_DEVICE_SPECS option. It is thus no longer limited to interactions with the GPU manager, since any systemd reload will cause a container to lose access to the underlying device nodes in this case.
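
(A hedged sketch of how this manifests without the option; the pod name is taken from the test YAML above, everything else is assumed:)

# on the GPU node: any systemd reload re-applies the unit's device rules
sudo systemctl daemon-reload
# inside the pod: device access is lost again
kubectl exec gpu-operator-test -- nvidia-smi
# -> Failed to initialize NVML: Unknown Error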

github-actions bot commented

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions bot added the lifecycle/stale label on Feb 28, 2024