
Failed to initialize NVML: Unknown Error when changing runtime from docker to containerd #322

zvier opened this issue Jul 12, 2022 · 14 comments


zvier commented Jul 12, 2022

1. Issue or feature description

After changing the k8s container runtime from docker to containerd, executing nvidia-smi in a k8s GPU pod returns the error Failed to initialize NVML: Unknown Error, and the pod cannot work properly.

2. Steps to reproduce the issue

I configured containerd following https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#install-nvidia-container-toolkit-nvidia-docker2. The containerd config diff is:

--- config.toml 2020-12-17 19:13:03.242630735 +0000
+++ /etc/containerd/config.toml 2020-12-17 19:27:02.019027793 +0000
@@ -70,7 +70,7 @@
   ignore_image_defined_volumes = false
   [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"
-      default_runtime_name = "runc"
+      default_runtime_name = "nvidia"
      no_pivot = false
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
@@ -94,6 +94,15 @@
         privileged_without_host_devices = false
         base_runtime_spec = ""
         [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
+            SystemdCgroup = true
+       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
+          privileged_without_host_devices = false
+          runtime_engine = ""
+          runtime_root = ""
+          runtime_type = "io.containerd.runc.v1"
+          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
+            BinaryName = "/usr/bin/nvidia-container-runtime"
+            SystemdCgroup = true
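
(For reference, a sketch of applying and verifying the change above; the crictl check is an assumption, not part of the original report:)

systemctl restart containerd
# confirm the CRI plugin now knows about the nvidia runtime
crictl info | grep -i nvidia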

Then I ran the basic test case with the ctr command; it passed and returned the expected output.

ctr image pull docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04  
ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi

When the GPU pod is created from k8s, the pod also runs, but executing nvidia-smi in the pod returns the error Failed to initialize NVML: Unknown Error. The test pod YAML is:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04"
    command:
      - sleep
      - "3600"
    resources:
      limits:
         nvidia.com/gpu: 1
  nodeName: test-node

3. Information to attach (optional if deemed irrelevant)

I think the NVIDIA config on my host is correct. The only change is the container runtime: we use containerd directly instead of docker. When we used docker as the runtime, it worked well.

Common error checking:

  • The k8s-device-plugin container logs
crictl logs 90969408d45c6
2022/07/11 23:39:21 Loading NVML
2022/07/11 23:39:21 Starting FS watcher.
2022/07/11 23:39:21 Starting OS watcher.
2022/07/11 23:39:21 Retreiving plugins.
2022/07/11 23:39:21 Starting GRPC server for 'nvidia.com/gpu'
2022/07/11 23:39:21 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia.sock
2022/07/11 23:39:21 Registered device plugin for 'nvidia.com/gpu' with Kubelet

Additional information that might help better understand your environment and reproduce the bug:

  • Containerd version from containerd -v
    1.6.5
  • Kernel version from uname -a
    4.18.0-2.4.3

elezar commented Jul 12, 2022

Note that the following command doesn't use the same code path for injecting GPUs as what K8s does.

ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi

Would it be possible to test this with nerdctl instead, or to ensure that the runtime is set instead of using the --gpus 0 flag?

Also, could you provide information on the version of the device plugin you are using, the driver version, and the version of the NVIDIA Container Toolkit?


zvier commented Jul 15, 2022

Note that the following command doesn't use the same code path for injecting GPUs as what K8s does.

ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi

Would it be possible to test this with nerdctl instead, or to ensure that the runtime is set instead of using the --gpus 0 flag?

Also, could you provide information on the version of the device plugin you are using, the driver version, and the version of the NVIDIA Container Toolkit?

Two test cases for the above suggestions:

  1. Use nerdctl instead of ctr
nerdctl run --network=host --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

No devices were found
  2. Use --runtime io.containerd.runc.v1 instead of --gpus 0
ctr run --runtime io.containerd.runc.v1 --rm  -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi

ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown

which nvidia-smi
/bin/nvidia-smi

Device plugin:

nvidia-k8s-device-plugin:1.0.0-beta6

NVIDIA packages version:

rpm -qa '*nvidia*'
libnvidia-container-tools-1.3.1-1.x86_64
nvidia-container-runtime-3.4.0-1.x86_64
libnvidia-container1-1.3.1-1.x86_64
nvidia-docker2-2.5.0-1.noarch
nvidia-container-toolkit-1.4.0-2.x86_64

NVIDIA container library version:

nvidia-container-cli -V
version: 1.3.1
build date: 2020-12-14T14:18+0000
build revision: ac02636a318fe7dcc71eaeb3cc55d0c8541c1072
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections


zvier commented Jul 15, 2022

We straced the nvidia-smi process in the container and found that access to the /dev/nvidiactl device is not permitted.

[screenshot]
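
(A sketch of the kind of check run inside the pod; the exact strace invocation below is an assumption, not taken from the screenshot:)

# strace must be available inside the container image
strace -f nvidia-smi 2>&1 | grep -E 'nvidiactl|EPERM'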


elezar commented Jul 15, 2022

@zvier those are very old versions for all the packages and the device plugin.

Would you be able to try with the latest versions:

  • nvidia-container-toolkit, libnvidia-container-tools, and libnvidia-container1: v1.10.0
  • device-plugin: v0.12.0


zvier commented Jul 15, 2022

@zvier those are very old versions for all the packages and the device plugin.

Would you be able to try with the latest versions:

  • nvidia-container-toolkit, libnvidia-container-tools, and libnvidia-container1: v1.10.0
  • device-plugin: v0.12.0

It also does not work well. But it can work if I add a securityContext field to my pod YAML like this:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04"
    command:
      - sleep
      - "36000"
    resources:
      limits:
         nvidia.com/gpu: 1
    securityContext:
      privileged: true
  nodeName: test-node-1


elezar commented Jul 15, 2022

So to summarise: if you update the versions to the latest AND run the test pod as privileged, then you're able to run nvidia-smi in the container.

This is expected since this would mount all of /dev/nv* into the container regardless and would then avoid the permission errors on /dev/nvidiactl.

Could you enable debug output for the nvidia-container-cli by uncommenting the #debug = lines in /etc/nvidia-container-runtime/config.toml and then including the output from /var/log/nvidia-container-toolkit.log here?
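
(A sketch of what the relevant lines look like once uncommented; the file paths match the default config shown later in this thread:)

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"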

You should also be able to use ctr directly in this case by running something like:

sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
    cuda-11.0.3-base-ubuntu20.04 nvidia-smi

(note how the runc-binary is set to the nvidia-container-runtime).


zvier commented Jul 15, 2022

ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
    cuda-11.0.3-base-ubuntu20.04 nvidia-smi

If the pod is tested with privileged, updating the NVIDIA package versions is not needed.

After uncommenting the #debug = lines in /etc/nvidia-container-runtime/config.toml and running the ctr run command, it prints OK. The output of /var/log/nvidia-container-toolkit.log is:

{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:39+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/default/cuda-11.0.3-base-ubuntu20.04/config.json","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Applied required modification to OCI specification","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Forwarding command to runtime","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:44+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:45+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:45+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:49+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-16T07:17:49+08:00"}

If my container runtime is containerd, the /etc/nvidia-container-runtime/config.toml is:

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"

# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
    "docker-runc",
    "runc",
]

mode = "auto"

    [nvidia-container-runtime.modes.csv]

    mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

If my container runtime is dockerd, the /etc/nvidia-container-runtime/config.toml is:

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"

yangfeiyu20102011 commented

@elezar Hi, I have encountered a similar problem.
The permissions of /dev/nvidia* are 'rw', but nvidia-smi fails.
[screenshot]
I find that the permissions in devices.list are not right.
[screenshot]
As root, I tried echo "c 195:* rwm" > /sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf9413023_9640_4bd8_b76f_b1b629642012.slice/cri-containerd-c33389a1c755d1d6fe2de531890db4bc5e821e41646ac6d2ff7aa83662f00c9e.scope/devices.allow

and /sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf9413023_9640_4bd8_b76f_b1b629642012.slice/cri-containerd-c33389a1c755d1d6fe2de531890db4bc5e821e41646ac6d2ff7aa83662f00c9e.scope/devices.list changed as expected.
[screenshot]

But after a moment, devices.list was restored. Maybe that is the problem.
Kubelet and containerd may update the cgroup devices at regular intervals.
How can this be solved?
Thanks!


klueska commented Aug 15, 2022

Are you running the plugin with the --pass-device-specs option? This flag was designed to avoid this exact issue: https://github.com/NVIDIA/k8s-device-plugin#as-command-line-flags-or-envvars
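
(For reference, a hedged sketch of enabling this on the device-plugin DaemonSet via the documented environment variable; the container name and image tag below are assumptions:)

# excerpt of the device-plugin DaemonSet container spec
containers:
- name: nvidia-device-plugin-ctr
  image: nvcr.io/nvidia/k8s-device-plugin:v0.12.0
  env:
  - name: PASS_DEVICE_SPECS
    value: "true"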

yangfeiyu20102011 commented

Are you running the plugin with the --pass-device-specs option? This flag was designed to avoid this exact issue: https://github.com/NVIDIA/k8s-device-plugin#as-command-line-flags-or-envvars

I find that runc update may also change devices.list.
[screenshot]
setUnitProperties(m.dbus, unitName, properties...) changes devices.list through systemd.
The properties are built by genV1ResourcesProperties; the deviceProperties in properties will include entry.Path = fmt.Sprintf("/dev/char/%d:%d", rule.Major, rule.Minor).
[screenshot]
/dev/nvidiactl cannot be found at /dev/char/195:255, while the right format should be DeviceAllow=/dev/char/195:255 rw.
[screenshot]

I want to make a PR to runc like this:
// "_ n:m _" rules are just a path in /dev/{block,char}/.
switch rule.Type {
case devices.BlockDevice:
    entry.Path = fmt.Sprintf("/dev/block/%d:%d", rule.Major, rule.Minor)
case devices.CharDevice:
    entry.Path = getCharEntryPath(rule)
}

func isNVIDIADevice(rule *devices.Rule) bool {
    // NVIDIA devices have major 195 and 507.
    return rule.Major == 195 || rule.Major == 507
}

func getNVIDIAEntryPath(rule *devices.Rule) string {
    str := "/dev/"
    switch rule.Major {
    case 195:
        switch rule.Minor {
        case 254:
            str = str + "nvidia-modeset"
        case 255:
            str = str + "nvidiactl"
        default:
            str = str + "nvidia" + strconv.Itoa(int(rule.Minor))
        }
    case 507:
        switch rule.Minor {
        case 0:
            str = str + "nvidia-uvm"
        case 1:
            str = str + "nvidia-uvm-tools"
        }
    }
    return str
}

func getCharEntryPath(rule *devices.Rule) string {
    if isNVIDIADevice(rule) {
        return getNVIDIAEntryPath(rule)
    }
    return fmt.Sprintf("/dev/char/%d:%d", rule.Major, rule.Minor)
}

Did you encounter the same problem?
Thank you! @klueska


gwgrisk commented Apr 20, 2023

@klueska Hi, I have encountered the same problem. I used the command cat /var/lib/kubelet/cpu_manager_state and got the following output:

{"policyName":"none","defaultCpuSet":"","checksum":1353318690}

Does this mean that the issue with the cpuset does not exist, and therefore it is not necessary to pass the PASS_DEVICE_SPECS parameter when starting?


zvier commented May 3, 2023

This PR has fixed this problem.


elezar commented May 3, 2023

Thanks for the confirmation @zvier.

@gwgrisk Note that with newer versions of systemd, when using systemd cgroup management, it is also required to specify the PASS_DEVICE_SPECS option. It is thus no longer limited to interactions with the GPU manager, since any systemd reload will cause a container to lose access to the underlying device nodes in this case.
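
(A hedged sketch of how this manifests without the option; the pod name is taken from the test YAML above, everything else is assumed:)

# on the GPU node: any systemd reload re-applies the unit's device rules
sudo systemctl daemon-reload
# inside the pod: device access is lost again
kubectl exec gpu-operator-test -- nvidia-smi
# -> Failed to initialize NVML: Unknown Error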

github-actions bot commented

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions bot added the lifecycle/stale label on Feb 28, 2024