Summary
The default GPU/NVIDIA addon does not find the correct drivers, so containers crash with:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
What Should Happen Instead?
Everything should work after enabling the GPU addon with:
microk8s enable nvidia
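For reference, one generic way to confirm this (a standard kubectl check, not part of the addon itself) is to verify that the node advertises an NVIDIA GPU resource once the operator pods settle:
microk8s kubectl describe node | grep -i 'nvidia.com/gpu'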
Reproduction Steps
microk8s enable nvidia
Infer repository core for addon nvidia
Addon core/dns is already enabled
Addon core/helm3 is already enabled
WARNING: --set-as-default-runtime is deprecated, please use --gpu-operator-toolkit-version instead
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
Deploy NVIDIA GPU operator
Using auto GPU driver
W1222 14:39:49.104108 1716891 warnings.go:70] unknown field "spec.daemonsets.rollingUpdate"
W1222 14:39:49.104132 1716891 warnings.go:70] unknown field "spec.daemonsets.updateStrategy"
NAME: gpu-operator
LAST DEPLOYED: Fri Dec 22 14:39:47 2023
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
Deployed NVIDIA GPU operator
microk8s kubectl get pods --namespace gpu-operator-resources
Inspecting system
Inspecting Certificates
Inspecting services
Service snap.microk8s.daemon-cluster-agent is running
Service snap.microk8s.daemon-containerd is running
Service snap.microk8s.daemon-kubelite is running
Service snap.microk8s.daemon-k8s-dqlite is running
Service snap.microk8s.daemon-apiserver-kicker is running
Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
Copy processes list to the final report tarball
Copy disk usage information to the final report tarball
Copy memory usage information to the final report tarball
Copy server uptime to the final report tarball
Copy openSSL information to the final report tarball
Copy snap list to the final report tarball
Copy VM name (or none) to the final report tarball
Copy current linux distribution to the final report tarball
Copy asnycio usage and limits to the final report tarball
Copy inotify max_user_instances and max_user_watches to the final report tarball
Copy network configuration to the final report tarball
Inspecting kubernetes cluster
Inspect kubernetes cluster
Inspecting dqlite
Inspect dqlite
Building the report tarball
Report tarball is at /var/snap/microk8s/6089/inspection-report-20231221_170521.tar.gz
Hi @hansesm, @pappacena, thanks for the extended bug report and the documented steps. How are the GPU drivers installed/built on the systems in question?
The gpu-operator will attempt to install the driver at /run/nvidia/driver if no driver is already loaded. The steps above look like an installation where the gpu-operator installed the driver, but then you switched to using the drivers from the host instead. The linked issue seems to describe the same problem.
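A quick way to tell which of these cases applies on a given host (standard checks, adjust paths for your distribution):
lsmod | grep -i nvidia          # is a host driver module loaded?
nvidia-smi                      # does the host driver respond?
ls /run/nvidia/driver           # typically populated only when the operator-managed driver container is in use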
An easier way to ensure the right driver is used would be to enable the addon with an explicit driver mode, depending on your scenario:
# make sure that host drivers are used
microk8s enable nvidia --gpu-operator-driver=host
# make sure that the operator builds and installs the nvidia drivers
microk8s enable nvidia --gpu-operator-driver=operator
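If a previous attempt left a mismatched configuration behind, a clean retry could look like this (sketch; assumes the host driver is already installed and working):
microk8s disable nvidia
microk8s enable nvidia --gpu-operator-driver=host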
Hope this helps! Can you try this on a clean system and report back? Thanks!
The full set of commands used to reproduce and debug:
microk8s enable nvidia
microk8s kubectl get pods --namespace gpu-operator-resources
microk8s kubectl describe pod nvidia-operator-validator-hxfbf -n gpu-operator-resources
nvidia-smi
ls -la /run/nvidia/driver
cat /etc/docker/daemon.json
cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
microk8s inspect
microk8s kubectl describe clusterpolicies --all-namespaces
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
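For reference, the failing hook is looking for libnvidia-ml.so.1 under the driver root configured in config.toml; a rough way to check where the library actually lives (the paths below assume a typical Ubuntu x86_64 host):
ldconfig -p | grep libnvidia-ml
ls /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* 2>/dev/null
ls /run/nvidia/driver/usr/lib/x86_64-linux-gnu/libnvidia-ml.so* 2>/dev/null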
Can you suggest a fix?
Change the value in
/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
from
root = "/run/nvidia/driver"
to
root = "/"
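A sketch of applying that change non-interactively, assuming the key appears exactly once with this value (back up the file first):
sudo cp /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml.bak
sudo sed -i 's|root = "/run/nvidia/driver"|root = "/"|' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml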
Added the following runtime entry pointing at /usr/local/nvidia/toolkit/nvidia-container-runtime:
"runtimes": {
  "nvidia": {
    "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime",
    "runtimeArgs": []
  }
}
Added symlink:
ln -s /sbin /run/nvidia/driver/sbin
Restart MicroK8s:
microk8s stop
microk8s start
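One way to confirm the workaround afterwards (the label selector is an assumption and may differ between gpu-operator versions):
microk8s kubectl get pods -n gpu-operator-resources
microk8s kubectl -n gpu-operator-resources logs -l app=nvidia-operator-validator --tail=20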
Then all containers start up correctly!
Best regards!
EDIT: Found the following issue describing the same problem:
NVIDIA/gpu-operator#511