Failed to initialize NVML: Unknown Error when runtime changed from docker to containerd #322
Comments
Note that the following command doesn't use the same code path for injecting GPUs as what K8s does.
Would it be possible to test this with
Also, could you provide information on the version of the device plugin you are using, the driver version, and the version of the NVIDIA Container Toolkit?
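A hedged set of commands that would answer those version questions (the package query assumes a Debian/Ubuntu host; adjust for your distribution):

```sh
# Driver version (also shown in the nvidia-smi header)
nvidia-smi

# NVIDIA Container Toolkit components
nvidia-container-cli --version
nvidia-container-runtime --version

# Installed NVIDIA packages (Debian/Ubuntu; use rpm -qa '*nvidia*' on RHEL-based hosts)
dpkg -l '*nvidia*'

# Device plugin version: check the image tag used by its DaemonSet
kubectl -n kube-system get daemonset -o wide | grep -i nvidia
```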
Two test cases for the above suggestions.
Device plugin:
NVIDIA packages version:
NVIDIA container library version:
It also does not work well. But it can work if I add
So to summarise: if you update the versions to the latest AND run the test pod in
This is expected, since this would mount all of
Could you enable debug output for the
You should also be able to use
(note how the
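Which component's debug output was meant did not survive extraction; assuming it is the NVIDIA Container Toolkit, debug logging is enabled by uncommenting the debug entries in its config file, roughly like this:

```toml
# /etc/nvidia-container-runtime/config.toml (default location; keys shown
# uncommented here purely as an illustration)

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
```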
If I test the pod with
After uncommenting the
If my container runtime is containerd, the
If my container runtime is dockerd, the
@elezar Hi, I have encountered a similar problem.
But after a moment, the devices.list was restored. Maybe that's the problem.
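The devices.list file mentioned here is the device cgroup allow-list; on cgroup v1 it can be checked directly from inside the container (paths assume cgroup v1, and 195 is the conventional NVIDIA major number):

```sh
# From inside the affected container: dump the device cgroup allow-list.
cat /sys/fs/cgroup/devices/devices.list

# When the GPUs "disappear", the NVIDIA character-device entries
# (e.g. "c 195:0 rwm", "c 195:255 rwm") are typically no longer listed.
```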
Are you running the plugin with the
I find that I want to make a PR to runc like this (a hedged sketch of what such helpers could look like follows below):
func isNVIDIADevice(rule *devices.Rule) bool {
func getNVIDIAEntryPath(rule *devices.Rule) string {
func getCharEntryPath(rule *devices.Rule) string {
Do you meet the same problem?
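The bodies of those helpers did not survive extraction. Purely as a sketch of the idea, not the commenter's actual runc patch: devices.Rule is runc's libcontainer/devices type, 195 is assumed as the major number of the /dev/nvidia* nodes, and the goal is only to illustrate mapping device cgroup rules to /dev and /dev/char entries.

```go
package main

import (
	"fmt"

	"github.com/opencontainers/runc/libcontainer/devices"
)

// nvidiaMajor is the conventional major number of the /dev/nvidia* character
// devices. nvidia-uvm and related nodes use other (dynamic) majors, which this
// simplified sketch ignores.
const nvidiaMajor = 195

// isNVIDIADevice reports whether a device cgroup rule refers to an NVIDIA GPU
// character device (simplified: major 195 only).
func isNVIDIADevice(rule *devices.Rule) bool {
	return rule.Type == devices.CharDevice && rule.Major == nvidiaMajor
}

// getNVIDIAEntryPath returns an assumed /dev path for an NVIDIA device rule.
func getNVIDIAEntryPath(rule *devices.Rule) string {
	return fmt.Sprintf("/dev/nvidia%d", rule.Minor)
}

// getCharEntryPath returns the /dev/char/<major>:<minor> alias that systemd
// resolves when it rebuilds a unit's device cgroup.
func getCharEntryPath(rule *devices.Rule) string {
	return fmt.Sprintf("/dev/char/%d:%d", rule.Major, rule.Minor)
}

func main() {
	r := &devices.Rule{Type: devices.CharDevice, Major: 195, Minor: 0, Permissions: "rwm", Allow: true}
	fmt.Println(isNVIDIADevice(r), getNVIDIAEntryPath(r), getCharEntryPath(r))
}
```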
@klueska Hi, I have encountered the same problem. I used the command cat /var/lib/kubelet/cpu_manager_state and got the following output:
{"policyName":"none","defaultCpuSet":"","checksum":1353318690}
Does this mean that the issue with the cpuset does not exist, and therefore it is not necessary to pass the PASS_DEVICE_SPECS parameter when starting?
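For reference, PASS_DEVICE_SPECS is the device plugin option that makes it return explicit device specs on Allocate, so the NVIDIA device nodes end up in the container's OCI spec rather than only being injected by the runtime hook. A minimal sketch of a DaemonSet that sets it (image tag and names are illustrative, not taken from this thread):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0   # tag illustrative
        env:
        # Pass explicit device specs so the NVIDIA nodes are part of the
        # container spec and stay in the device cgroup allow-list.
        - name: PASS_DEVICE_SPECS
          value: "true"
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```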
This PR has fixed this problem. |
Thanks for the confirmation @zvier. @gwgrisk Note that with newer versions of systemd and systemd cgroup management in use, it is also required to specify the
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. |
1. Issue or feature description
After changing the k8s container runtime from docker to containerd, executing nvidia-smi in a k8s GPU pod returns the error Failed to initialize NVML: Unknown Error and the pod cannot work properly.

2. Steps to reproduce the issue
I configured my containerd following https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#install-nvidia-container-toolkit-nvidia-docker2. The containerd config diff is:
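The actual diff was not captured; for reference, the containerd changes described in the linked guide amount to roughly the following (containerd config version 2 syntax; paths may differ on your system):

```toml
# /etc/containerd/config.toml -- register nvidia-container-runtime and make it
# the default runtime for CRI-created containers.
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
```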
Then I ran the base test case with the ctr command; it passed and returned the expected result.
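The exact ctr invocation did not survive extraction; the smoke test from the linked guide is along these lines (CUDA image tag is illustrative):

```sh
# Run nvidia-smi via containerd's ctr, which bypasses the CRI/kubelet code path
# used by the device plugin.
sudo ctr image pull docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04
sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
    cuda-smoke-test nvidia-smi
```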
When the GPU pod is created from k8s, the pod also runs, but executing nvidia-smi in the pod returns the error Failed to initialize NVML: Unknown Error. The test pod yaml is:

3. Information to attach (optional if deemed irrelevant)
I think the nvidia config on my host is right; the only change is that the container runtime is containerd used directly instead of docker. When we used docker as the runtime it worked well.
Common error checking:
Additional information that might help better understand your environment and reproduce the bug:
containerd -v: 1.6.5
uname -a: 4.18.0-2.4.3