How to use GPU with k3d #1108
First of all, thanks for sharing this. It worked for me, and I also tested it with the latest version of k3s. I didn't explore the differences in detail, but the device-plugin-daemonset.yaml manifest in the k3d docs seems outdated compared to the one in NVIDIA's GitHub repo.
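For comparison, NVIDIA's k8s-device-plugin README deploys the plugin straight from a versioned release manifest; a sketch (the version tag is an assumption, pick the release matching your cluster):

```bash
# Deploy the upstream device plugin daemonset from a pinned release
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
```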
Thanks @arikmaor for posting this. The original config didn't work for me, but your config fixed it! Can I ask what is the source of your config.toml.tmpl?
I have been having issues with the nvidia-device-plugin crashing for me. Any tips for seeing why it crashed? I am having a hard time finding output. This is also happening on all the machines I have tested (2 so far).
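For crash-debugging in general, the previous container's logs and the pod events are the usual places to look; a sketch, assuming the plugin runs as the standard daemonset in kube-system with its usual name label (pod name is a placeholder):

```bash
# Find the device plugin pod, then pull logs from its previous (crashed) run
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds
kubectl -n kube-system logs <pod-name> --previous

# Pod events often show why a container was killed (OOM, missing runtime, etc.)
kubectl -n kube-system describe pod <pod-name>
```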
The config.toml.tmpl file is based on the original template, and the only change is adding default_runtime_name = "nvidia" under [plugins.cri.containerd].
Adding the line seems to cause the following error for me. I tried to search for some related topics: has anyone else encountered something similar? OK, I found it: #658 (comment). Make sure you have all versions aligned. :)
Thanks arikmaor, your comment was super helpful.
Thank you @arikmaor! I succeeded only after applying your modifications. My setup:
@all-contributors
I've updated the pull request to add @arikmaor! 🎉
@arikmaor I notice that you've put the default_runtime_name = "nvidia" setting directly under [plugins.cri.containerd].

Take a look at the original template: the [plugins.cri.containerd] section is already defined there; I simply added default_runtime_name = "nvidia" to it.
It is worth noting that the original template file has changed, and I'm not sure of the implications. I believe we now need to set
Yeah, this situation is honestly a bit maddening, since overwriting this whole configuration file just for a couple of lines, while some amount of auto-configuration is already happening, feels like overkill. However, I've become interested in the approach suggested in the k3s docs. I've tested this setup with the usual test workload.

Should we try to consolidate all of this information into a PR?
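As a quick sanity check when testing a setup like this, it helps to confirm the device plugin actually registered the GPU with the node before running workloads; a minimal sketch:

```bash
# The node should now advertise nvidia.com/gpu in its capacity/allocatable
kubectl describe nodes | grep -i "nvidia.com/gpu"
```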
I totally agree. My approach to getting GPU workloads working was based on the guide in the k3d documentation. When it didn't just work, I found a way to make it work with just a few tweaks, and described it in this issue. I like your idea because right now, when ... The problem I see with this solution is that setting ...
I'm still unable to make any of this work. It would be so much simpler if someone were kind enough to push the custom-built k3d docker image that supports GPUs :(
On a somewhat connected note: the docs mention that this whole GPU configuration does not work on WSL2. Currently that's still somewhat true, but a solution is on the horizon: there is an open merge request on the upstream repo that should address it.
I've updated the original comment (and tested).
FYI: the nvidia-container-runtime project has been superseded by the NVIDIA Container Toolkit, so the NVIDIA_CONTAINER_RUNTIME_VERSION parameter has become obsolete.
Hey all, I made an attempt to replicate what @arikmaor had done, and I have it working on our Lambda cloud instance. I also took the opportunity to provide more up-to-date nvidia/cuda, NVIDIA device plugin, and k3s dependencies, and created a better build script. The Dockerfile has some notable changes that improve the approach, and fix its bugs, a little bit. The image is located below and should work for k3d GPU support: https://github.com/justinthelaw/k3d-gpu-support/pkgs/container/k3d-gpu-support. Please let me know if there are any fixes or issues. I would love some feedback, and also support for other future use cases (like ROCm, Metal, etc.) involving k3d; I am still new to all of this. Important note: the CUDA version must match your host system's or containerized NVIDIA drivers. E.g., our Lambda instance has driver 535 installed, with a max CUDA version of 12.2.x, so this image's base image is set to 12.2.0 so that it is compatible.
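For anyone wanting to try it: a sketch of creating a cluster from a prebuilt image like this one. The registry path follows GitHub's container registry convention for the package linked above, and the tag is an assumption; check the package page for the actual tags.

```bash
# Create a single-node k3d cluster from a GPU-enabled k3s image,
# passing all host GPUs through to the node container
k3d cluster create gpu-cluster \
  --image ghcr.io/justinthelaw/k3d-gpu-support:v1.27.4-k3s1-cuda-12.2.0 \
  --gpus=all
```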
Took a stab at gathering all the info in this thread and submitted a PR with updated CUDA documentation. I took the suggestion from the k3s docs (also mentioned here by @rassie) and added a RuntimeClass definition. I did not run into any issues on my setup, but I only did some basic testing, so feedback is appreciated. It appears to work fine in WSL2 as well. Updated files are available in this repo: https://github.com/dbreyfogle/k3d-nvidia-runtime
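For readers following along, the RuntimeClass route from the k3s docs looks roughly like this. The RuntimeClass itself matches what the k3s documentation shows; the pod is an illustrative example, and the image tag is an assumption:

```yaml
# Register the containerd "nvidia" runtime as a Kubernetes RuntimeClass...
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
# ...then opt individual workloads into it via runtimeClassName,
# instead of making nvidia the cluster-wide default runtime.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.0-base-ubuntu22.04  # assumption: align with your driver
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```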
Docs updated in #1430 - thanks @dbreyfogle and everyone in this thread for providing the information!
I've just managed to get k3d running with GPU support, and it took a lot of effort getting this to work. The documentation was not updated for a long time, and most of the information is scattered around many PRs, issues, and Medium articles. I'm gonna describe what I did and I hope you can update the docs.
What should be installed on the host?
Based on the nvidia installation guide:
- Nvidia drivers
- The nvidia runtime package. There is a lot of confusion on that part, as there are many similar packages; the one you want is nvidia-docker2 (see the sketch after this list).
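A sketch of the install on a Debian/Ubuntu host, assuming Docker is already present and NVIDIA's apt repository has been set up per their installation guide:

```bash
# Install the Docker integration for the NVIDIA container runtime,
# then restart Docker so it picks up the new runtime
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
```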
Check that the host is configured correctly
If you see the nvidia-smi output with your graphics card name, then the host is configured correctly!
The custom k3s image
The custom image in the current k3d documentation requires the following tweaks (a Dockerfile sketch follows the list):
- Instead of COPY --from=k3s / /, do COPY --from=k3s /bin /bin
- Also copy the etc dir: COPY --from=k3s /etc /etc (I'm not sure this is a must)
- Set ENV CRI_CONFIG_FILE=/var/lib/rancher/k3s/agent/etc/crictl.yaml
- The widely suggested config.toml.tmpl file causes a cgroups-related error for all the pods. I've managed to solve this by creating a new file based on the original template, with a simple addition of default_runtime_name = "nvidia" under [plugins.cri.containerd]
Dockerfile:
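A sketch of a Dockerfile applying the tweaks above, modeled on the CUDA Dockerfile in the k3d docs; the K3S_TAG, the CUDA base image, and the toolkit install step are assumptions to align with your own versions:

```dockerfile
# Sketch only: k3d-docs-style CUDA image plus the tweaks described above.
# K3S_TAG and the CUDA base image are assumptions; match your own versions.
ARG K3S_TAG="v1.27.4-k3s1"
FROM rancher/k3s:$K3S_TAG AS k3s

FROM nvidia/cuda:12.2.0-base-ubuntu22.04

# Install the NVIDIA Container Toolkit per NVIDIA's apt instructions
RUN apt-get update && apt-get install -y curl gnupg \
 && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
      | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
 && curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
      | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
      > /etc/apt/sources.list.d/nvidia-container-toolkit.list \
 && apt-get update && apt-get install -y nvidia-container-toolkit \
 && rm -rf /var/lib/apt/lists/*

# Tweak: copy /bin and /etc individually instead of COPY --from=k3s / /
COPY --from=k3s /bin /bin
COPY --from=k3s /etc /etc

# Tweak: tell crictl where the k3s agent writes its config
ENV CRI_CONFIG_FILE=/var/lib/rancher/k3s/agent/etc/crictl.yaml

# Ship the modified containerd template (see next section)
COPY config.toml.tmpl /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl

VOLUME /var/lib/kubelet
VOLUME /var/lib/rancher/k3s
VOLUME /var/lib/cni
VOLUME /var/log

ENV PATH="$PATH:/bin/aux"
ENTRYPOINT ["/bin/k3s"]
CMD ["agent"]
```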
config.toml.tmpl:
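And a sketch of the corresponding template fragment; everything not shown stays as in the original k3s template:

```toml
# Fragment only: the rest of the file is kept from the original k3s template.
[plugins.cri.containerd]
  # The single addition described in this issue: make nvidia the default runtime
  default_runtime_name = "nvidia"

# If your template does not already define the runtime, it also needs a
# [plugins.cri.containerd.runtimes.nvidia] section pointing at
# nvidia-container-runtime (newer k3s versions can auto-detect this).
```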