WSL2 Support #318
Comments
@elezar to comment on whether this is supported by our container-toolkit.
Hi @mchikyt3, the combination you mention is untested by us, and I cannot provide a concrete answer. The NVIDIA Container Toolkit, which ensures that a launched container includes the required devices and libraries to use GPUs, does offer some support for WSL2. It should be noted, however, that there may be some use cases that do not work as expected. Also note that I am not sure whether the other operands, such as GPU Feature Discovery or the NVIDIA Device Plugin, will function as expected.
It appears that GPU Feature Discovery does not work properly. @elezar, are there any plans to address this? I have no problems running CUDA code inside containers on WSL2 with Docker or podman, but it doesn't work with the several Kubernetes distributions I tried. I posted several logs from my laptop on this MicroK8s thread and would be grateful if someone could help me solve this issue. Maybe the problem could be solved by creating a couple of symlinks.
Can someone fix this?
gpu-operator/assets/state-container-toolkit/0500_daemonset.yaml Lines 110 to 112 in 2f0a166
This is not working in WSL2; I confirmed this on k0s. EDIT: fixed, just run
Also make sure you edit the labels to cheat the GPU operator on the specific WSL2 node:

```yaml
feature.node.kubernetes.io/pci-10de.present: 'true'
nvidia.com/device-plugin.config: RTX-4070-Ti # needed because GFD is not available
nvidia.com/gpu.count: '1'
nvidia.com/gpu.deploy.container-toolkit: 'true'
nvidia.com/gpu.deploy.dcgm: 'true' # optional
nvidia.com/gpu.deploy.dcgm-exporter: 'true' # optional
nvidia.com/gpu.deploy.device-plugin: 'true'
nvidia.com/gpu.deploy.driver: 'false' # needs special treatment
nvidia.com/gpu.deploy.gpu-feature-discovery: 'false' # incompatible with WSL2
nvidia.com/gpu.deploy.node-status-exporter: 'false' # optional
nvidia.com/gpu.deploy.operator-validator: 'true'
nvidia.com/gpu.present: 'true'
nvidia.com/gpu.replicas: '16'
```

You can either auto-insert those labels if you use k0sctl or add them manually once the node is onboarded (see the sketch just below). The driver and container-toolkit are technically optional since WSL2 already installs all the prerequisites... but we still need to cheat the system.
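For the manual route, a minimal kubectl sketch for applying a few of those labels; "wsl2-node" is a hypothetical placeholder for your actual node name:

```sh
# Label the WSL2 node by hand so the GPU operator treats it as GPU-capable
# ("wsl2-node" is a placeholder; substitute your real node name)
kubectl label node wsl2-node \
  feature.node.kubernetes.io/pci-10de.present=true \
  nvidia.com/gpu.deploy.gpu-feature-discovery=false \
  nvidia.com/device-plugin.config=RTX-4070-Ti \
  --overwrite
```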
That way we can effectively bypass the GPU operator's checks, and the operator will then register the node as compatible with the nvidia runtime and run it. You can try to use a DaemonSet script to do that. Note also that if you have preinstalled drivers you don't need to touch the files in any case, but then you need to figure out how to pass the condition. I'm using k0s in my company's local cluster under WSL2, but this should apply to all k8s distributions that run under WSL2. By the way, this is the Helm install for the GPU operator that should work on k0s:

```yaml
cdi:
  enabled: false
daemonsets:
  tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
    - effect: NoSchedule
      key: k8s.wizpresso.com/wsl-node
      operator: Exists
devicePlugin:
  config:
    name: time-slicing-config
driver:
  enabled: true
operator:
  defaultRuntime: containerd
toolkit:
  enabled: true
  env:
    - name: CONTAINERD_CONFIG
      value: /etc/k0s/containerd.d/nvidia.toml
    - name: CONTAINERD_SOCKET
      value: /run/k0s/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "false"
```
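The time-slicing-config ConfigMap referenced by devicePlugin.config above is not shown in this thread. A hedged sketch of what it might look like, inferred from the RTX-4070-Ti config name and the gpu.replicas: '16' label earlier; the namespace is an assumption and must match wherever the operator is installed:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator   # assumption: the operator's install namespace
data:
  RTX-4070-Ti: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 16
```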
@wizpresso-steve-cy-fan, you wouldn't happen to have an install doc or script you could share for getting k0s set up on WSL2, by any chance?
@AntonOfTheWoods let me push the changes to GitLab first.
@AntonOfTheWoods see comment for full instructions on how to make this work locally.
@alexeadem thanks so much for this! I was dragging my feet to create images for wizpresso-steve-cy-fan's PRs, so this saved me some time. I am curious to know if you've had success with running a CUDA workload with this implemented? I am able to successfully get the gpu-operator Helm chart running with these values:

```yaml
cdi:
  enabled: false
daemonsets:
  tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
devicePlugin:
  image: k8s-device-plugin
  repository: eadem
  version: v0.14.3-ubuntu20.04
driver:
  enabled: true
operator:
  defaultRuntime: containerd
  image: gpu-operator
  repository: eadem
  version: v23.9.1-ubi8
  runtimeClassName: "nvidia"
toolkit:
  enabled: true
  env:
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "false"
  image: container-toolkit
  repository: eadem
  version: 1.14.3-ubuntu20.04
validator:
  driver:
    env:
      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"
  image: gpu-operator-validator
  repository: eadem
  version: v23.9.1-ubi8
```

My pods are now successfully getting past preemption when specifying GPU limits; however, when I try to run a GPU workload (e.g. nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04) it fails to run with the error:
Just curious to know if you're having this problem or not. Oh, and not that it matters much, but just a heads up that the custom docker image you linked for the operator in your comment is actually linking to your custom validator image. Thanks again!
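For anyone reproducing this, a minimal test-pod sketch for that sample workload. It assumes the nvidia RuntimeClass from the toolkit settings above; since CONTAINERD_SET_AS_DEFAULT is "false", the runtime class has to be requested explicitly:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia   # matches CONTAINERD_RUNTIME_CLASS in the values above
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1
```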
np @cbrendanprice, thanks for the Docker links; I fixed it. The NVIDIA driver, CUDA, toolkit, and operator versions are tightly coupled, so that error should be easily fixed by using the right versions. Here is an example of the versions needed for CUDA 12.2, and a full example of a CUDA workload in Kubeflow and directly in a pod under the operator. I don't see that error with the eadem images. Try this one instead; the link you provided is an old version.
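As a quick way to check which versions you are actually running before matching image tags, nvidia-smi inside the WSL2 distribution reports the Windows host driver and the highest CUDA version it supports. A minimal sketch, assuming nvidia-smi is on the PATH:

```sh
# Driver version as a bare string
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# The standard banner also shows the maximum supported "CUDA Version"
nvidia-smi
```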
If anyone is trying to use RKE2 as a base, I found that it forces the nvidia runtime to use SystemdCgroup = true in its containerd config. But my guess is that this falls back to cgroups v1?
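For reference, the kind of entry being described usually looks roughly like the sketch below in the generated containerd config; the exact file path and BinaryName depend on the distribution and how the toolkit was installed, so treat them as assumptions:

```toml
# Sketch of a containerd CRI runtime entry for the nvidia runtime
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"  # path is an assumption
    SystemdCgroup = true
```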
Hi, I wonder if it's possible to use the gpu-operator in a single-node MicroK8s cluster hosted on a WSL2 Ubuntu distribution. Thanks.