WSL2 Support #318
Comments
@elezar to comment on whether this is supported by our container-toolkit.
Hi @mchikyt3, the combination you mention is untested by us, and I cannot provide a concrete answer. The NVIDIA Container Toolkit, which ensures that a launched container includes the required devices and libraries to use GPUs, does offer some support for WSL2. It should be noted, however, that there may be some use cases that do not work as expected. Also note that I am not sure whether the other operands, such as GPU Feature Discovery or the NVIDIA Device Plugin, will function as expected.
It appears that GPU Feature Discovery does not work properly. @elezar, are there any plans to address this? I have no problems running CUDA code inside containers on WSL2 with Docker or podman, but it doesn't work with the several Kubernetes distributions I tried. I posted several logs from my laptop on this MicroK8s thread and would be grateful if someone could help me solve this issue. Maybe the problem could be solved by creating a couple of symlinks.
Can someone fix this?
gpu-operator/assets/state-container-toolkit/0500_daemonset.yaml Lines 110 to 112 in 2f0a166
This is not working in WSL2; I confirmed this on k0s. EDIT: fixed, just run
Also make sure you edit the labels to cheat the GPU operator on the specific WSL2 node:

```yaml
feature.node.kubernetes.io/pci-10de.present: 'true'
nvidia.com/device-plugin.config: RTX-4070-Ti # needed because GFD is not available
nvidia.com/gpu.count: '1'
nvidia.com/gpu.deploy.container-toolkit: 'true'
nvidia.com/gpu.deploy.dcgm: 'true' # optional
nvidia.com/gpu.deploy.dcgm-exporter: 'true' # optional
nvidia.com/gpu.deploy.device-plugin: 'true'
nvidia.com/gpu.deploy.driver: 'false' # needs special treatment
nvidia.com/gpu.deploy.gpu-feature-discovery: 'false' # incompatible with WSL2
nvidia.com/gpu.deploy.node-status-exporter: 'false' # optional
nvidia.com/gpu.deploy.operator-validator: 'true'
nvidia.com/gpu.present: 'true'
nvidia.com/gpu.replicas: '16'
```

You can either auto-insert those labels if you use k0sctl or add them manually once the node is onboarded (see the sketch just below). The driver and container-toolkit are technically optional since WSL2 already installs all the prerequisites... but we still need to cheat the system.
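For the manual route, a minimal kubectl sketch for applying a few of those labels; "wsl2-node" is a hypothetical placeholder for your actual node name:

```sh
# Label the WSL2 node by hand so the GPU operator treats it as GPU-capable
# ("wsl2-node" is a placeholder; substitute your real node name)
kubectl label node wsl2-node \
  feature.node.kubernetes.io/pci-10de.present=true \
  nvidia.com/gpu.deploy.gpu-feature-discovery=false \
  nvidia.com/device-plugin.config=RTX-4070-Ti \
  --overwrite
```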
That way we can effectively bypass the GPU operator's checks, and the operator will then register the node as compatible with the nvidia runtime and run it. You can try to use a DaemonSet script to do that. Note also that if you have preinstalled drivers you don't need to touch the files in any case, but then you need to figure out how to pass the condition. I'm using k0s in my company's local cluster under WSL2, but this should apply to all k8s distributions that run under WSL2. By the way, this is the Helm install for the GPU operator that should work on k0s:

```yaml
cdi:
  enabled: false
daemonsets:
  tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
    - effect: NoSchedule
      key: k8s.wizpresso.com/wsl-node
      operator: Exists
devicePlugin:
  config:
    name: time-slicing-config
driver:
  enabled: true
operator:
  defaultRuntime: containerd
toolkit:
  enabled: true
  env:
    - name: CONTAINERD_CONFIG
      value: /etc/k0s/containerd.d/nvidia.toml
    - name: CONTAINERD_SOCKET
      value: /run/k0s/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "false"
```
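The time-slicing-config ConfigMap referenced by devicePlugin.config above is not shown in this thread. A hedged sketch of what it might look like, inferred from the RTX-4070-Ti config name and the gpu.replicas: '16' label earlier; the namespace is an assumption and must match wherever the operator is installed:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator   # assumption: the operator's install namespace
data:
  RTX-4070-Ti: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 16
```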
@wizpresso-steve-cy-fan, you wouldn't happen to have an install doc or script you could share for getting k0s set up on WSL2, by any chance?
@AntonOfTheWoods let me push the changes to GitLab first.
@AntonOfTheWoods see comment for full instructions on how to make this work locally.
@alexeadem thanks so much for this! I was dragging my feet to create images for wizpresso-steve-cy-fan's PRs, so this saved me some time. I am curious to know if you've had success with running a CUDA workload with this implemented? I am able to successfully get the gpu-operator Helm chart running with these values:

```yaml
cdi:
  enabled: false
daemonsets:
  tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
devicePlugin:
  image: k8s-device-plugin
  repository: eadem
  version: v0.14.3-ubuntu20.04
driver:
  enabled: true
operator:
  defaultRuntime: containerd
  image: gpu-operator
  repository: eadem
  version: v23.9.1-ubi8
  runtimeClassName: "nvidia"
toolkit:
  enabled: true
  env:
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "false"
  image: container-toolkit
  repository: eadem
  version: 1.14.3-ubuntu20.04
validator:
  driver:
    env:
      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"
  image: gpu-operator-validator
  repository: eadem
  version: v23.9.1-ubi8
```

My pods are now successfully getting past preemption when specifying GPU limits; however, when I try to run a GPU workload (e.g. nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04) it fails to run with the error:
Just curious to know if you're having this problem or not. Oh, and not that it matters much, but just a heads up that the custom docker image you linked for the operator in your comment is actually linking to your custom validator image. Thanks again!
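For anyone reproducing this, a minimal test-pod sketch for that sample workload. It assumes the nvidia RuntimeClass from the toolkit settings above; since CONTAINERD_SET_AS_DEFAULT is "false", the runtime class has to be requested explicitly:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia   # matches CONTAINERD_RUNTIME_CLASS in the values above
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1
```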
np @cbrendanprice, thanks for the Docker links; I fixed it. The NVIDIA driver, CUDA, toolkit, and operator versions are tightly coupled, so that error should be easily fixed by using the right versions. Here is an example of the versions needed for CUDA 12.2, and a full example of a CUDA workload in Kubeflow and directly in a pod under the operator. I don't see that error with the eadem images. Try this one instead; the link you provided is an old version.
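As a quick way to check which versions you are actually running before matching image tags, nvidia-smi inside the WSL2 distribution reports the Windows host driver and the highest CUDA version it supports. A minimal sketch, assuming nvidia-smi is on the PATH:

```sh
# Driver version as a bare string
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# The standard banner also shows the maximum supported "CUDA Version"
nvidia-smi
```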
If anyone is trying to use RKE2 as a base, I found that it forces the nvidia runtime to use SystemdCgroup = true in its containerd config. But my guess is that this falls back to cgroups v1?
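For reference, the kind of entry being described usually looks roughly like the sketch below in the generated containerd config; the exact file path and BinaryName depend on the distribution and how the toolkit was installed, so treat them as assumptions:

```toml
# Sketch of a containerd CRI runtime entry for the nvidia runtime
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"  # path is an assumption
    SystemdCgroup = true
```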
Hi, I wonder if it's possible to use the gpu-operator in a single-node MicroK8s cluster hosted on a WSL2 Ubuntu distribution. Thanks.