Issues under WSL2 #1

Closed
qingfengfenga opened this issue Apr 14, 2024 · 3 comments · Fixed by #2

Comments

@qingfengfenga
Contributor

Issues

Thank you very much for your work. I have attempted to run K3s and CUDA workloads under WSL2. Based on this issue and the files provided in your repository, I have run some tests and it seems almost successful.

The current problem is that the nvidia-device-plugin pod can execute nvidia-smi, but its logs report that the graphics card cannot be recognized.

I suspect it may be due to nvcc being unavailable. Do you have any ideas?

System: Windows 11 23H2
Runtime: Docker Desktop 4.28.0 (139021)
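
Before involving Kubernetes at all, it can be worth confirming that Docker Desktop's WSL2 backend passes the GPU through to a plain container (the CUDA image tag below is just an example; substitute any base tag available on Docker Hub):

$ docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
# If this prints the same table as nvidia-smi on the host, the Docker/WSL2
# GPU passthrough itself is working and the problem is in the k8s layer.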

nvidia-device-plugin log

$ kubectl logs nvidia-device-plugin-daemonset-vvpkz -n kube-system
I0414 10:00:33.522494       1 main.go:154] Starting FS watcher.
I0414 10:00:33.522555       1 main.go:161] Starting OS watcher.
I0414 10:00:33.522912       1 main.go:176] Starting Plugins.
I0414 10:00:33.522931       1 main.go:234] Loading configuration.
I0414 10:00:33.522979       1 main.go:242] Updating config with default resource matching patterns.
I0414 10:00:33.523113       1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0414 10:00:33.523131       1 main.go:256] Retreiving plugins.
I0414 10:00:33.524465       1 factory.go:107] Detected NVML platform: found NVML library
I0414 10:00:33.524495       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0414 10:00:33.541706       1 main.go:287] No devices found. Waiting indefinitely.
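
Since the plugin reports "No devices found", the node ends up advertising zero nvidia.com/gpu, which matches the FailedScheduling events on the cuda-vector-add pod further down. One way to confirm what the scheduler sees (a rough check, assuming a single-node cluster):

$ kubectl describe node | grep -i "nvidia.com/gpu"
# When the plugin has registered the GPU, Capacity/Allocatable show e.g.:
#   nvidia.com/gpu:  1
# With "No devices found", the resource is absent or reports 0.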

Running nvidia-smi / nvcc inside the nvidia-device-plugin pod

root@nvidia-device-plugin-daemonset-t68w2:/# nvidia-smi
Sun Apr 14 10:25:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.73.01              Driver Version: 552.12         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:01:00.0 Off |                  N/A |
| 31%   28C    P8             16W /  250W |    1515MiB /  22528MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@nvidia-device-plugin-daemonset-t68w2:/#
root@nvidia-device-plugin-daemonset-t68w2:/# nvcc -V
bash: nvcc: command not found
root@nvidia-device-plugin-daemonset-t68w2:/#
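
Note that nvcc ships with the CUDA toolkit rather than with the driver, so its absence from the device-plugin image is expected and should not by itself prevent GPU detection; the plugin relies on NVML, which it already found. On WSL2 the driver user-space libraries are normally exposed under /usr/lib/wsl/lib and the GPU appears as /dev/dxg, so a rough sanity check inside the pod might look like the following (these paths are WSL-specific assumptions and may not be mapped into the container, depending on how the NVIDIA container runtime is configured; adjust the pod name):

$ kubectl exec -n kube-system nvidia-device-plugin-daemonset-t68w2 -- ls /usr/lib/wsl/lib
$ kubectl exec -n kube-system nvidia-device-plugin-daemonset-t68w2 -- ls -l /dev/dxg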

cuda-vector-add pod describe

$ kubectl describe pod cuda-vector-add
Name:             cuda-vector-add
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Containers:
  cuda-vector-add:
    Image:      tingweiwu/cuda-vector-add:v0.1
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fljnc (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-fljnc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  16m                  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Warning  FailedScheduling  6m19s (x2 over 11m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
@justinthelaw
Owner

You may be able to open a PR with some of the modifications found here to make it work. I don't have a WSL machine to test these on (just Linux), so let me know if they work and contribute back if you can :)

@qingfengfenga
Contributor Author

qingfengfenga commented Apr 16, 2024

After using k8s-device-plugin:0.15.0-rc.2, K3D on WSL2 can run CUDA workloads normally. Perhaps an image built from the k8s-device-plugin RC version could be provided, which would be helpful for users who cannot run docker build themselves.

NVIDIA/k8s-device-plugin#646
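
For anyone hitting this before a new image is published, one way to try the RC tag without rebuilding anything is to patch the image of the running DaemonSet in place. This is only a sketch: it assumes the DaemonSet name seen in the logs above, the container name nvidia-device-plugin-ctr from the upstream manifest, and that the RC tag is available on nvcr.io.

$ kubectl -n kube-system set image daemonset/nvidia-device-plugin-daemonset \
    nvidia-device-plugin-ctr=nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
$ kubectl -n kube-system rollout status daemonset/nvidia-device-plugin-daemonset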

justinthelaw linked a pull request on Apr 16, 2024 that will close this issue
@justinthelaw
Owner

justinthelaw commented Apr 16, 2024

@qingfengfenga that's great to know! I read through the closed issue you linked and it seems promising. I'll merge this into main.

When it comes to building the image, I'll create a release-candidate image in this repository, but feel free to build it yourself locally or push it to your own GHCR registry. Please let me know if the pushed image or your own locally built one isn't working.

Thank you for looking into this! 😄
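
If you do build and push your own copy, a minimal sketch (assuming a Dockerfile at the repository root; <your-username> is a placeholder, and GHCR_TOKEN is a personal access token with the write:packages scope):

$ docker build -t ghcr.io/<your-username>/k8s-device-plugin:v0.15.0-rc.2 .
$ echo $GHCR_TOKEN | docker login ghcr.io -u <your-username> --password-stdin
$ docker push ghcr.io/<your-username>/k8s-device-plugin:v0.15.0-rc.2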
