Issues under WSL2 #1

Closed
qingfengfenga opened this issue Apr 14, 2024 · 3 comments · Fixed by #2

Comments

@qingfengfenga
Contributor

Issues

Thank you very much for your work. I have attempted to run K3s and CUDA workloads under WSL2. Based on this issue and the files provided in your repository, I have run some tests and it seems almost successful.

The current problem is that the nvidia-device-plugin pod can execute nvidia-smi, but its logs report that the graphics card cannot be recognized.

I suspect it may be due to nvcc being unavailable. Do you have any ideas?

System: Windows 11 23H2
Runtime: Docker Desktop 4.28.0 (139021)
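
Before involving Kubernetes at all, it can be worth confirming that Docker Desktop's WSL2 backend passes the GPU through to a plain container (the CUDA image tag below is just an example; substitute any base tag available on Docker Hub):

$ docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
# If this prints the same table as nvidia-smi on the host, the Docker/WSL2
# GPU passthrough itself is working and the problem is in the k8s layer.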

nvidia-device-plugin log

$ kubectl logs nvidia-device-plugin-daemonset-vvpkz -n kube-system
I0414 10:00:33.522494       1 main.go:154] Starting FS watcher.
I0414 10:00:33.522555       1 main.go:161] Starting OS watcher.
I0414 10:00:33.522912       1 main.go:176] Starting Plugins.
I0414 10:00:33.522931       1 main.go:234] Loading configuration.
I0414 10:00:33.522979       1 main.go:242] Updating config with default resource matching patterns.
I0414 10:00:33.523113       1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0414 10:00:33.523131       1 main.go:256] Retreiving plugins.
I0414 10:00:33.524465       1 factory.go:107] Detected NVML platform: found NVML library
I0414 10:00:33.524495       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0414 10:00:33.541706       1 main.go:287] No devices found. Waiting indefinitely.
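
Since the plugin reports "No devices found", the node ends up advertising zero nvidia.com/gpu, which matches the FailedScheduling events on the cuda-vector-add pod further down. One way to confirm what the scheduler sees (a rough check, assuming a single-node cluster):

$ kubectl describe node | grep -i "nvidia.com/gpu"
# When the plugin has registered the GPU, Capacity/Allocatable show e.g.:
#   nvidia.com/gpu:  1
# With "No devices found", the resource is absent or reports 0.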

Running nvidia-smi / nvcc inside the nvidia-device-plugin pod

root@nvidia-device-plugin-daemonset-t68w2:/# nvidia-smi
Sun Apr 14 10:25:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.73.01              Driver Version: 552.12         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     On  |   00000000:01:00.0 Off |                  N/A |
| 31%   28C    P8             16W /  250W |    1515MiB /  22528MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@nvidia-device-plugin-daemonset-t68w2:/#
root@nvidia-device-plugin-daemonset-t68w2:/# nvcc -V
bash: nvcc: command not found
root@nvidia-device-plugin-daemonset-t68w2:/#
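
Note that nvcc ships with the CUDA toolkit rather than with the driver, so its absence from the device-plugin image is expected and should not by itself prevent GPU detection; the plugin relies on NVML, which it already found. On WSL2 the driver user-space libraries are normally exposed under /usr/lib/wsl/lib and the GPU appears as /dev/dxg, so a rough sanity check inside the pod might look like the following (these paths are WSL-specific assumptions and may not be mapped into the container, depending on how the NVIDIA container runtime is configured; adjust the pod name):

$ kubectl exec -n kube-system nvidia-device-plugin-daemonset-t68w2 -- ls /usr/lib/wsl/lib
$ kubectl exec -n kube-system nvidia-device-plugin-daemonset-t68w2 -- ls -l /dev/dxg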

cuda-vector-add pod describe

$ kubectl describe pod cuda-vector-add
Name:             cuda-vector-add
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Containers:
  cuda-vector-add:
    Image:      tingweiwu/cuda-vector-add:v0.1
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fljnc (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-fljnc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  16m                  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Warning  FailedScheduling  6m19s (x2 over 11m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
@justinthelaw
Owner

You may be able to open a PR with some of the modifications found here to make it work. I don't have a WSL machine to test these on (just Linux), so let me know if they work and contribute back if you can :)

@qingfengfenga
Contributor Author

qingfengfenga commented Apr 16, 2024

After using k8s-device-plugin:0.15.0-rc.2, K3D on WSL2 can run CUDA workloads normally. Perhaps an image built from the k8s-device-plugin RC version could be provided, which would be helpful for users who cannot run docker build themselves.

NVIDIA/k8s-device-plugin#646
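
For anyone hitting this before a new image is published, one way to try the RC tag without rebuilding anything is to patch the image of the running DaemonSet in place. This is only a sketch: it assumes the DaemonSet name seen in the logs above, the container name nvidia-device-plugin-ctr from the upstream manifest, and that the RC tag is available on nvcr.io.

$ kubectl -n kube-system set image daemonset/nvidia-device-plugin-daemonset \
    nvidia-device-plugin-ctr=nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2
$ kubectl -n kube-system rollout status daemonset/nvidia-device-plugin-daemonset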

justinthelaw linked a pull request on Apr 16, 2024 that will close this issue
@justinthelaw
Owner

justinthelaw commented Apr 16, 2024

@qingfengfenga that's great to know! I read through the closed issue you linked and it seems promising. I'll merge this into main.

When it comes to building the image, I'll create a release-candidate image in this repository, but feel free to build it yourself locally or push it to your own GHCR registry. Please let me know if the pushed image or your own locally built one isn't working.

Thank you for looking into this! 😄
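
If you do build and push your own copy, a minimal sketch (assuming a Dockerfile at the repository root; <your-username> is a placeholder, and GHCR_TOKEN is a personal access token with the write:packages scope):

$ docker build -t ghcr.io/<your-username>/k8s-device-plugin:v0.15.0-rc.2 .
$ echo $GHCR_TOKEN | docker login ghcr.io -u <your-username> --password-stdin
$ docker push ghcr.io/<your-username>/k8s-device-plugin:v0.15.0-rc.2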
