Add NVIDIA APIs #4182

Merged: 4 commits, Sep 12, 2024

Contributor

@arnaldo2792 arnaldo2792 commented Sep 9, 2024

Issue number:

N / A

Description of changes:

This adds two new settings APIs to all the NVIDIA variants:

  • settings.nvidia-container-runtime: used to configure the NVIDIA container toolkit behavior
  • settings.kubelet-device-plugin.nvidia: used to expose configurations for the NVIDIA k8s device plugin
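As a rough illustration of how such settings might be supplied at launch, the sketch below writes a Bottlerocket user-data fragment in TOML. It uses only the `kubelet-device-plugin.nvidia` keys and values that appear in the `apiclient get` output in this description. The keys under `settings.nvidia-container-runtime` are not shown here, so they are left out rather than guessed. `USERDATA_FILE` is a hypothetical path used only for this example.

```shell
#!/bin/sh
# Sketch: write a user-data fragment for the NVIDIA device plugin settings.
# USERDATA_FILE is a hypothetical path chosen for this example.
USERDATA_FILE="${USERDATA_FILE:-$(mktemp)}"

cat > "$USERDATA_FILE" <<'EOF'
# Values mirror what `apiclient get settings.kubelet-device-plugin` reports in this PR.
[settings.kubelet-device-plugin.nvidia]
device-id-strategy = "index"
device-list-strategy = "volume-mounts"
pass-device-specs = true
EOF

echo "wrote $USERDATA_FILE"
```

Writing the fragment to a file keeps it easy to review before it is passed to whatever tooling launches the instance.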

Testing done:

As part of bottlerocket-os/bottlerocket-settings-sdk#60 and bottlerocket-os/bottlerocket-core-kit#132

  1. Instance joined the cluster:
NAME                                           STATUS   ROLES    AGE    VERSION
ip-192-168-41-204.us-west-2.compute.internal   Ready    <none>   22s    v1.30.1-eks-e564799
  1. The safe defaults were used, which prevented the containers from accessing all the GPUs. NVIDIA_VISIBLE_DEVICES=all is set in the container, but no NVIDIA mounts appear in it:
└─> ❯ k exec safe-defaults-d7g2j -- env | rg NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES=all
└─> ❯ k exec safe-defaults-d7g2j -- cat /proc/self/mountinfo | rg nvidia
┌───────────────────> ~/Code/work/by-feature/nvidia-settings-api/testing on Fedora
└─> ❯
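The check above combines two observations: the environment variable is present, yet nothing NVIDIA-related is mounted. A minimal sketch of that reasoning, assuming the two outputs have been saved to files, could look like the following. The function name and the three labels it prints are made up for this example, and treating "no `nvidia` line in mountinfo" as "no GPU access" is the same heuristic used in the test, not a guarantee.

```shell
#!/bin/sh
# Sketch: classify a container's GPU exposure from two captured outputs.
#   $1: file holding the container's `env` output
#   $2: file holding the container's /proc/self/mountinfo
# Prints one of: not-requested, blocked, exposed (labels are hypothetical).
gpu_exposure() {
    if ! grep -q '^NVIDIA_VISIBLE_DEVICES=' "$1"; then
        # The variable is absent, so the container never asked for GPUs.
        echo "not-requested"
    elif grep -q 'nvidia' "$2"; then
        # The variable is set and NVIDIA files or devices are mounted.
        echo "exposed"
    else
        # The variable is set but nothing was mounted: the request was ignored.
        echo "blocked"
    fi
}
```

With the outputs shown in this step, the function would print `blocked`.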
  1. Files were generated using the new APIs
bash-5.1# cat /etc/systemd/system/nvidia-k8s-device-plugin.service.d/exec-start.conf
[Service]
ExecStart=
ExecStart=/usr/bin/nvidia-device-plugin --config-file=/etc/nvidia-k8s-device-plugin/settings.yaml
bash-5.1# cat /etc/nvidia-container-runtime/config.toml
### generated from the template file ###
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

[nvidia-container-cli]
root = "/"
path = "/usr/bin/nvidia-container-cli"
environment = []
ldconfig = "@/sbin/ldconfig"
bash-5.1#
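To check a generated `config.toml` without reading it by eye, a line-oriented helper is enough for the two flat boolean keys shown above. This is only a sketch: it is not a TOML parser, the function names are hypothetical, and "safe defaults" here means nothing more than the two values visible in the generated file in this PR.

```shell
#!/bin/sh
# Sketch: print the value of a flat `key = value` line in a TOML file.
# This matches lines by prefix and ignores tables, quoting, and comments.
#   $1: path to config.toml   $2: key name
toml_flat_value() {
    sed -n "s/^$2[[:space:]]*=[[:space:]]*//p" "$1" | head -n 1
}

# Returns 0 when the file carries the two values shown in this PR's output.
has_safe_defaults() {
    [ "$(toml_flat_value "$1" accept-nvidia-visible-devices-as-volume-mounts)" = "true" ] &&
    [ "$(toml_flat_value "$1" accept-nvidia-visible-devices-envvar-when-unprivileged)" = "false" ]
}
```

On a host, this might be used as `has_safe_defaults /etc/nvidia-container-runtime/config.toml && echo ok`.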
  1. The device plugin was restarted after the values were changed to allow containers to access all GPUs:
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugin": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "pass-device-specs": true
      }
    }
  }
}
bash-5.1# apiclient set kubelet-device-plugin.nvidia.device-list-strategy=envvar
bash-5.1# systemctl status nvidia-k8s-device-plugin.service
● nvidia-k8s-device-plugin.service - Start NVIDIA kubernetes device plugin
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-k8s-device-plugin.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/nvidia-k8s-device-plugin.service.d
             └─exec-start.conf
     Active: active (running) since Mon 2024-09-09 15:28:10 UTC; 9s ago
   Main PID: 11094 (nvidia-device-p)
      Tasks: 9 (limit: 38028)
     Memory: 17.9M
        CPU: 9.102s
     CGroup: /system.slice/nvidia-k8s-device-plugin.service
             └─11094 /usr/bin/nvidia-device-plugin --config-file=/etc/nvidia-k8s-device-plugin/settings.yaml

Sep 09 15:28:10 ip-192-168-41-204.us-west-2.compute.internal nvidia-device-plugin[11094]:     ]
Sep 09 15:28:10 ip-192-168-41-204.us-west-2.compute.internal nvidia-device-plugin[11094]:   },
Sep 09 15:28:10 ip-192-168-41-204.us-west-2.compute.internal nvidia-device-plugin[11094]:   "sharing": {
Sep 09 15:28:10 ip-192-168-41-204.us-west-2.compute.internal nvidia-device-plugin[11094]:     "timeSlicing": {}
Sep 09 15:28:10 ip-192-168-41-204.us-west-2.compute.internal nvidia-device-plugin[11094]:   }
Sep 09 15:28:10 ip-192-168-41-204.us-west-2.compute.internal nvidia-device-plugin[11094]: }
Sep 09 15:28:10 ip-192-168-41-204.us-west-2.compute.internal nvidia-device-plugin[11094]: I0909 15:28:10.178566   11094 main.go:317] Retrieving plugins.
Sep 09 15:28:17 ip-192-168-41-204.us-west-2.compute.internal nvidia-device-plugin[11094]: I0909 15:28:17.416865   11094 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
Sep 09 15:28:17 ip-192-168-41-204.us-west-2.compute.internal nvidia-device-plugin[11094]: I0909 15:28:17.417380   11094 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Sep 09 15:28:17 ip-192-168-41-204.us-west-2.compute.internal nvidia-device-plugin[11094]: I0909 15:28:17.419653   11094 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
bash-5.1#
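The change in this step can be wrapped in a small guard so that a typo in the strategy name is caught before it reaches the API. The sketch below is an assumption-based example: it accepts only the two values that appear in this PR (`envvar` and `volume-mounts`), which may not be the full set the setting allows, and the `APICLIENT` override is a hypothetical hook added here so the function can be exercised without a Bottlerocket host.

```shell
#!/bin/sh
# Sketch: set the device list strategy through apiclient, with a basic guard.
# APICLIENT is a hypothetical override for dry runs; it defaults to the real
# `apiclient` binary.
set_device_list_strategy() {
    case "$1" in
        envvar|volume-mounts) ;;  # only the two values seen in this PR
        *)
            echo "unsupported strategy: $1" >&2
            return 2
            ;;
    esac
    "${APICLIENT:-apiclient}" set "kubelet-device-plugin.nvidia.device-list-strategy=$1"
}
```

After a successful call, `systemctl is-active nvidia-k8s-device-plugin.service` gives a quick confirmation that the unit came back up, matching the `systemctl status` output above.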
  1. After the configurations were updated, a newly created container has access to all the GPUs without requesting any:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: safe-defaults
spec:
  selector:
    matchLabels:
      name: safe-defaults
  template:
    metadata:
      labels:
        name: safe-defaults
    spec:
      # No GPUs requested
      containers:
        - name: safe-defaults
          image: nvidia/cuda:12.4.1-cudnn-devel-rockylinux8
          command: ['sh', '-c', 'sleep infinity']
└─> ❯ k exec safe-defaults-cnsqn -- nvidia-smi
Mon Sep  9 15:31:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   25C    P8               8W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
┌───────────────────> ~/Code/work/by-feature/nvidia-settings-api/testing on Fedora
└─> ❯
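For a scripted version of this check, the one-line-per-GPU listing from `nvidia-smi -L` is easier to count than the table above. The helper below is a sketch with a hypothetical name; it assumes each GPU is listed on a line beginning with `GPU <index>:`.

```shell
#!/bin/sh
# Sketch: count GPUs from `nvidia-smi -L` output read on stdin.
# Each GPU is expected on a line that starts with "GPU <index>:".
count_gpus() {
    # grep -c prints 0 and exits non-zero when nothing matches, so mask the status.
    grep -c '^GPU [0-9][0-9]*:' || true
}
```

It could be fed from the pod used in this step, for example `kubectl exec safe-defaults-cnsqn -- nvidia-smi -L | count_gpus`.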
  1. On k8s 1.30, 1.29, 1.28, and 1.27, I performed upgrade/downgrade testing. I confirmed that the new APIs were available after the upgrade and that the values were removed on a downgrade. I also repeated the testing described above to make sure that the defaults are set and that containers with NVIDIA_VISIBLE_DEVICES=all can access all GPUs when the configurations that allow this are set.
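The upgrade/downgrade expectation can be written down as a small check over saved `apiclient get settings.kubelet-device-plugin` output. This is a sketch under assumptions: the function name is hypothetical, it looks for the `"nvidia"` key with a plain text match rather than a JSON parser, and it assumes the key is simply absent from the output after a downgrade.

```shell
#!/bin/sh
# Sketch: compare saved settings output against the expected migration result.
#   $1: phase, either "upgrade" or "downgrade"
#   $2: file holding the JSON printed by `apiclient get settings.kubelet-device-plugin`
check_migration() {
    if grep -q '"nvidia"' "$2"; then
        present=yes
    else
        present=no
    fi
    case "$1:$present" in
        upgrade:yes|downgrade:no)
            echo "ok"
            ;;
        upgrade:no|downgrade:yes)
            echo "unexpected"
            return 1
            ;;
        *)
            echo "unknown phase: $1" >&2
            return 2
            ;;
    esac
}
```

Running it once per Kubernetes variant after each upgrade and downgrade would give a pass/fail line for every combination listed above.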

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@arnaldo2792 arnaldo2792 force-pushed the nvidia-api-kit branch 2 times, most recently from cc357d4 to d8b052f Compare September 10, 2024 23:12
@arnaldo2792
Contributor Author

(forced push includes rebase)

@arnaldo2792
Contributor Author

Forced push includes:

  • Use the workspace migration helpers
  • Add bootstrap commands API to NVIDIA settings plugins
  • Minor stylistic edits to migrations

@arnaldo2792
Contributor Author

(rebased onto develop and dropped the hack commit)

Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
Add settings defaults for nvidia-container-runtime and kubelet-device-plugin

Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
@arnaldo2792
Contributor Author

Forced push fixes the settings plugin definitions for aws-k8s-1.31

Contributor

@ginglis13 ginglis13 left a comment

just some small nits

@@ -0,0 +1,19 @@
[package]
name = "settings-plugin-aws-k8s-nvidia"
version = "0.1.0"
Contributor

should we add authors here?

[package]
name = "nvidia-container-runtime-settings"
version = "0.1.0"
authors = ["Monirul Islam <monirulu@amazon.com>"]
Contributor

slight nit, should @arnaldo2792 be a listed author as well? (and for other crates added in this PR)

Contributor Author

I wanted to preserve the original author's ownership of the work, since the commits will carry my name. My changes to that code were minimal 👍

Contributor

@piyush-jena piyush-jena left a comment

LGTM!

Contributor

@yeazelm yeazelm left a comment

LGTM!

@arnaldo2792 arnaldo2792 merged commit 26ec2cc into bottlerocket-os:develop Sep 12, 2024
32 checks passed
@yeazelm yeazelm mentioned this pull request Sep 12, 2024
3 tasks
4 participants