Update libnvidia-container and nvidia-container-toolkit #88

Merged: arnaldo2792 merged 3 commits into bottlerocket-os:develop from update-nvidias on Aug 15, 2024

Conversation

arnaldo2792 (Contributor) commented Aug 13, 2024

Description of changes:

This updates libnvidia-container and nvidia-container-toolkit to their latest versions. With this update, the flags used to compile nvidia-container-toolkit were changed to allow lazy loading of a library, since the upstream maintainers changed how the binaries load NVML. This is similar to what we already do in other places that depend on NVML, such as the NVIDIA device plugin or ecs-gpu-init.
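
As a rough check (not something from this PR), one can inspect whether libnvidia-ml.so.1 shows up as a hard DT_NEEDED dependency of a built binary; a library that is opened lazily at runtime via dlopen will not appear there. The binary names below are real toolkit binaries, but the paths are assumptions about where they land on the image:

# Assumed install paths; adjust to wherever the toolkit binaries actually live.
# If NVML is loaded lazily at runtime, libnvidia-ml.so.1 should not be listed as NEEDED.
readelf -d /usr/bin/nvidia-ctk | grep NEEDED
readelf -d /usr/bin/nvidia-container-cli | grep NEEDED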

Testing done:

Previously, the ECS aarch64 NVIDIA variant didn't work. With the latest version, I confirmed that the variant is functional and tasks run (a sketch of a comparable GPU task definition follows the session below):

bash-5.1# apiclient get os
{
  "os": {
    "arch": "aarch64",
    "build_id": "b12d2708-dirty",
    "pretty_name": "Bottlerocket OS 1.21.0 (aws-ecs-2-nvidia)",
    "variant_id": "aws-ecs-2-nvidia",
    "version_id": "1.21.0"
  }
}
bash-5.1# docker ps
CONTAINER ID   IMAGE     COMMAND            CREATED          STATUS          PORTS     NAMES
f5f232f87884   fedora    "sleep infinity"   57 minutes ago   Up 57 minutes             ecs-nvidia-5-nvidia-a2e5afed989db3fd1500
bash-5.1# docker exec -it f5f232f87884 nvidia-smi
Tue Aug 13 22:28:20 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA T4G                     Off | 00000000:00:1F.0 Off |                    0 |
| N/A   59C    P0              27W /  70W |      2MiB / 15360MiB |      8%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
bash-5.1#
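
For reference (not taken from this PR), the GPU request in an ECS task definition is what drives this path; the family name, container name, memory value, and file name below are placeholders, while the image and command match the task shown above:

# Hypothetical task definition; the "resourceRequirements" entry asks ECS for one GPU,
# which is what makes the agent hand the container to the NVIDIA runtime on the instance.
cat > gpu-task.json <<'EOF'
{
  "family": "gpu-smoke-test",
  "containerDefinitions": [
    {
      "name": "sleeper",
      "image": "fedora",
      "command": ["sleep", "infinity"],
      "memory": 512,
      "resourceRequirements": [
        { "type": "GPU", "value": "1" }
      ]
    }
  ]
}
EOF
aws ecs register-task-definition --cli-input-json file://gpu-task.json

Running that task family with aws ecs run-task against the cluster the instance joined is roughly how a container like the one above ends up on the host.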

Tested that a k8s 1.29 x86_64 node joined a cluster and ran a pod with 1 GPU (a sketch of the kind of pod spec used follows the output below):

develop on  develop [$!] via 🦀 v1.79.0 on Fedora ❯ k exec gpu-tests-b9jfj -it -- nvidia-smi
Wed Aug 14 22:59:14 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P8              11W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
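
For reference (again, not from the PR), a minimal pod of this kind only needs to request the nvidia.com/gpu resource advertised by the NVIDIA device plugin; the pod name and image tag below are placeholders:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1   # resource advertised by the NVIDIA device plugin
EOF
kubectl exec -it gpu-test -- nvidia-smi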

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>

arnaldo2792 changed the title from "Update nvidias" to "Update libnvidia-container and nvidia-container-toolkit" on Aug 13, 2024
arnaldo2792 (Contributor, Author) commented:

Last push includes updates for the k8s device plugin, so that we keep all the NVIDIA packages in sync.

@@ -1,7 +1,7 @@
-%global nvidia_modprobe_version 495.44
+%global nvidia_modprobe_version 550.54.14
Contributor:

Is this supposed to align with the driver version? I see that we were on 495, which implies no, but the branches in the GitHub repo imply some level of alignment with driver branches. I can't find anything to indicate whether there is versioning we need to be worried about here.

Contributor:

I had a similar question, because I noticed nvidia-modprobe 550.54.14 is not the latest 550 release (it's from February). But this is the version defined for building libnvidia-container 1.16.1: https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/mk/nvidia-modprobe.mk#L9

And for the previously used version, 495.44: libnvidia-container 1.13.3 defines that nvidia-modprobe version here:
https://github.com/NVIDIA/libnvidia-container/blob/v1.13.3/mk/nvidia-modprobe.mk#L9

I'm still not sure about a definitive answer to "Is this supposed to align with the driver version?", but from what I can tell we're just taking this from where libnvidia-container builds nvidia-modprobe.
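
For reference (not part of the original thread), the pinned version for a given tag can be confirmed straight from the Makefile fragments linked above; grepping avoids assuming the exact variable name:

curl -s https://raw.githubusercontent.com/NVIDIA/libnvidia-container/v1.16.1/mk/nvidia-modprobe.mk | grep -i version
curl -s https://raw.githubusercontent.com/NVIDIA/libnvidia-container/v1.13.3/mk/nvidia-modprobe.mk | grep -i version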

arnaldo2792 (Contributor, Author):

Correct, this modprobe version doesn't have to align with any driver version. As you can see, we were on 495 and yet we shipped two different driver versions, both far removed from this version. They are not tightly coupled; this is only required to make libnvidia-container happy. Nonetheless, I'll test aws-ecs-1-nvidia, which uses the older NVIDIA driver.

arnaldo2792 (Contributor, Author):

Confirmed that an instance created with aws-ecs-1-nvidia, which uses the older kernel module, joins a cluster and runs a task:

bash-5.1# docker exec -it 824b77b3d1c8 nvidia-smi
Thu Aug 15 20:23:01 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02   Driver Version: 470.256.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         Off  | 00000000:00:1E.0 Off |                    0 |
|  0%   32C    P0    56W / 300W |      0MiB / 22731MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

arnaldo2792 merged commit 3852578 into bottlerocket-os:develop on Aug 15, 2024 (2 checks passed).
arnaldo2792 deleted the update-nvidias branch on September 5, 2024.