Update libnvidia-container and nvidia-container-toolkit #88
Conversation
Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
Last push includes updates for the k8s device plugin, so that we keep all the NVIDIA packages in sync.
```diff
@@ -1,7 +1,7 @@
-%global nvidia_modprobe_version 495.44
+%global nvidia_modprobe_version 550.54.14
```
Is this supposed to align with the driver version? I see that we were on 495, which implies no, but the branches in the GitHub repo imply some level of alignment with driver branches. I can't find anything to indicate if there is versioning we need to be worried about here.
I had a similar question, because I noticed `nvidia-modprobe` 550.54.15 is not the latest 550 release (it's from February). But this is the version defined for building libnvidia-container 1.16.1: https://github.com/NVIDIA/libnvidia-container/blob/v1.16.1/mk/nvidia-modprobe.mk#L9

And for the previously used version, 495.44, I see that libnvidia-container 1.13.3 defines this `nvidia-modprobe` version: https://github.com/NVIDIA/libnvidia-container/blob/v1.13.3/mk/nvidia-modprobe.mk#L9

I'm still not sure about a definitive answer to "Is this supposed to align with the driver version?", but from what I can tell we're just taking this from where `libnvidia-container` builds `nvidia-modprobe`.
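For future bumps, the pinned version can be read straight out of the tag's `mk/nvidia-modprobe.mk`. The snippet below is only a sketch: it parses a sample line copied by hand, and the variable name on that line is an assumption to verify against the real file for the tag in question.

```shell
# Sketch: extract the nvidia-modprobe version that a libnvidia-container
# tag pins. The sample line mimics mk/nvidia-modprobe.mk at v1.16.1
# (hand-copied here; the exact variable name is an assumption).
mk_line='VERSION := 550.54.14'
version="${mk_line##*:= }"   # strip everything up to ':= '
echo "$version"              # the nvidia-modprobe version to use in the spec
```

Against a real checkout, the same parameter expansion can be applied to the matching line grepped out of `mk/nvidia-modprobe.mk`.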
Correct, this `modprobe` version doesn't have to align with any driver version. As you can see, we were on 495 and yet we shipped two different driver versions, far removed from this version. They are not tightly coupled; this is only required to make `libnvidia-container` happy. Nonetheless, I'll test `aws-ecs-1-nvidia`, which uses the older NVIDIA driver.
Confirmed that an instance created with `aws-ecs-1-nvidia`, which uses the older kernel module, does join a cluster and a task is running:
```
bash-5.1# docker exec -it 824b77b3d1c8 nvidia-smi
Thu Aug 15 20:23:01 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02    Driver Version: 470.256.02    CUDA Version: 11.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G        Off   | 00000000:00:1E.0 Off |                    0 |
|  0%   32C    P0    56W / 300W |      0MiB / 22731MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
Description of changes:
This updates `libnvidia-container` and `nvidia-container-toolkit` to their latest versions. With this update, the flags to compile `nvidia-container-toolkit` were changed to allow lazy loading a library, since the upstream maintainers changed how the binaries load `nvml`. This is similar to what we already do in other places that depend on `nvml`, like the NVIDIA device plugin or `ecs-gpu-init`.

Testing done:
Previously, the ECS aarch64 NVIDIA variant didn't work. With the latest version, I confirmed that the variant is functional and tasks run:
Tested that a k8s 1.29 x86_64 node joined a cluster, and ran a pod with 1 GPU:
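The k8s check above can be repeated with a small pod manifest that requests one GPU through the device plugin's `nvidia.com/gpu` resource and runs `nvidia-smi`. This is a sketch, not part of the change itself: the pod name and CUDA image tag are illustrative.

```shell
# Sketch of a GPU smoke test: generate a pod spec that requests one GPU,
# then (on a real cluster) apply it and read nvidia-smi from the logs.
cat <<'EOF' > gpu-smoke.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke                                # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                        # device plugin resource
EOF
# On a cluster: kubectl apply -f gpu-smoke.yaml && kubectl logs -f gpu-smoke
```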
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.