-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: nvidia-persistenced to Nvidia kmod packages #122
feat: nvidia-persistenced to Nvidia kmod packages #122
Conversation
ec03060
to
a38557d
Compare
a38557d
to
5bb984f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good! Thanks for the contribution!
@@ -148,10 +149,15 @@ install -m 755 nvidia-smi %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/%{te | |||
install -m 755 nvidia-debugdump %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470} | |||
install -m 755 nvidia-cuda-mps-control %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470} | |||
install -m 755 nvidia-cuda-mps-server %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470} | |||
install -m 755 nvidia-persistenced %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about installing nvidia-persistenced
in %{_cross_bindir}
, since it is meant to be a system service. The problem with this approach is that if in the future NVIDIA ships other run
archives different than the tesla
archive and we have to include it, there might be two versions of nvidia-persistenced
that would have to be shipped.
I think we can keep it as it is, and going forward, we could do some guessing at runtime based on the driver that was loaded to override the path to nvidia-persistenced
using systemd drop-ins.
[Service] | ||
Type=forking | ||
# Run the NVIDIA System Management Interface to create the device nodes. | ||
ExecStart=__NVIDIA_BINDIR__/nvidia-persistenced |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 5.15 and 6.1, the path to nvidia-persistenced
is more consistent and deterministic (it doesn't include the version). It should be fine to just use the full path here and you will save yourself the extra sed
.
5bb984f
to
49937d9
Compare
Tested the latest round of changes on
|
This older doc says:
So I really would like to bring back the |
I'd recommend including |
49937d9
to
ce6bdf4
Compare
Summary of changes
Testingaws-ecs-1-nvidia
aws-k8s-1.24-nvidia
aws-ecs-2-nvidia
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Some feedback on the systemd unit, but no blockers.
[Install] | ||
RequiredBy=preconfigured.target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this doesn't specifically need to run in an early phase of boot, I'd just put it with the Fabric Manager in multi-user.target
:
[Install] | |
RequiredBy=preconfigured.target | |
[Install] | |
WantedBy=multi-user.target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can run at any time, but we should prefer that it runs earlier rather than later. Downstream customers may not always initialize the GPU device files themselves, so running this unit early ensures that those files are properly set by the time their units begin (as part of multi-user.target
for example).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should run before services like ecs-gpu-init
so that we can rely on this service to create the devices.
@@ -0,0 +1 @@ | |||
u nvidia - "nvidia-persistenced user" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: since we might use it for something else because the username isn't persistenced
specific:
u nvidia - "nvidia-persistenced user" | |
u nvidia - "nvidia user" |
Whenever the NVIDIA device resources are no longer in use, the NVIDIA kernel driver will tear down the device state. `nvidia-persistenced` activates persistence mode, which keeps the device files open which prevents the kernel from removing the device state. This is desirable in applications that may suffer performance hits due to repeated device initialization. The NVIDIA device drivers ship with templates for running `nvidia-persistenced` as a systemd unit. This change uses that template. The `nvidia-persistenced` documentation advises that while the systemd unit can run as root, the unit should provide a non-root user for `nvidia-persistenced` to run under. See the documentation included with the NVIDIA driver for more information about `nvidia-persistenced`.
ce6bdf4
to
bd5e101
Compare
Re-built with the
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Now
|
Closes bottlerocket-os/bottlerocket#3960
Description of changes:
Whenever the NVIDIA device resources are no longer in use, the NVIDIA
kernel driver will tear down the device state.
nvidia-persistenced
activates persistence mode, which keeps the device files open preventing
the kernel from removing the device state [1]. This is desirable in
applications that may suffer performance hits due to repeated device
initialization.
The NVIDIA device drivers ship with templates for running
nvidia-persistenced
as a systemd unit [2]. This change uses that template.The
nvidia-persistenced
documentation advises that while the systemdunit can run as root, the unit should provide a non-root user under which
nvidia-persistenced
will run. This change adds a non-root user fornvidia-persistenced
.See the documentation included with the NVIDIA driver for more
information about
nvidia-persistenced
[3].Testing Done:
ECS: The unit sets the devices to persistence mode, and we can see that the
device is set correctly inside a task container.
k8s: Node successfully joined cluster and persistence mode is enabled inside test pod.
aws-ecs-1-nvidia
aws-ecs-2-nvidia
aws-k8s-1.23-nvidia
aws-k8s-1.26-nvidia
aws-k8s-1.30-nvidia
Terms of contribution:
By submitting this pull request, I agree that this contribution is
dual-licensed under the terms of both the Apache License, version 2.0,
and the MIT license.