feat: nvidia-persistenced to Nvidia kmod packages #122

Merged on Oct 8, 2024 (7 commits)

5 changes: 5 additions & 0 deletions packages/ecs-gpu-init/ecs-gpu-init.service
@@ -4,6 +4,11 @@ Description=Initialize ECS GPU config
# otherwise the userspace component of the driver will fail to
# query the /dev devices
After=load-tesla-kernel-modules.service load-open-gpu-kernel-modules.service
# Running this unit after nvidia persistenced ensures that
# the /dev devices are created and the hardware set to
# persistence mode.
Requires=nvidia-persistenced.service
After=nvidia-persistenced.service
# Block manual interactions with this service. It doesn't
# make sense to regenerate the GPU config file if the ECS
# agent won't read it when it changes
21 changes: 21 additions & 0 deletions packages/kmod-5.10-nvidia/kmod-5.10-nvidia.spec
@@ -19,6 +19,8 @@ Source2: NVidiaEULAforAWS.pdf
# Common NVIDIA conf files from 200 to 299
Source200: nvidia-tmpfiles.conf.in
Source202: nvidia-dependencies-modules-load.conf
Source203: nvidia-sysusers.conf
Source204: nvidia-persistenced.service.in

# NVIDIA tesla conf files from 300 to 399
Source300: nvidia-tesla-tmpfiles.conf.in
@@ -86,6 +88,7 @@ install -d %{buildroot}%{_cross_libexecdir}
install -d %{buildroot}%{_cross_libdir}
install -d %{buildroot}%{_cross_tmpfilesdir}
install -d %{buildroot}%{_cross_unitdir}
install -d %{buildroot}%{_cross_bindir}
install -d %{buildroot}%{_cross_factorydir}%{_cross_sysconfdir}/{drivers,ld.so.conf.d}

KERNEL_VERSION=$(cat %{kernel_sources}/include/config/kernel.release)
@@ -105,6 +108,7 @@ install -d %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470}
install -d %{buildroot}%{tesla_470_libdir}
install -d %{buildroot}%{_cross_datadir}/nvidia/tesla/%{tesla_470}/module-objects.d
install -d %{buildroot}%{_cross_factorydir}/nvidia/tesla/%{tesla_470}
install -d %{buildroot}%{_cross_sysusersdir}

sed -e 's|__NVIDIA_VERSION__|%{tesla_470}|' %{S:300} > nvidia-tesla-%{tesla_470}.conf
install -m 0644 nvidia-tesla-%{tesla_470}.conf %{buildroot}%{_cross_tmpfilesdir}/
@@ -158,10 +162,19 @@ install -m 755 nvidia-smi %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/%{te
install -m 755 nvidia-debugdump %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470}
install -m 755 nvidia-cuda-mps-control %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470}
install -m 755 nvidia-cuda-mps-server %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470}
install -m 755 nvidia-persistenced %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470}
Contributor

I thought about installing nvidia-persistenced in %{_cross_bindir}, since it is meant to be a system service. The problem with that approach is that if NVIDIA later ships run archives other than the tesla archive and we have to include them, we might end up shipping two versions of nvidia-persistenced.

I think we can keep it as it is. Going forward, we could detect at runtime which driver was loaded and override the path to nvidia-persistenced with a systemd drop-in.
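
A rough sketch of what such a runtime override might look like, in case it helps the discussion. Everything here is an assumption rather than part of this PR: the function name `write_persistenced_dropin`, the drop-in file name `10-bindir.conf`, and the idea of reading the loaded driver version from a file such as `/sys/module/nvidia/version`. It follows the versioned `bin/<version>` layout that the 5.10 spec uses.

```shell
# Hypothetical helper, not part of this PR.
# write_persistenced_dropin VERSION_FILE BIN_ROOT DROPIN_DIR
#   VERSION_FILE  file holding the loaded driver version
#                 (assumed: /sys/module/nvidia/version)
#   BIN_ROOT      directory containing one sub-directory per driver version
#   DROPIN_DIR    the unit's drop-in directory, for example
#                 /run/systemd/system/nvidia-persistenced.service.d
write_persistenced_dropin() {
    version_file=$1
    bin_root=$2
    dropin_dir=$3

    # Nothing to override if no driver version can be read.
    [ -r "$version_file" ] || return 1
    version=$(cat "$version_file")
    [ -n "$version" ] || return 1

    mkdir -p "$dropin_dir" || return 1

    # The empty ExecStart= clears the command from the packaged unit
    # before the replacement is set.
    cat > "$dropin_dir/10-bindir.conf" <<EOF
[Service]
ExecStart=
ExecStart=$bin_root/$version/nvidia-persistenced --user nvidia --verbose
EOF
}
```

An early-boot helper could call it as `write_persistenced_dropin /sys/module/nvidia/version /usr/libexec/nvidia/tesla/bin /run/systemd/system/nvidia-persistenced.service.d`. If systemd has already loaded its units by then, a `systemctl daemon-reload` would be needed before nvidia-persistenced.service starts.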

install -m 4755 nvidia-modprobe %{buildroot}%{_cross_bindir}
%if "%{_cross_arch}" == "x86_64"
install -m 755 nvidia-ngx-updater %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470}
%endif

# Users
install -m 0644 %{S:203} %{buildroot}%{_cross_sysusersdir}/nvidia.conf

# Systemd units
sed -e 's|__NVIDIA_BINDIR__|%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470}|' %{S:204} > nvidia-persistenced.service
install -m 0644 nvidia-persistenced.service %{buildroot}%{_cross_unitdir}

# We install all the libraries, and filter them out in the 'files' section, so we can catch
# when new libraries are added
install -m 755 *.so* %{buildroot}/%{tesla_470_libdir}/
@@ -206,6 +219,8 @@ popd
# Binaries
%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470}/nvidia-debugdump
%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470}/nvidia-smi
%{_cross_libexecdir}/nvidia/tesla/bin/%{tesla_470}/nvidia-persistenced
%{_cross_bindir}/nvidia-modprobe

# Configuration files
%{_cross_factorydir}%{_cross_sysconfdir}/drivers/nvidia-tesla-%{tesla_470}.toml
@@ -229,6 +244,12 @@ popd
# tmpfiles
%{_cross_tmpfilesdir}/nvidia-tesla-%{tesla_470}.conf

# sysuser files
%{_cross_sysusersdir}/nvidia.conf

# systemd units
%{_cross_unitdir}/nvidia-persistenced.service

# We only install the libraries required by all the DRIVER_CAPABILITIES, described here:
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#driver-capabilities

10 changes: 10 additions & 0 deletions packages/kmod-5.10-nvidia/nvidia-persistenced.service.in
@@ -0,0 +1,10 @@
[Unit]
Description=NVIDIA Persistence Daemon
After=load-tesla-kernel-modules.service load-open-gpu-kernel-modules.service

[Service]
Type=forking
ExecStart=__NVIDIA_BINDIR__/nvidia-persistenced --user nvidia --verbose

[Install]
RequiredBy=preconfigured.target
1 change: 1 addition & 0 deletions packages/kmod-5.10-nvidia/nvidia-sysusers.conf
@@ -0,0 +1 @@
u nvidia - "nvidia-persistenced user"
1 change: 1 addition & 0 deletions packages/kmod-5.10-nvidia/nvidia-tmpfiles.conf.in
@@ -1,2 +1,3 @@
R __PREFIX__/lib/modules/__KERNEL_VERSION__/kernel/drivers/extra/video/nvidia/tesla - - - - -
d __PREFIX__/lib/modules/__KERNEL_VERSION__/kernel/drivers/extra/video/nvidia/tesla 0755 root root - -
D /var/run/nvidia-persistenced 0755 nvidia nvidia - -
20 changes: 20 additions & 0 deletions packages/kmod-5.15-nvidia/kmod-5.15-nvidia.spec
@@ -40,6 +40,8 @@ Source200: nvidia-tmpfiles.conf.in
Source202: nvidia-dependencies-modules-load.conf
Source203: nvidia-fabricmanager.service
Source204: nvidia-fabricmanager.cfg
Source205: nvidia-sysusers.conf
Source206: nvidia-persistenced.service

# NVIDIA tesla conf files from 300 to 399
Source300: nvidia-tesla-tmpfiles.conf
@@ -173,6 +175,8 @@ install -d %{buildroot}%{_cross_libdir}
install -d %{buildroot}%{_cross_tmpfilesdir}
install -d %{buildroot}%{_cross_unitdir}
install -d %{buildroot}%{_cross_factorydir}%{_cross_sysconfdir}/{drivers,ld.so.conf.d}
install -d %{buildroot}%{_cross_sysusersdir}
install -d %{buildroot}%{_cross_bindir}

KERNEL_VERSION=$(cat %{kernel_sources}/include/config/kernel.release)
sed \
@@ -279,10 +283,18 @@ install -m 755 nvidia-smi %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin
install -m 755 nvidia-debugdump %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin
install -m 755 nvidia-cuda-mps-control %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin
install -m 755 nvidia-cuda-mps-server %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin
install -m 755 nvidia-persistenced %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/
install -m 4755 nvidia-modprobe %{buildroot}%{_cross_bindir}
%if "%{_cross_arch}" == "x86_64"
install -m 755 nvidia-ngx-updater %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin
%endif

# Users
install -m 0644 %{S:205} %{buildroot}%{_cross_sysusersdir}/nvidia.conf

# Systemd units
install -m 0644 %{S:206} %{buildroot}%{_cross_unitdir}

# We install all the libraries, and filter them out in the 'files' section, so we can catch
# when new libraries are added
install -m 755 *.so* %{buildroot}/%{_cross_libdir}/nvidia/tesla/
@@ -353,6 +365,8 @@ popd
%{_cross_libexecdir}/nvidia/tesla/bin/nvidia-smi
%{_cross_libexecdir}/nvidia/tesla/bin/nv-fabricmanager
%{_cross_libexecdir}/nvidia/tesla/bin/nvswitch-audit
%{_cross_libexecdir}/nvidia/tesla/bin/nvidia-persistenced
%{_cross_bindir}/nvidia-modprobe

# nvswitch topologies
%dir %{_cross_datadir}/nvidia/tesla/nvswitch
@@ -386,6 +400,12 @@ popd
# tmpfiles
%{_cross_tmpfilesdir}/nvidia-tesla.conf

# sysuser files
%{_cross_sysusersdir}/nvidia.conf

# systemd units
%{_cross_unitdir}/nvidia-persistenced.service

# We only install the libraries required by all the DRIVER_CAPABILITIES, described here:
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#driver-capabilities

10 changes: 10 additions & 0 deletions packages/kmod-5.15-nvidia/nvidia-persistenced.service
@@ -0,0 +1,10 @@
[Unit]
Description=NVIDIA Persistence Daemon
After=load-tesla-kernel-modules.service load-open-gpu-kernel-modules.service

[Service]
Type=forking
ExecStart=/usr/libexec/nvidia/tesla/bin/nvidia-persistenced --user nvidia --verbose

[Install]
RequiredBy=preconfigured.target
1 change: 1 addition & 0 deletions packages/kmod-5.15-nvidia/nvidia-sysusers.conf
@@ -0,0 +1 @@
u nvidia - "nvidia-persistenced user"
1 change: 1 addition & 0 deletions packages/kmod-5.15-nvidia/nvidia-tmpfiles.conf.in
@@ -4,3 +4,4 @@ R __PREFIX__/lib/modules/__KERNEL_VERSION__/kernel/drivers/extra/video/nvidia/op
d __PREFIX__/lib/modules/__KERNEL_VERSION__/kernel/drivers/extra/video/nvidia/open-gpu 0755 root root - -
C /etc/nvidia/fabricmanager.cfg - - - -
d /run/nvidia 0700 root root -
D /var/run/nvidia-persistenced 0755 nvidia nvidia - -
20 changes: 20 additions & 0 deletions packages/kmod-6.1-nvidia/kmod-6.1-nvidia.spec
@@ -40,6 +40,8 @@ Source200: nvidia-tmpfiles.conf.in
Source202: nvidia-dependencies-modules-load.conf
Source203: nvidia-fabricmanager.service
Source204: nvidia-fabricmanager.cfg
Source205: nvidia-sysusers.conf
Source206: nvidia-persistenced.service

# NVIDIA tesla conf files from 300 to 399
Source300: nvidia-tesla-tmpfiles.conf
@@ -173,6 +175,8 @@ install -d %{buildroot}%{_cross_libdir}
install -d %{buildroot}%{_cross_tmpfilesdir}
install -d %{buildroot}%{_cross_unitdir}
install -d %{buildroot}%{_cross_factorydir}%{_cross_sysconfdir}/{drivers,ld.so.conf.d}
install -d %{buildroot}%{_cross_sysusersdir}
install -d %{buildroot}%{_cross_bindir}

KERNEL_VERSION=$(cat %{kernel_sources}/include/config/kernel.release)
sed \
@@ -279,10 +283,18 @@ install -m 755 nvidia-smi %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin
install -m 755 nvidia-debugdump %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin
install -m 755 nvidia-cuda-mps-control %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin
install -m 755 nvidia-cuda-mps-server %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin
install -m 755 nvidia-persistenced %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin/
install -m 4755 nvidia-modprobe %{buildroot}%{_cross_bindir}
%if "%{_cross_arch}" == "x86_64"
install -m 755 nvidia-ngx-updater %{buildroot}%{_cross_libexecdir}/nvidia/tesla/bin
%endif

# Users
install -m 0644 %{S:205} %{buildroot}%{_cross_sysusersdir}/nvidia.conf

# Systemd units
install -m 0644 %{S:206} %{buildroot}%{_cross_unitdir}

# We install all the libraries, and filter them out in the 'files' section, so we can catch
# when new libraries are added
install -m 755 *.so* %{buildroot}/%{_cross_libdir}/nvidia/tesla/
@@ -353,6 +365,8 @@ popd
%{_cross_libexecdir}/nvidia/tesla/bin/nvidia-smi
%{_cross_libexecdir}/nvidia/tesla/bin/nv-fabricmanager
%{_cross_libexecdir}/nvidia/tesla/bin/nvswitch-audit
%{_cross_libexecdir}/nvidia/tesla/bin/nvidia-persistenced
%{_cross_bindir}/nvidia-modprobe

# nvswitch topologies
%dir %{_cross_datadir}/nvidia/tesla/nvswitch
@@ -386,6 +400,12 @@ popd
# tmpfiles
%{_cross_tmpfilesdir}/nvidia-tesla.conf

# sysuser files
%{_cross_sysusersdir}/nvidia.conf

# systemd units
%{_cross_unitdir}/nvidia-persistenced.service

# We only install the libraries required by all the DRIVER_CAPABILITIES, described here:
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#driver-capabilities

10 changes: 10 additions & 0 deletions packages/kmod-6.1-nvidia/nvidia-persistenced.service
@@ -0,0 +1,10 @@
[Unit]
Description=NVIDIA Persistence Daemon
After=load-tesla-kernel-modules.service load-open-gpu-kernel-modules.service

[Service]
Type=forking
ExecStart=/usr/libexec/nvidia/tesla/bin/nvidia-persistenced --user nvidia --verbose

[Install]
RequiredBy=preconfigured.target
Comment on lines +9 to +10
Contributor

@bcressey, Oct 4, 2024

If this doesn't specifically need to run in an early phase of boot, I'd just put it with the Fabric Manager in multi-user.target:

Suggested change
- [Install]
- RequiredBy=preconfigured.target
+ [Install]
+ WantedBy=multi-user.target

Author

It can run at any time, but we should prefer that it runs earlier rather than later. Downstream customers may not always initialize the GPU device files themselves, so running this unit early ensures that those files are properly set by the time their units begin (as part of multi-user.target for example).

Contributor

This should run before services like ecs-gpu-init so that we can rely on this service to create the devices.
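
The ordering this thread relies on can be checked mechanically. As a rough sketch only (the helper names `unit_lists` and `unit_waits_for` are hypothetical and not part of this PR), a build-time check could confirm that a dependent unit such as ecs-gpu-init.service names nvidia-persistenced.service both in `After=` and in `Requires=` or `Wants=`:

```shell
# Hypothetical helpers, not part of this PR.

# unit_lists FILE KEY DEP
# Succeeds if FILE has a "KEY=" line whose space-separated value includes DEP.
unit_lists() {
    awk -F= -v key="$2" -v dep="$3" '
        $1 == key {
            n = split($2, parts, " ")
            for (i = 1; i <= n; i++)
                if (parts[i] == dep)
                    found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# unit_waits_for FILE DEP
# Succeeds only if FILE both orders itself after DEP and pulls DEP in.
unit_waits_for() {
    unit_lists "$1" After "$2" || return 1
    unit_lists "$1" Requires "$2" || unit_lists "$1" Wants "$2"
}
```

For example, `unit_waits_for packages/ecs-gpu-init/ecs-gpu-init.service nvidia-persistenced.service` should succeed with the change in this PR. A unit that only declares `After=` would fail the check, because ordering alone does not pull the daemon into the boot transaction.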

1 change: 1 addition & 0 deletions packages/kmod-6.1-nvidia/nvidia-sysusers.conf
@@ -0,0 +1 @@
u nvidia - "nvidia-persistenced user"
Contributor

nit: the username isn't specific to persistenced, so we might use it for something else later:

Suggested change
- u nvidia - "nvidia-persistenced user"
+ u nvidia - "nvidia user"

1 change: 1 addition & 0 deletions packages/kmod-6.1-nvidia/nvidia-tmpfiles.conf.in
@@ -4,3 +4,4 @@ R __PREFIX__/lib/modules/__KERNEL_VERSION__/kernel/drivers/extra/video/nvidia/op
d __PREFIX__/lib/modules/__KERNEL_VERSION__/kernel/drivers/extra/video/nvidia/open-gpu 0755 root root - -
C /etc/nvidia/fabricmanager.cfg - - - -
d /run/nvidia 0700 root root -
D /var/run/nvidia-persistenced 0755 nvidia nvidia - -