nvidia-container-cli timeout error when running ECS tasks #3960
Labels
- area/accelerated-computing: Issues related to GPUs/ASICs
- status/needs-proposal: Needs a more detailed proposal for next steps
- type/bug: Something isn't working
Image I'm using:
aws-ecs-2-nvidia (1.20.0)
What I expected to happen:
The container that requires an NVIDIA GPU should run successfully on the ECS variant of Bottlerocket, and the ECS task should complete successfully.
What actually happened:
When I tried to run a workload that requires an NVIDIA GPU in the ECS cluster, the ECS task failed with an nvidia-container-cli timeout error.
How to reproduce the problem:
Run an ECS task that requires an NVIDIA GPU on a Bottlerocket aws-ecs-2-nvidia (1.20.0) instance; the task fails with the timeout error described above.
Root Cause
The issue is caused by a timeout while loading the NVIDIA driver right before running the container. Generally, the NVIDIA driver is unloaded when no client is connected to it. If the kernel mode driver is not already running or connected to a target GPU, the invocation of any program that attempts to interact with that GPU will transparently cause the driver to load and/or initialize the GPU.
Workaround:
To avoid the timeout error, we can enable NVIDIA driver persistence mode by running the command nvidia-smi -pm 1. This keeps the GPUs initialized even when no clients are connected and prevents the kernel module from fully unloading software and hardware state between clients. This way, the driver does not need to be loaded before running each container, which prevents the timeout error.
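As a sketch, the workaround would be applied on the host at boot (for example from user data), assuming nvidia-smi is on the PATH; these are standard nvidia-smi flags, but the placement in boot configuration is an assumption:

```shell
# Enable persistence mode on all GPUs (legacy workaround; requires root)
nvidia-smi -pm 1

# Verify: each GPU should now report persistence_mode as "Enabled"
nvidia-smi --query-gpu=persistence_mode --format=csv
```

Note that NVIDIA documents this persistence-mode setting as a legacy mechanism, which is why the daemon-based solution below is preferred.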
Solution
According to NVIDIA documentation, to address this error and minimize the initial driver load time, NVIDIA offers a user-space daemon for Linux. This daemon keeps driver state persistent across CUDA job runs, providing a more reliable solution than the persistence-mode workaround.
Proposal
I propose to include the nvidia-persistenced binary, provided by the NVIDIA driver, in Bottlerocket, and to run it as a systemd unit so that the NVIDIA driver remains loaded and available, preventing the timeout error from occurring.
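A minimal sketch of such a systemd unit, modeled on the sample unit NVIDIA ships with the driver; the ExecStart path and runtime directory are assumptions and would need to match where Bottlerocket installs the driver binaries:

```ini
[Unit]
Description=NVIDIA Persistence Daemon

[Service]
Type=forking
# Path is an assumption; adjust to the actual install location in Bottlerocket
ExecStart=/usr/bin/nvidia-persistenced --verbose
# Clean up the daemon's runtime directory on stop
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target
```

The unit would need to be ordered before the ECS agent starts GPU tasks so the driver is already initialized when nvidia-container-cli runs.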