
nvidia-container-cli timeout error when running ECS tasks #3960

Closed
monirul opened this issue May 15, 2024 · 1 comment · Fixed by bottlerocket-os/bottlerocket-core-kit#122
Labels
area/accelerated-computing (Issues related to GPUs/ASICs), status/needs-proposal (Needs a more detailed proposal for next steps), type/bug (Something isn't working)

Comments

monirul (Contributor) commented May 15, 2024

Image I'm using:
aws-ecs-2-nvidia (1.20.0)

What I expected to happen:
The container that requires an NVIDIA GPU should run successfully on the ECS variant of Bottlerocket, and the ECS task should complete successfully.

What actually happened:
When I tried to run a workload that requires an NVIDIA GPU in the ECS cluster, the ECS task failed with the following error:

CannotStartContainerError: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: initialization error: driver rpc error: timed out: unknown

How to reproduce the problem:

  1. Create an ECS cluster.
  2. Provision a p5 instance with the ECS NVIDIA variant and configure it to join the ECS cluster created in the first step.
  3. Create a task that runs a workload requiring an NVIDIA GPU (in my case, the NVIDIA smoke test; see the sketch after this list).
  4. Launch the task in the ECS cluster.
  5. Observe the error message indicating a failure to start the container.
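
For reference, a minimal sketch of the kind of GPU task definition meant in step 3, assuming registration via the AWS CLI; the family name, image tag, and GPU count are placeholders rather than the exact task used in this report:

```sh
# Hypothetical GPU task definition; names, image, and counts are illustrative only.
cat <<'EOF' > gpu-smoke-test.json
{
  "family": "gpu-smoke-test",
  "containerDefinitions": [
    {
      "name": "smoke-test",
      "image": "nvidia/cuda:12.2.0-base-ubuntu22.04",
      "command": ["nvidia-smi"],
      "memory": 512,
      "resourceRequirements": [
        { "type": "GPU", "value": "1" }
      ]
    }
  ]
}
EOF

# Register the task definition so it can be launched in the cluster (step 4).
aws ecs register-task-definition --cli-input-json file://gpu-smoke-test.json
```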

Root Cause
The issue is caused by a timeout while the driver is loaded right before running the container. Generally, the NVIDIA driver gets unloaded when no client is connected to it. If the kernel-mode driver is not already running or connected to a target GPU, invoking any program that attempts to interact with that GPU will transparently cause the driver to load and/or initialize the GPU.
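
One way to observe this behavior (not part of the original report; it assumes shell access on the host, for example through the Bottlerocket admin container) is to check whether the kernel modules are loaded and time the first GPU-touching call:

```sh
# Illustrative check only, assuming host shell access (e.g. the admin container).
# When no client holds the GPU, the NVIDIA kernel modules may be unloaded.
lsmod | grep -i nvidia || echo "nvidia kernel modules not currently loaded"

# The first invocation after an unload pays the driver load/initialization cost;
# repeat runs return quickly.
time nvidia-smi -L
```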

Workaround
To avoid the timeout error, we can enable NVIDIA driver persistence mode by running nvidia-smi -pm 1. This keeps the GPUs initialized even when no clients are connected and prevents the kernel module from fully unloading software and hardware state between clients. That way, the driver does not need to be loaded right before running a container, which prevents the timeout error.
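
As a sketch, assuming shell access on the host (the -pm switch is the standard nvidia-smi option for persistence mode):

```sh
# Enable persistence mode on all GPUs (workaround sketch).
sudo nvidia-smi -pm 1

# Verify the setting took effect.
nvidia-smi --query-gpu=index,persistence_mode --format=csv
```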

Solution
According to NVIDIA documentation, to address this error and minimize the initial driver load time, NVIDIA provides a user-space daemon for Linux, nvidia-persistenced. The daemon keeps driver state persistent across CUDA job runs, which is a more reliable solution than the persistence-mode workaround above.

Proposal
I propose to include the nvidia-persistenced binary, provided by the NVIDIA driver, in Bottlerocket and to run it as a systemd unit. This ensures the NVIDIA driver remains loaded and available, preventing the timeout error from occurring.
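
For illustration, a rough sketch of such a unit, loosely modeled on the sample service file NVIDIA ships with the driver; the paths, ordering, and options here are assumptions, not the final Bottlerocket packaging (which landed in the linked core-kit change):

```sh
# Hypothetical sketch only; paths, ordering, and options are assumptions.
cat <<'EOF' > /etc/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
Before=docker.service

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now nvidia-persistenced.service
```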

@monirul monirul added type/bug Something isn't working status/needs-triage Pending triage or re-evaluation labels May 15, 2024
@monirul monirul changed the title ECS task fails to run on bottlerocket with an error nvidia-container-cli: initialization error: driver rpc error: timed out ECS task fails with an error nvidia-container-cli: initialization error: driver rpc error: timed out May 15, 2024
@monirul monirul changed the title ECS task fails with an error nvidia-container-cli: initialization error: driver rpc error: timed out nvidia-container-cli timeout error when running ECS tasks May 15, 2024
@vigh-m vigh-m added area/accelerated-computing Issues related to GPUs/ASICs status/needs-proposal Needs a more detailed proposal for next steps and removed status/needs-triage Pending triage or re-evaluation labels May 16, 2024
@DamienMatias commented:

I have a similar error on EKS nodes, not sure if I should create a separate issue 🤔
To give you more details, I'm using Bottlerocket OS 1.20.3 (aws-k8s-1.28) and from what I tested, the issue appears with a g5.48xlarge (8 GPUs) but not with a g5.12xlarge (4 GPUs) or smaller.
