Nvidia driver daemonset does not run due to apt-cache issue. #1244

Open
ScottWatsonWork opened this issue Jan 30, 2025 · 1 comment

@ScottWatsonWork

Hello,

We are currently running operator version: 24.6.2

The driver version we are trying to run is 550-5.15.0-1078-azure.

However, the nvidia-driver init step is failing for the daemonset nvidia-driver-daemonset.

From the logs I see the following:

Get:57 http://us.archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 nvidia-driver-550-server amd64 550.127.08-0ubuntu0.22.04.1 [489 kB]
Ign:6 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-minimal amd64 3.10.12-1~22.04.7
Ign:7 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10-minimal amd64 3.10.12-1~22.04.7
Ign:11 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-stdlib amd64 3.10.12-1~22.04.7
Ign:12 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10 amd64 3.10.12-1~22.04.7
Err:6 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-minimal amd64 3.10.12-1~22.04.7
  404  Not Found [IP: 91.189.91.81 80]
Err:7 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10-minimal amd64 3.10.12-1~22.04.7
  404  Not Found [IP: 91.189.91.81 80]
Err:11 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-stdlib amd64 3.10.12-1~22.04.7
  404  Not Found [IP: 91.189.91.81 80]
Err:12 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10 amd64 3.10.12-1~22.04.7
  404  Not Found [IP: 91.189.91.81 80]
Fetched 291 MB in 8s (35.3 MB/s)
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/libpython3.10-minimal_3.10.12-1%7e22.04.7_amd64.deb  404  Not Found [IP: 91.189.91.81 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/python3.10-minimal_3.10.12-1%7e22.04.7_amd64.deb  404  Not Found [IP: 91.189.91.81 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/libpython3.10-stdlib_3.10.12-1%7e22.04.7_amd64.deb  404  Not Found [IP: 91.189.91.81 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/python3.10_3.10.12-1%7e22.04.7_amd64.deb  404  Not Found [IP: 91.189.91.81 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

It seems that it cannot fetch these packages from us.archive.ubuntu.com for whatever reason. I pulled the image locally on my desktop and can replicate the same problem:

podman pull nvcr.io/nvidia/driver:550-5.15.0-1078-azure-ubuntu22.04

podman run -it --rm --entrypoint /bin/bash nvcr.io/nvidia/driver:550-5.15.0-1078-azure-ubuntu22.04

#now install a package or run nvidia-driver init 
apt-get install vim 

or 
mkdir /run/nvidia 
nvidia-driver init

and you will get the 404 Not Found error. Across runs of apt-get install -y vim I have seen the following IPs listed: 91.189.91.82, 91.189.91.83, 185.125.190.82, 185.125.190.81.

However, if I run apt-get update first, I don't have this problem and the install works. I don't know how to get my gpu-operator to run the daemonset so that apt-get update is run first. Maybe this is just a problem with the image itself, or maybe nvidia-driver should run apt-get update before it tries to install the packages.
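
The behavior above would be consistent with a stale package index inside the image: the lists shipped in the image may still point at package versions that are no longer in the archive pool, so the download returns 404 until apt-get update refreshes them. That is an assumption, not something confirmed by the maintainers. A minimal sketch of a refresh-then-install wrapper follows; `install_with_refresh` and the `APT_GET` override are hypothetical names and are not part of the nvidia-driver script.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the nvidia-driver script.
# APT_GET can be overridden (for example with a stub) to test the call order.
APT_GET="${APT_GET:-apt-get}"

# Refresh the package index first, then install the requested packages.
# Returns non-zero without installing if the refresh itself fails.
install_with_refresh() {
    "$APT_GET" update || return 1
    "$APT_GET" install -y --no-install-recommends "$@"
}
```

Inside the container this could be called as `install_with_refresh vim`, or with the driver package name in place of `vim`.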

From the nvidia-driver shell script, which is the entrypoint of the driver daemonset:

# Link and install the kernel modules from a precompiled packages
_install_driver() {
    # Install necessary userspace, fabric manager and libnvidia-nscq packages
    apt-get install -y --no-install-recommends nvidia-driver-${DRIVER_BRANCH}-server
@ScottWatsonWork
Author

As a workaround, since we had already spent 24 hours looking into this issue, I did the following.

I changed the YAML for the daemonset as follows:

      containers:
      - command:
        - /bin/sh
        - -c
        - apt-get update && exec nvidia-driver init
        image: nvcr.io/nvidia/driver:550-5.15.0-1078-azure-ubuntu22.04
        imagePullPolicy: IfNotPresent

I removed the args and put everything in command. The driver now starts properly; however, this won't work long term, as I am guessing that the next time the operator deploys a new version we will have the same problem.

I was actually surprised to see that the operator didn't revert my manual change to the daemonset.
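
For anyone who wants to apply the same change without editing the YAML by hand, here is a rough sketch using a JSON patch. It assumes the driver container is the first container in the pod template, that it currently has an `args` field, and that the daemonset lives in a `gpu-operator` namespace; none of that is confirmed above, and `driver_patch_json` is a hypothetical helper name.

```shell
#!/bin/sh
# Assumptions: the driver container is the first container in the pod template
# and it has an "args" field (the "remove" op fails if that field is absent).

# Print a JSON patch (RFC 6902) that mirrors the manual YAML change:
# set the command to refresh the index before init, and drop the args.
driver_patch_json() {
    cat <<'EOF'
[
  {"op": "add", "path": "/spec/template/spec/containers/0/command",
   "value": ["/bin/sh", "-c", "apt-get update && exec nvidia-driver init"]},
  {"op": "remove", "path": "/spec/template/spec/containers/0/args"}
]
EOF
}

# Usage (the gpu-operator namespace is an assumption; adjust to your cluster):
#   kubectl patch daemonset nvidia-driver-daemonset -n gpu-operator \
#     --type=json -p "$(driver_patch_json)"
```

Like the manual edit, this would probably have to be reapplied if the operator regenerates the daemonset.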
