The driver version we are trying to run is 550-5.15.0-1078-azure.
However, the nvidia-driver init step is failing for the daemonset nvidia-driver-daemonset.
From the logs I see the following:
Get:57 http://us.archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 nvidia-driver-550-server amd64 550.127.08-0ubuntu0.22.04.1 [489 kB]
Ign:6 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-minimal amd64 3.10.12-1~22.04.7
Ign:7 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10-minimal amd64 3.10.12-1~22.04.7
Ign:11 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-stdlib amd64 3.10.12-1~22.04.7
Ign:12 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10 amd64 3.10.12-1~22.04.7
Err:6 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-minimal amd64 3.10.12-1~22.04.7
404 Not Found [IP: 91.189.91.81 80]
Err:7 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10-minimal amd64 3.10.12-1~22.04.7
404 Not Found [IP: 91.189.91.81 80]
Err:11 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-stdlib amd64 3.10.12-1~22.04.7
404 Not Found [IP: 91.189.91.81 80]
Err:12 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10 amd64 3.10.12-1~22.04.7
404 Not Found [IP: 91.189.91.81 80]
Fetched 291 MB in 8s (35.3 MB/s)
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/libpython3.10-minimal_3.10.12-1%7e22.04.7_amd64.deb 404 Not Found [IP: 91.189.91.81 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/python3.10-minimal_3.10.12-1%7e22.04.7_amd64.deb 404 Not Found [IP: 91.189.91.81 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/libpython3.10-stdlib_3.10.12-1%7e22.04.7_amd64.deb 404 Not Found [IP: 91.189.91.81 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/python3.10_3.10.12-1%7e22.04.7_amd64.deb 404 Not Found [IP: 91.189.91.81 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
It seems that it cannot fetch these packages from us.archive.ubuntu.com for whatever reason. I pulled the image locally on my desktop and can reproduce the same problem:
podman pull nvcr.io/nvidia/driver:550-5.15.0-1078-azure-ubuntu22.04
podman run -it --rm --entrypoint /bin/bash nvcr.io/nvidia/driver:550-5.15.0-1078-azure-ubuntu22.04
#now install a package or run nvidia-driver init
apt-get install vim
or
mkdir /run/nvidia
nvidia-driver init
and you will get the 404 Not Found error. Across runs of apt-get install -y vim, I have seen the following IPs listed: [91.189.91.82, 91.189.91.83, 185.125.190.82, 185.125.190.81].
However, if I run apt-get update first, I don't have this problem and the install works. I don't know how to get my gpu-operator to run the daemonset and make sure that apt-get update is run. Maybe this is just a problem with the image itself, or maybe nvidia-driver should run apt-get update before it tries to install the packages.
From the nvidia-driver shell script, which is the entrypoint of the driver daemonset:
# Link and install the kernel modules from a precompiled packages
_install_driver() {
# Install necessary userspace, fabric manager and libnvidia-nscq packages
apt-get install -y --no-install-recommends nvidia-driver-${DRIVER_BRANCH}-server
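To illustrate the suggestion above, here is a minimal sketch of an install step that refreshes the package index first. The function name install_with_fresh_index and the APT_GET override are hypothetical and not part of the upstream nvidia-driver script; only the apt-get flags and DRIVER_BRANCH come from the snippet above.

```shell
#!/bin/sh
# Hypothetical helper, NOT taken from the upstream nvidia-driver script.
# APT_GET can be overridden so the function can be dry-run with a stub.
APT_GET="${APT_GET:-apt-get}"

# Refresh the package index, then install the given packages.
# Returns non-zero if either step fails, so the caller can abort.
install_with_fresh_index() {
    "$APT_GET" update || return 1
    "$APT_GET" install -y --no-install-recommends "$@"
}

# Example call (DRIVER_BRANCH is the variable used in the original script):
# install_with_fresh_index "nvidia-driver-${DRIVER_BRANCH}-server"
```

Running the update inside the same function keeps the index and the install in step. That matches the observation that the install works once apt-get update has been run in the container.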
I removed the args and just put everything in command. Now the driver starts properly. However, this won't work long term, as I am guessing that the next time the operator deploys a new version we will have the same problem.
I was actually surprised to see that the operator didn't revert my manual change to the deployment.
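As a rough sketch of what "putting everything in command" could look like, the helper below builds a single command line that refreshes the apt index and then hands off to the original entrypoint. The function name build_driver_command is hypothetical, and the use of exec is an assumption; the exact command used in the workaround is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper; the name and the use of exec are assumptions.
# Builds one shell command line that could be run via `sh -c` as the
# container command: refresh the apt index, then run the original
# entrypoint passed as arguments.
build_driver_command() {
    # "$*" joins the original entrypoint and its arguments with spaces.
    printf 'apt-get update && exec %s\n' "$*"
}

build_driver_command nvidia-driver init
# prints: apt-get update && exec nvidia-driver init
```

The resulting string would be passed to sh -c in the container's command with args left empty. Whether the operator keeps such a manual override across upgrades is not guaranteed, which is the concern raised above.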
Hello,
We are currently running operator version: 24.6.2