NVIDIA drivers not installing on Azure cloud runner #590
Ah, I saw on Discord that it was an issue with needing to reboot the machine. Apparently it should be resolved, so am I using the wrong Docker image?
👋🏼 Welcome, @MaxHuerlimann! Your images are the right ones.

Please try running your workflow again and, in case it doesn't work, feel free to ping us again on this issue.
Hi, the issue is still persisting. If I pass a startup script that installs the drivers, it works, though, like this:

```yaml
stages:
  - deploy
  - train

deploy:
  stage: deploy
  when: always
  image: dvcorg/cml:0-dvc2-base1-gpu
  script:
    - script=$(echo 'sudo apt-get update && sudo apt-get upgrade && sudo apt-get install -y nvidia-driver-460' | base64 --wrap 0)
    - cml-runner
      --cloud azure
      --cloud-region eu-west
      --cloud-type Standard_NC4as_T4_v3
      --cloud-hdd-size 128
      --cloud-startup-script $script
      --labels=cml-runner-gpu

train:
  stage: train
  when: on_success
  image: dvcorg/cml:0-dvc2-base1-gpu
  tags:
    - cml-runner-gpu
  script:
    - nvidia-smi
```

I even manually SSH'd into the created machine and confirmed that no NVIDIA drivers were installed if I didn't pass the above script.
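For reference, `--cloud-startup-script` takes a base64-encoded payload, as in the workflow above. A minimal sketch (the script contents here are just an example) of encoding a multi-line script and round-tripping it to verify the payload decodes intact:

```shell
# Hypothetical multi-line startup script; the flag expects it base64-encoded.
startup_script='sudo apt-get update
sudo apt-get install -y nvidia-driver-460'

# --wrap 0 keeps the encoded output on a single line (GNU coreutils base64).
encoded=$(printf '%s' "$startup_script" | base64 --wrap 0)

# Decode it back to confirm nothing was mangled before passing it to cml-runner.
decoded=$(printf '%s' "$encoded" | base64 --decode)
printf '%s\n' "$decoded"
```

Checking the round trip locally is cheap insurance against quoting or line-wrapping mistakes that only surface once the cloud VM boots.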
From the machine, `sudo ubuntu-drivers autoinstall` fails with:

```
$ sudo ubuntu-drivers autoinstall
Reading package lists...
Building dependency tree...
Reading state information...
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 linux-modules-nvidia-460-azure-edge : Depends: linux-modules-nvidia-460-5.4.0-1049-azure (= 5.4.0-1049.51~18.04.1) but it is not going to be installed
                                       Depends: nvidia-kernel-common-460 (<= 460.80-1) but 460.84-0ubuntu0~0.18.04.1 is to be installed
E: Unable to correct problems, you have held broken packages.
```
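The root of the conflict above is that the Azure meta-package pins `nvidia-kernel-common-460 (<= 460.80-1)` while a third-party source offers `460.84-0ubuntu0~0.18.04.1`. A quick sketch confirming that the offered version sorts as newer, using GNU `sort -V` for version ordering (`dpkg --compare-versions` would be the canonical check on the VM itself):

```shell
# Compare the pinned ceiling against the version apt wants to install.
pinned="460.80-1"
offered="460.84-0ubuntu0~0.18.04.1"

# sort -V orders strings as version numbers; tail picks the newest.
newer=$(printf '%s\n' "$pinned" "$offered" | sort -V | tail -n1)
echo "$newer"   # 460.84 wins, violating the <= 460.80-1 dependency
```

This is why apt refuses to resolve: the newer package shadows the only version the Azure kernel modules accept.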
Solved by removing the graphics-drivers PPA:

```shell
$ sudo add-apt-repository --remove ppa:graphics-drivers/ppa
$ sudo apt update
$ sudo ubuntu-drivers autoinstall
```
It should be fixed as soon as we release iterative/terraform-provider-iterative#144.
That's great! Thanks for the quick fix :)
Hi everybody,

I am trying to use `cml-runner` on GitLab to deploy a GPU machine on which to run training. The deployment works great, but the Docker container then running the training can't find any NVIDIA drivers, it seems, as I can't run `nvidia-smi`. My `.gitlab-ci.yml` looks like this (simplified):

In the examples it's not mentioned that I need to install the drivers myself on the deployed machine; it looks like it should work out of the box, or am I overlooking something? Is that only for AWS? Do I need to pass a script installing the drivers through `--cloud-startup-script`?

Cheers