NVIDIA drivers not installing on Azure cloud runner #590

Closed
MaxHuerlimann opened this issue Jun 14, 2021 · 8 comments · Fixed by iterative/terraform-provider-iterative#144
Labels
bug (Something isn't working), cml-runner (Subcommand), p0-critical (Max priority, ASAP)


@MaxHuerlimann

Hi everybody
I am trying to use cml-runner on GitLab to deploy a GPU machine on which to run training. The deployment works great, but the Docker container that then runs the training doesn't seem to find any NVIDIA drivers: I can't run 'nvidia-smi'.

My .gitlab-ci.yml looks like this (simplified):

stages:
  - deploy
  - train

deploy:
  stage: deploy
  when: always
  image: dvcorg/cml:0-dvc2-base1-gpu
  script:
    - cml-runner
      --cloud azure
      --cloud-region eu-west
      --cloud-type Standard_NC4as_T4_v3
      --cloud-hdd-size 128
      --cloud-gpu v100
      --labels=cml-runner-gpu

train:
  stage: train
  when: on_success
  image: dvcorg/cml:0-dvc2-base1-gpu
  tags:
    - cml-runner-gpu

  script:
    - nvidia-smi

The examples don't mention that I need to install the drivers myself on the deployed machine; it looks like it should work out of the box. Am I overlooking something? Is that only for AWS? Do I need to pass a script that installs the drivers through --cloud-startup-script?

Cheers

@MaxHuerlimann
Author

MaxHuerlimann commented Jun 14, 2021

Ah, I saw on Discord that there was an issue related to the machine needing a reboot. Apparently it should be resolved, so am I using the wrong Docker image?

@0x2b3bfa0
Member

👋🏼 Welcome, @MaxHuerlimann! Your images are the right ones, and nvidia-smi should work out of the box.

@0x2b3bfa0
Member

Please try running your workflow again and, in case it doesn't work, feel free to ping us again on this issue.

0x2b3bfa0 added the awaiting-response (Waiting for user feedback) label Jun 15, 2021
@MaxHuerlimann
Author

MaxHuerlimann commented Jun 15, 2021

Hi, the issue persists. If I pass a startup script that installs the drivers, though, it works, like this:

stages:
  - deploy
  - train

deploy:
  stage: deploy
  when: always
  image: dvcorg/cml:0-dvc2-base1-gpu
  script:
    - script=$(echo 'sudo apt-get update && sudo apt-get upgrade -y && sudo apt-get install -y nvidia-driver-460' | base64 --wrap 0)
    - cml-runner
      --cloud azure
      --cloud-region eu-west
      --cloud-type Standard_NC4as_T4_v3
      --cloud-hdd-size 128
      --cloud-startup-script $script
      --labels=cml-runner-gpu

train:
  stage: train
  when: on_success
  image: dvcorg/cml:0-dvc2-base1-gpu
  tags:
    - cml-runner-gpu

  script:
    - nvidia-smi

I even manually ssh'd to the created machine and could confirm that no NVIDIA drivers were installed if I didn't pass the above script.
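The manual check described above can be sketched as a small shell helper, run either over SSH on the machine or as the first step of the train job. This is only a sketch: has_nvidia_driver is a hypothetical name, not part of CML, and it assumes that a missing or failing nvidia-smi means the driver is absent or unusable.

```shell
# has_nvidia_driver is a hypothetical helper, not part of CML.
# It succeeds only if nvidia-smi is on PATH and can talk to the driver.
has_nvidia_driver() {
  command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1
}

if has_nvidia_driver; then
  echo "NVIDIA driver detected"
else
  echo "NVIDIA driver missing: nvidia-smi not found or cannot reach the driver" >&2
fi
```

Placing a check like this before the training command makes a missing driver fail with an explicit message instead of surfacing later inside the training code.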

0x2b3bfa0 added the bug (Something isn't working) and p0-critical (Max priority, ASAP) labels and removed the awaiting-response (Waiting for user feedback) label Jun 15, 2021
0x2b3bfa0 changed the title from "NVIDIA driveres on Azure cloud runner" to "NVIDIA drivers not installing on Azure cloud runner" Jun 16, 2021
0x2b3bfa0 added the cml-runner (Subcommand) label Jun 16, 2021
@0x2b3bfa0
Member

From the /var/log/cloud-init-output.log file on a running instance:

$ sudo ubuntu-drivers autoinstall
Reading package lists...
Building dependency tree...
Reading state information...
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 linux-modules-nvidia-460-azure-edge : Depends: linux-modules-nvidia-460-5.4.0-1049-azure (= 5.4.0-1049.51~18.04.1) but it is not going to be installed
                                       Depends: nvidia-kernel-common-460 (<= 460.80-1) but 460.84-0ubuntu0~0.18.04.1 is to be installed
E: Unable to correct problems, you have held broken packages.
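
The second unmet dependency in the log is a plain version conflict: linux-modules-nvidia-460-azure-edge accepts nvidia-kernel-common-460 only up to 460.80-1, while the candidate to be installed is 460.84-0ubuntu0~0.18.04.1. A minimal sketch of that comparison, assuming dpkg is available (as it is on Ubuntu); the variable names are illustrative only:

```shell
# Versions copied from the cloud-init log above.
required_max="460.80-1"
candidate="460.84-0ubuntu0~0.18.04.1"

# dpkg --compare-versions exits 0 when the stated relation holds.
if dpkg --compare-versions "$candidate" le "$required_max"; then
  verdict="satisfies"
else
  verdict="violates"
fi

echo "candidate $candidate $verdict the constraint (<= $required_max)"
```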

@0x2b3bfa0
Member

Solved by removing the graphics-drivers/ppa introduced with iterative/terraform-provider-iterative#135.

$ sudo add-apt-repository --remove ppa:graphics-drivers/ppa
$ sudo apt update
$ sudo ubuntu-drivers autoinstall
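
To see whether a running instance still has the PPA enabled before or after applying the commands above, a check along these lines may help. It is a sketch under assumptions: ppa_configured is a hypothetical helper, and it assumes the one-line .list source format under /etc/apt (as on Ubuntu 18.04), so it would not detect deb822-style .sources files.

```shell
# ppa_configured is a hypothetical helper, not part of any tool mentioned here.
# It looks for an active (uncommented) deb or deb-src line naming the PPA.
ppa_configured() {
  ppa_path="$1"                   # e.g. graphics-drivers/ppa
  apt_dir="${2:-/etc/apt}"        # overridable, mainly for testing
  grep -rqsE -- "^[[:space:]]*deb(-src)?[[:space:]].*${ppa_path}" \
    "$apt_dir/sources.list" "$apt_dir/sources.list.d"
}

if ppa_configured "graphics-drivers/ppa"; then
  echo "graphics-drivers PPA is still enabled"
else
  echo "graphics-drivers PPA is not enabled"
fi
```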

0x2b3bfa0 added a commit to iterative/terraform-provider-iterative that referenced this issue Jun 16, 2021
@0x2b3bfa0
Member

0x2b3bfa0 commented Jun 16, 2021

It should be fixed as soon as we release iterative/terraform-provider-iterative#144

@MaxHuerlimann
Author

That's great! Thanks for the quick fix :)


2 participants