NVIDIA drivers or nvidia-docker issues #860
Looks like host drivers aren't installed correctly, cf. NVIDIA/nvidia-docker#1393? What cloud provider is this on? |
Exactly! We came to the same conclusion. It's also a funny riddle... it's intermittent. |
Hello from the future! Google Cloud provides an elegant solution on their official
|
(Discovered on iterative/terraform-provider-iterative#533) |
@0x2b3bfa0 I am facing this error once every week, for about a day, and then it gets fixed automatically. What do you suggest to solve this? I can't seem to run a driver update before the docker image is created. |
Hello, @bmabir17-asj! Are you using |
@0x2b3bfa0 yes. Below is my workflow
|
@bmabir17-asj, can you please connect through SSH to the Google Cloud instance and paste the output of the following command for a failed run?
|
@0x2b3bfa0 here is the output of the above command |
@0x2b3bfa0 Is there any update regarding this issue? My workflow has been failing for the last week. |
Sorry for the late reply, @bmabir17-asj. The last lines from the attached log file belong to the driver installation part of the script. 🤔 It looks like the log is not complete, or the script was terminated before finishing the install. |
@0x2b3bfa0 It was very difficult to capture the log, as the instance was shutting down as soon as the error occurred. I have captured another log; it has more lines now. Please take a look. |
Line 3613 of the log shows this |
@bmabir17-asj when debugging the GCP instance I have found it helpful to:
|
@dacbd @0x2b3bfa0 after trying a lot of things I can say that this problem is happening because the driver is not installed properly.
$ lspci | grep -i nvidia
00:04.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
The above command proves that the GPU is attached. Is there any way I can choose which machine image is being deployed with cml-runner?
Or, if that is not possible, can I pass metadata like the following:
UPDATE: I tried the second solution, but it just creates a label for the GCP instance. |
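For reference, a minimal way to verify the driver state on a running instance (a sketch using standard Linux/NVIDIA commands; nothing here is specific to cml) could be:

# Confirm the GPU is visible on the PCI bus (as above)
lspci | grep -i nvidia
# Check whether the kernel module is actually loaded; an attached GPU with no
# loaded module is exactly what produces "driver not loaded" errors later on
lsmod | grep nvidia || echo "nvidia kernel module not loaded"
# If the userspace driver is installed and matches the module, this lists the GPU
nvidia-smi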
@dacbd @0x2b3bfa0 I also tried the following from this issue (#590 (comment)), but the error is the same:
script=$(echo 'sudo apt-get update && sudo apt-get upgrade && sudo apt-get install -y nvidia-driver-460' | base64 --wrap 0)
cml-runner \
  --cloud=gcp \
  --cloud-region=us-central1-a \
  --cloud-type=n1-standard-4+nvidia-tesla-k80*1 \
  --cloud-gpu=k80 \
  --cloud-hdd-size=100 \
  --labels=cml-runner \
  --idle-timeout=3000 \
  --single \
  --cloud-metadata="install-nvidia-driver=true" \
  --cloud-startup-script $script
|
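As an aside, the startup script above is passed base64-encoded on a single line; a quick sanity check (a sketch, assuming a POSIX shell on the machine that launches the runner) is to decode it back before launching:

script=$(echo 'sudo apt-get update && sudo apt-get upgrade && sudo apt-get install -y nvidia-driver-460' | base64 --wrap 0)
# Decode to confirm the exact command the instance will execute at boot
echo "$script" | base64 --decode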
I have been working with alternatives and should have something for you to try soon.
|
@bmabir17-asj this has been working well as part of the startup script: https://gist.github.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f
|
After running the script via the startup script, the following error is thrown by the cml runner. This happens sometimes, not always:

ujjnwwvpsm cml.sh[44041]: 334400K .......... .......... .......... .......... .......... 95% 159M 0s
May 21 18:13:52 cml-ujjnwwvpsm c","stack":"Error: terraform -chdir='/home/runner/.cml/cml-ujjnwwvpsm' apply -auto-approve

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # iterative_cml_runner.runner will be created
  + resource "iterative_cml_runner" "runner" {
      + cloud                   = "gcp"
      + cml_version             = "0.15.2"
      + docker_volumes          = []
      + driver                  = "github"
      + id                      = (known after apply)
      + idle_timeout            = 3000
      + instance_hdd_size       = 100
      + instance_ip             = (known after apply)
      + instance_launch_time    = (known after apply)
      + instance_permission_set = "dvc-334@anomaly-detection-engine.iam.gserviceaccount.com,scopes=storage-rw"
      + instance_type           = "n1-standard-4+nvidia-tesla-t4*1"
      + labels                  = "cml-runner"
      + metadata                = {
          + "install-nvidia-driver" = "true"
        }
      + name                    = "cml-ujjnwwvpsm"
      + region                  = "us-central1-a"
      + repo                    = "https://github.com/chowagiken/anomaly_detection_engine"
      + single                  = true
      + spot                    = false
      + spot_price              = -1
      + ssh_public              = (known after apply)
      + startup_script          = (sensitive value)
      + token                   = (sensitive value)
    }

Plan: 1 to add, 0 to change, 0 to destroy.
iterative_cml_runner.runner: Creating...
iterative_cml_runner.runner: Still creating... [10s
...
ujjnwwvpsm cml.sh[44041]: 334400K .......... .......... .......... .......... .......... 95% 159M 0s
May 21 18:13:52 cml-ujjnwwvpsm c
    at /usr/local/lib/node_modules/@dvcorg/cml/src/utils.js:20:27
    at ChildProcess.exithandler (node:child_process:406:5)
    at ChildProcess.emit (node:events:527:28)
    at maybeClose (node:internal/child_process:1092:16)
    at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5)
"status":"terminated"}

This is how I have run your startup script:

steps:
  - uses: ***@***.***
  - uses: ***@***.***
  - name: deploy
    env:
      REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
      GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS_DATA }}
    run: |
      script=$(echo 'curl https://gist.githubusercontent.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f/raw/7130adc8e44501534d9aa2c25aca6a61896a8d64/nvidia-src-setup-apt.sh | bash' | base64 --wrap 0)
      cml-runner \
        --cloud=gcp \
        --cloud-region=us-central1-a \
        --cloud-type=n1-standard-4+nvidia-tesla-t4*1 \
        --cloud-hdd-size=100 \
        --labels=cml-runner \
        --idle-timeout=3000 \
        --single \
        --cloud-metadata="install-nvidia-driver=true" \
        ***@***.***,scopes=storage-rw \
        --cloud-startup-script $script
|
I'll take another look at it, but it looks like the startup script took too long, so I would remove it from the startup script and run it as the first step on the instance; add a sudo so it runs as root.
|
@dacbd are you suggesting something like this?

run:
  needs: deploy-runner
  runs-on: [self-hosted, cml-runner]
  container:
    image: docker://iterativeai/cml:0-dvc2-base1-gpu
    options: --gpus all --shm-size=15gb
  steps:
    - name: cml
      env:
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        TB_CREDENTIALS: ${{ secrets.TB_CREDENTIALS }}
        BRANCH: ${{ steps.branch.outputs.branch }}
        SHA_SHORT: ${{ steps.branch.outputs.sha_short }}
        SHA: ${{ steps.branch.outputs.sha }}
      run: |
        curl https://gist.githubusercontent.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f/raw/7130adc8e44501534d9aa2c25aca6a61896a8d64/nvidia-src-setup-apt.sh | bash
        dvc pull
        dvc repro

If so, won't it try to install the nvidia-driver from inside the docker container? |
I'll give a more complete example when I have some time in front of a computer, as well as actually test it with GCP 😅
|
@bmabir17-asj correct, I wouldn't use the container in this case:

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v3
      - name: run cml
        env:
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.TEMP_KEY }}
          REPO_TOKEN: ${{ secrets.DACBD_PAT }}
        run: |
          cml runner \
            --cloud=gcp \
            --cloud-region=us-central1-a \
            --cloud-type=n1-standard-4+nvidia-tesla-t4*1 \
            --cloud-hdd-size=100 \
            --labels=cml-runner \
            --idle-timeout=3000 \
            --single
  test:
    needs: [deploy]
    runs-on: [cml-runner]
    steps:
      - uses: actions/checkout@v3
      - uses: iterative/setup-dvc@v1
      - run: sudo systemd-run --pipe --service-type=exec bash -c 'curl https://gist.githubusercontent.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f/raw/db3cba14dcc4a23fb1b7c7a115563942d4164aaf/nvidia-src-setup.sh | bash'
      - run: |
          dvc doctor
          nvidia-smi

Worked for me, you can use something like |
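One thing that may help with the intermittent timing problem is to wait until the freshly installed driver actually responds before starting the pipeline; a minimal sketch (the retry count and sleep interval are arbitrary choices, not part of the workflow above):

# Poll nvidia-smi until the driver answers, then run the pipeline
for i in $(seq 1 30); do
  nvidia-smi && break
  echo "driver not ready yet ($i/30)"; sleep 10
done
dvc pull
dvc repro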
@dacbd And I am also having a similar error when using it with containers.
FYI, your driver installation script was working for a few days. |
@bmabir17-asj they released a new version 17 days ago; I updated the script to try that one. https://github.com/NVIDIA/open-gpu-kernel-modules/tags |
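For context, the linked script builds NVIDIA's open GPU kernel modules from source; a rough sketch of that approach is below. The tag is only illustrative, and note these open modules support Turing-and-newer GPUs such as the T4, not the K80:

# Build and load the open kernel modules for the running kernel
git clone --depth 1 --branch 515.48.07 https://github.com/NVIDIA/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules
make modules -j"$(nproc)"
sudo make modules_install -j"$(nproc)"
sudo modprobe nvidia   # the matching userspace driver still has to be installed separately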
@dacbd I am still getting the same error with the updated install script.
|
@bmabir17-asj without access to your workflow/GCP to try a couple of things out, I think your best bet would be to create your own custom VM image where the drivers you need are correctly set up. If you get that, I can show you how to tweak cml to use that image. |
@dacbd I can make my custom VM image if you can show me how to use that custom VM image with the cml runner. |
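For the custom image route, the usual GCP pattern is to boot an instance, install and verify the driver on it, stop it, and capture its boot disk as an image; a sketch with standard gcloud commands (the instance, image, and family names are hypothetical):

# Stop the prepared instance and turn its boot disk into a reusable image
gcloud compute instances stop driver-base-instance --zone=us-central1-a
gcloud compute images create cml-gpu-image \
  --source-disk=driver-base-instance \
  --source-disk-zone=us-central1-a \
  --family=cml-gpu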
@bmabir17-asj can do, I'll plan to build an example for you (my tomorrow morning). |
@dacbd thanks for the effort 😄 |
Seems like discussion in NVIDIA/nvidia-container-toolkit#257 is also still active 😞 |
@casperdcl, don't we track the issue you mention through NVIDIA/nvidia-docker#1001? 🤔 |
@DavidGOrtega @dacbd, should I use the terraform-provider-iterative instead of CML to provision the GCP instance, given that this error is still causing my workflow to break? |
@bmabir17-asj - might be potentially fixed in iterative/terraform-provider-iterative#607 - @DavidGOrtega should be able to confirm if this is the case. |
@bmabir17-asj sorry for the delay; this should be fixed for you, and the driver setup is now more stable without any workarounds. If you run into anything else, let us know. |
Is this resolved? |
I'm not up to speed on the original docker/NVIDIA issue, but the subsequently discussed GCP/NVIDIA issue has been resolved. |
Closing for now then :) |
@dacbd thanks for the help
|
Valuable information:
|
Coming from Discord:
Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
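That nvml "driver not loaded" message means the NVIDIA container runtime hook could not reach the host kernel module when Docker started the container; a few checks that usually narrow it down (a sketch using standard commands; the CUDA image tag is only illustrative):

# Is the kernel module loaded on the host? Load it if it was only just installed.
lsmod | grep nvidia || sudo modprobe nvidia
# Does the host driver answer at all?
nvidia-smi
# Confirm GPU passthrough with a throwaway CUDA container
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi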