NVIDIA drivers or nvidia-docker issues #860

Closed
DavidGOrtega opened this issue Jan 6, 2022 · 40 comments · Fixed by iterative/terraform-provider-iterative#383
Labels: cloud-gcp (Google Cloud), cml-image (Subcommand), cml-runner (Subcommand), p0-critical (Max priority (ASAP))

Comments

@DavidGOrtega
Contributor

Coming from Discord:

Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: 
process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , 
stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
DavidGOrtega added the cml-image, cml-runner, and p0-critical labels on Jan 6, 2022
@casperdcl
Contributor

casperdcl commented Jan 6, 2022

Looks like host drivers aren't installed correctly vis. NVIDIA/nvidia-docker#1393? What cloud provider is this on?
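
For reference, a rough way to tell a host-driver problem apart from a container-toolkit problem (a sketch, assuming SSH access to the instance and that Docker plus the NVIDIA container toolkit are already installed; the image tag just reuses the one from this thread):

# does the host itself see the driver?
nvidia-smi
lsmod | grep nvidia
# if the above works, check whether the container runtime can reach the GPU too
docker run --rm --gpus all iterativeai/cml:0-dvc2-base1-gpu nvidia-smi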

@DavidGOrtega
Contributor Author

Looks like host drivers aren't installed correctly vis. NVIDIA/nvidia-docker#1393? What cloud provider is this on?

Exactly! We had the same conclusion

It's also a funny riddle... it's intermittent

@0x2b3bfa0
Member

Hello from the future! Google Cloud provides an elegant solution on their official deeplearning-platform-release images.

/etc/profile.d/install-driver-prompt.sh

#!/bin/bash -eu
if ! nvidia-smi > /dev/null 2>&1; then
  if ! /usr/sbin/dkms status | grep nvidia > /dev/null 2>&1; then
    echo ""
    echo "This VM requires Nvidia drivers to function correctly. \
  Installation takes ~1 minute."
    read -p "Would you like to install the Nvidia driver? [y/n] " yn
    case $yn in
      [Yy]* )
        i=0
        # automatic updates will likely be running, wait for those to finish
        while sudo fuser /var/lib/dpkg/lock \
                         /var/lib/apt/lists/lock \
                         /var/cache/apt/archives/lock >/dev/null 2>&1 ; do
          case $i in
            0 ) j="-" ;;
            1 ) j="\\" ;;
            2 ) j="|" ;;
            3 ) j="/" ;;
          esac
          echo -en "\rWaiting for security updates to finish...$j"
          sleep 1
          i=$(((i+1) % 4))
        done
        echo "Installing Nvidia driver."
        sudo /opt/deeplearning/install-driver.sh
        echo "Nvidia driver installed."
      ;;
      * )
        echo "Nvidia drivers will not be installed. Run the command" \
             "'sudo /opt/deeplearning/install-driver.sh'" \
             "to install the driver packages."
        break
      ;;
    esac
  else
    # Security updates prior to the driver install would not be able to
    # automatically recompile the driver for the new kernel. In these cases,
    # manually recompile the driver.
    echo ""
    echo "Finalizing NVIDIA driver installation."
    DRIVER_DKMS="$(/usr/sbin/dkms status | grep nvidia | \
                   awk -F ", " '{print $1"/"$2}')"
    sudo /usr/sbin/dkms install ${DRIVER_DKMS}
    echo "Driver updated for latest kernel."
    # installation finished, remove prompt
    sudo rm -f /etc/profile.d/install-driver-prompt.sh
  fi
else
  # installation finished, remove prompt
  sudo rm -f /etc/profile.d/install-driver-prompt.sh
fi
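
For unattended use (e.g. from a startup script) the same installer can be invoked directly; a minimal non-interactive sketch, assuming one of the deeplearning-platform-release images where /opt/deeplearning/install-driver.sh exists:

#!/bin/bash
set -euo pipefail
# wait for unattended upgrades to release the apt/dpkg locks
while sudo fuser /var/lib/dpkg/lock /var/lib/apt/lists/lock /var/cache/apt/archives/lock >/dev/null 2>&1; do
  sleep 5
done
# install the driver only if it is not already usable
if ! nvidia-smi >/dev/null 2>&1; then
  sudo /opt/deeplearning/install-driver.sh
fi
nvidia-smi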

@0x2b3bfa0
Member

0x2b3bfa0 commented Apr 25, 2022

(Discovered on iterative/terraform-provider-iterative#533)

@bmabir17-asj

bmabir17-asj commented May 11, 2022

@0x2b3bfa0 I am facing this error about once every week; it lasts for a day and then gets fixed automatically. What do you suggest to solve this? I can't seem to run the driver update before the Docker image is created.
sudo /opt/deeplearning/install-driver.sh
Can you suggest any possible way to run this in our workflow?

@0x2b3bfa0
Member

Hello, @bmabir17-asj! Are you using cml runner on Google Cloud?

@bmabir17-asj

bmabir17-asj commented May 11, 2022

@0x2b3bfa0 yes. Below is my workflow

jobs:  
  deploy-runner:    
    runs-on: [ubuntu-latest]    
    steps:      
      - uses: actions/checkout@v2      
      - uses: iterative/setup-cml@v1      
      - name: deploy        
        env:          
            REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
            GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS_DATA }}         
        run: |          
          cml-runner \
          --cloud=gcp \
          --cloud-region=us-central1-a \
          --cloud-type=n1-standard-4+nvidia-tesla-t4*1 \
          --cloud-hdd-size=100 \
          --labels=cml-runner \
          --idle-timeout=3000 \
          --single \
          --cloud-permission-set=dvc-334@project_name.iam.gserviceaccount.com,scopes=storage-rw
  run:    
    needs: deploy-runner    
    runs-on: [self-hosted,cml-runner]    
    container:       
      image: docker://iterativeai/cml:0-dvc2-base1-gpu     
      options: --gpus all --shm-size=15gb     
    steps:    
    - uses: actions/checkout@v3     
    - uses: actions/setup-python@v2      
      with:        
        python-version: '3.8'
    - name: Get Branch
      id: branch
      shell: bash
      run: |
        echo "##[set-output name=branch;]$(echo ${GITHUB_REF#refs/heads/})"
        echo "::set-output name=sha_short::$(git rev-parse --short HEAD)"
        echo "::set-output name=sha::$(git rev-parse HEAD)"


    - name: Check branch and Hash
      run: |
        echo "Branch: ${{ steps.branch.outputs.branch }}"
        echo "Sha: ${{ steps.branch.outputs.sha_short }}"
    - name: setup git config
      run: |
          git config user.name "$(git log --format='%ae' HEAD^!)"
          git config user.email "$(git log --format='%an' HEAD^!)"
    - name: cml      
      env:        
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        TB_CREDENTIALS: ${{ secrets.TB_CREDENTIALS }}
        BRANCH: ${{ steps.branch.outputs.branch }}
        SHA_SHORT: ${{ steps.branch.outputs.sha_short }}
        SHA: ${{ steps.branch.outputs.sha }}
        

      run: |        
        ### Setting variable
        echo "Branch: $BRANCH"
        echo "Sha: $SHA_SHORT"
        # python --version
        # pip list

        #updates gpg keys for packages
        wget -qO - https://dvc.org/deb/iterative.asc | apt-key add -
        apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
        # apt update 2>&1 1>/dev/null | sed -ne 's/.*NO_PUBKEY //p' | while read key; do if ! [[ ${keys[*]} =~ "$key" ]]; then apt-key adv --keyserver hkps://keyserver.ubuntu.com:443 --recv-keys "$key"; keys+=("$key"); fi; done
        
        apt-get update -y
        apt install imagemagick -y
        apt-get install ffmpeg libsm6 libxext6  -y
        git submodule update --init --recursive                 
        
        ### Install dependencies
        pip install -e ./anomalib/
        ls
        # export PYTHONPATH=.
        ### DVC stuff       
        git fetch --prune
        dvc pull 
        # dvc pull ./dataset/*
        
        ### Tensorboard Config
        cml-tensorboard-dev --logdir "./results/$BRANCH-$SHA_SHORT/patchcore/mvtec/bottle/logs" --md --name "Go to tensorboard" >> tb_report.md
        cml-send-comment tb_report.md

        ### Run the training
        python ./run_train.py --result "./results/$BRANCH-$SHA_SHORT" --commit_id "$SHA"
        # python ./anomalib/tools/train.py --model_config_path ./configs/patchcore/config.yaml
        # dvc repro

@0x2b3bfa0
Copy link
Member

@bmabir17-asj, can you please connect through SSH to the Google Cloud instance and paste the output of the following command for a failed run?

tail -n 10000 -f /var/log/syslog | awk 'match($0, /GCEMetadataScripts: startup-script:/){print $0}'

@bmabir17-asj

@bmabir17-asj, can you please connect through SSH to the Google Cloud instance and paste the output of the following command for a failed run?

tail -n 10000 -f /var/log/syslog | awk 'match($0, /GCEMetadataScripts: startup-script:/){print $0}'

@0x2b3bfa0 here is the output of the above command
out.txt

@bmabir17-asj

@0x2b3bfa0 Is there any update regarding this issue? My workflow has been failing for the last week.

@0x2b3bfa0
Member

Sorry for the late reply, @bmabir17-asj.

The last lines from the attached log file belong to the driver installation part of the script. 🤔 It looks like the log is not complete, or the script was terminated before finishing the install.

@bmabir17-asj

@0x2b3bfa0 It was very difficult to capture the log, as the instance was shutting down as soon as the error occurred. I have captured another log; it has more lines now. Please take a look:
out_2.txt

@bmabir17-asj

bmabir17-asj commented May 16, 2022

@0x2b3bfa0

@0x2b3bfa0 It was very difficult to capture the log, as the instance was shutting down as soon as the error occurred. I have captured another log; it has more lines now. Please take a look: out_2.txt

Line 3613 of the log shows this:
Error: no integrated GPU detected.
That seems strange, because I have checked the instance in the GCP console and it shows a tesla-t4 attached.
Is this happening because GCP is unable to provision a GPU in the given region? If so, wouldn't it fail to create the instance at all?

@dacbd
Contributor

dacbd commented May 16, 2022

@bmabir17-asj when debugging the GCP instance I have found it helpful to:

  • as soon as the instance appears in the web console, edit it, and check the box for "Enable delete protection" (a gcloud equivalent is sketched below)
  • Note: you will need to edit it again, uncheck that box, and then manually delete the instance.
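
A gcloud equivalent, in case the web console is inconvenient (a sketch; INSTANCE_NAME and ZONE are placeholders):

# turn deletion protection on while debugging
gcloud compute instances update INSTANCE_NAME --zone=ZONE --deletion-protection
# turn it off again and clean up once done
gcloud compute instances update INSTANCE_NAME --zone=ZONE --no-deletion-protection
gcloud compute instances delete INSTANCE_NAME --zone=ZONE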

@bmabir17-asj

bmabir17-asj commented May 18, 2022

@dacbd @0x2b3bfa0 After trying a lot of things, I can say that this problem is happening because the driver is not installed properly.
The GPU is present and attached.

$ lspci | grep -i nvidia
00:04.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

The above command proves that the GPU is attached.

Is there any way I can choose which machine image is deployed with cml-runner?
Maybe like the following:

cml-runner \
          --cloud=gcp \
          --cloud-region=us-central1-a \
          --cloud-type=n1-standard-4+nvidia-tesla-k80*1 \
          --cloud-gpu=k80 \
          --cloud-machine-image="nvidia"

Or, if that is not possible, can I pass metadata like the following:

cml-runner \
          --cloud=gcp \
          --cloud-region=us-central1-a \
          --cloud-type=n1-standard-4+nvidia-tesla-k80*1 \
          --cloud-gpu=k80 \
          --cloud-metadata="install-nvidia-driver=true" \

UPDATE: I tried the second solution but it just creates a label for the GCP instance.

@bmabir17-asj

@dacbd @0x2b3bfa0
I also tried the following from this issue, but the error is the same:

      script=$(echo 'sudo apt-get update && sudo apt-get upgrade && sudo apt-get install -y nvidia-driver-460' | base64 --wrap 0)                    
      cml-runner \
          --cloud=gcp \
          --cloud-region=us-central1-a \
          --cloud-type=n1-standard-4+nvidia-tesla-k80*1 \
          --cloud-gpu=k80 \
          --cloud-hdd-size=100 \
          --labels=cml-runner \
          --idle-timeout=3000 \
          --single \
          --cloud-metadata="install-nvidia-driver=true" \
          --cloud-startup-script $script

@dacbd
Contributor

dacbd commented May 18, 2022 via email

@dacbd
Contributor

dacbd commented May 21, 2022

@bmabir17-asj this has been working well as part of the startup script: https://gist.github.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f

curl https://gist.githubusercontent.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f/raw/7130adc8e44501534d9aa2c25aca6a61896a8d64/nvidia-src-setup-apt.sh | bash

@bmabir17-asj

bmabir17-asj commented May 22, 2022

@bmabir17-asj this has been working well as part of the startup script: https://gist.github.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f

curl https://gist.githubusercontent.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f/raw/7130adc8e44501534d9aa2c25aca6a61896a8d64/nvidia-src-setup-apt.sh | bash

After running the script as the startup script, the following error is thrown by the CML runner. This happens sometimes, not always.
So the workflow now runs sometimes, but not always.

ujjnwwvpsm cml.sh[44041]: 334400K .......... ..........\n│ .......... .......... .......... 95%  159M 0s\n│ May 21 18:13:52 cml-ujjnwwvpsm c","stack":"Error: terraform -chdir='/home/runner/.cml/cml-ujjnwwvpsm' apply -auto-approve\n\t\nTerraform used the selected providers to generate the following execution\nplan. Resource actions are indicated with the following symbols:\n  + create\n\nTerraform will perform the following actions:\n\n  # iterative_cml_runner.runner will be created\n  + resource \"iterative_cml_runner\" \"runner\" ***\n      + cloud                   = \"gcp\"\n      + cml_version             = \"0.15.2\"\n      + docker_volumes          = []\n      + driver                  = \"github\"\n      + id                      = (known after apply)\n      + idle_timeout            = 3000\n      + instance_hdd_size       = 100\n      + instance_ip             = (known after apply)\n      + instance_launch_time    = (known after apply)\n      + instance_permission_set = \"dvc-334@anomaly-detection-engine.iam.gserviceaccount.com,scopes=storage-rw\"\n      + instance_type           = \"n1-standard-4+nvidia-tesla-t4*1\"\n      + labels                  = \"cml-runner\"\n      + metadata                = ***\n          + \"install-nvidia-driver\" = \"true\"\n        ***\n      + name                    = \"cml-ujjnwwvpsm\"\n      + region                  = \"us-central1-a\"\n      + repo                    = \"[https://github.com/chowagiken/anomaly_detection_engine\](https://github.com/chowagiken/anomaly_detection_engine/)"\n      + single                  = true\n      + spot                    = false\n      + spot_price              = -1\n      + ssh_public              = (known after apply)\n      + startup_script          = (sensitive value)\n      + token                   = (sensitive value)\n    ***\n\nPlan: 1 to add, 0 to change, 0 to destroy.\niterative_cml_runner.runner: Creating...\niterative_cml_runner.runner: Still creating... [10s
.
.
.
.
ujjnwwvpsm cml.sh[44041]: 334400K .......... ..........\n│ .......... .......... .......... 95%  159M 0s\n│ May 21 18:13:52 cml-
ujjnwwvpsm c\n    at /usr/local/lib/node_modules/@dvcorg/cml/src/utils.js:20:27\n    at ChildProcess.exithandler (node:child_process:406:5)\n    at ChildProcess.emit (node:events:527:28)\n    at maybeClose (node:internal/child_process:1092:16)\n    at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5)","status":"terminated"***

This is how I have run your startup script:

steps:      
      - uses: actions/checkout@v2      
      - uses: iterative/setup-cml@v1      
      - name: deploy        
        env:          
            REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
            GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS_DATA }}         
        run: |
          script=$(echo 'curl https://gist.githubusercontent.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f/raw/7130adc8e44501534d9aa2c25aca6a61896a8d64/nvidia-src-setup-apt.sh | bash' | base64 --wrap 0)
          cml-runner \
          --cloud=gcp \
          --cloud-region=us-central1-a \
          --cloud-type=n1-standard-4+nvidia-tesla-t4*1 \
          --cloud-hdd-size=100 \
          --labels=cml-runner \
          --idle-timeout=3000 \
          --single \
          --cloud-metadata="install-nvidia-driver=true" \
          --cloud-permission-set=dvc-334@anomaly-detection-engine.iam.gserviceaccount.com,scopes=storage-rw \
          --cloud-startup-script $script

@dacbd
Contributor

dacbd commented May 22, 2022 via email

@bmabir17-asj

bmabir17-asj commented May 22, 2022

so I would remove it from the startup script and run it as the
first step on the instance

@dacbd are you suggesting something like this?

run:    
    needs: deploy-runner    
    runs-on: [self-hosted,cml-runner]    
    container:       
      image: docker://iterativeai/cml:0-dvc2-base1-gpu     
      options: --gpus all --shm-size=15gb     
    steps:
    - name: cml      
      env:        
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        TB_CREDENTIALS: ${{ secrets.TB_CREDENTIALS }}
        BRANCH: ${{ steps.branch.outputs.branch }}
        SHA_SHORT: ${{ steps.branch.outputs.sha_short }}
        SHA: ${{ steps.branch.outputs.sha }}
        

      run: |        
        curl https://gist.githubusercontent.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f/raw/7130adc8e44501534d9aa2c25aca6a61896a8d64/nvidia-src-setup-apt.sh | bash
        dvc pull
        dvc repro

If so, won't it try to install the NVIDIA driver from inside the Docker container?

@dacbd
Contributor

dacbd commented May 22, 2022 via email

@dacbd
Contributor

dacbd commented May 23, 2022

@bmabir17-asj correct, I wouldn't use the container in this case:

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v3
      - name: run cml
        env:
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.TEMP_KEY }}
          REPO_TOKEN: ${{ secrets.DACBD_PAT }}
        run: |
          cml runner \
            --cloud=gcp \
            --cloud-region=us-central1-a \
            --cloud-type=n1-standard-4+nvidia-tesla-t4*1 \
            --cloud-hdd-size=100 \
            --labels=cml-runner \
            --idle-timeout=3000 \
            --single

  test:
    needs: [deploy]
    runs-on: [cml-runner]
    steps:
      - uses: actions/checkout@v3
      - uses: iterative/setup-dvc@v1
      - run: sudo systemd-run --pipe --service-type=exec bash -c 'curl https://gist.githubusercontent.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f/raw/db3cba14dcc4a23fb1b7c7a115563942d4164aaf/nvidia-src-setup.sh | bash'
      - run: |
          dvc doctor
          nvidia-smi

This worked for me. You can use something like actions/setup-python to get a specific Python version on the instance.

casperdcl assigned dacbd and unassigned DavidGOrtega on Jun 7, 2022
@bmabir17-asj

@dacbd
Thank you for the example.
But now I am getting a different error:
Failed to initialize NVML: Driver/library version mismatch
This shows while executing nvidia-smi, without using containers.

I am also getting a similar error when using containers:

  Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown
  Error: failed to start containers: c90ce2b71b7226be3f7ef8b1c5e0371c052a4518c46b2617e01d6f0ad51e2d69
  Error: Docker start fail with exit code 1

FYI, your driver installation script was working for a few days.
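
A sketch for narrowing down a driver/library version mismatch (assuming SSH access to the instance and a Debian/Ubuntu base image; paths may vary):

# version of the NVIDIA kernel module that is currently loaded
cat /proc/driver/nvidia/version
# version of the module installed on disk
modinfo nvidia | grep ^version
# versions of the user-space driver packages
dpkg -l | grep -i nvidia
# if the loaded module is older than the installed libraries, reloading the
# modules (or simply rebooting the instance) usually clears the mismatch
sudo reboot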

@dacbd
Contributor

dacbd commented Jun 13, 2022

@bmabir17-asj they released a new version 17 days ago; I updated the script to try that one: https://github.com/NVIDIA/open-gpu-kernel-modules/tags

@bmabir17-asj

bmabir17-asj commented Jun 14, 2022

@dacbd I am still getting the same error with the updated install script.
The following are the last few lines from the install script:

Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 515.48.07
WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.
/
Failed to initialize NVML: Driver/library version mismatch
Module                  Size  Used by
nvidia              35340288  0
btrfs                1388544  0
blake2b_generic        20480  0
xor                    24576  1 btrfs
zstd_compress         176128  1 btrfs
raid6_pq              114688  1 btrfs
ufs                    81920  0
msdos                  20480  0
xfs                  1515520  0
xt_conntrack           16384  1
xt_MASQUERADE          20480  1
xfrm_user              36864  1
xfrm_algo              16384  1 xfrm_user
xt_addrtype            16384  2
iptable_filter         16384  1
iptable_nat            16384  1
nf_nat                 49152  2 iptable_nat,xt_MASQUERADE
bpfilter               16384  0
br_netfilter           28672  0
bridge                266240  1 br_netfilter
stp                    16384  1 bridge
llc                    16384  2 bridge,stp
aufs                  258048  0
overlay               126976  0
nls_iso8859_1          16384  1
dm_multipath           40960  0
scsi_dh_rdac           16384  0
scsi_dh_emc            16384  0
scsi_dh_alua           20480  0
crct10dif_pclmul       16384  1
crc32_pclmul           16384  0
ghash_clmulni_intel    16384  0
aesni_intel           376832  0
psmouse               155648  0
crypto_simd            16384  1 aesni_intel
cryptd                 24576  2 crypto_simd,ghash_clmulni_intel
virtio_net             57344  0
input_leds             16384  0
serio_raw              20480  0
net_failover           20480  1 virtio_net
failover               16384  1 net_failover
efi_pstore             16384  0
sch_fq_codel           20480  5
drm                   557056  1 nvidia
virtio_rng             16384  0
ip_tables              32768  2 iptable_filter,iptable_nat
x_tables               49152  6 xt_conntrack,iptable_filter,xt_addrtype,ip_tables,iptable_nat,xt_MASQUERADE
autofs4                45056  2

install_script_log.txt

@dacbd
Contributor

dacbd commented Jun 14, 2022

@bmabir17-asj without access to your workflow/GCP to try a couple of things out, I think your best bet would be to create your own custom VM image where the drivers you need are correctly set up. If you get that, I can show you how to tweak cml to use that image.
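
For reference, a rough sketch of capturing such an image with gcloud once an instance has working drivers (INSTANCE_NAME, BOOT_DISK_NAME and ZONE are placeholders; how to point cml at the resulting image is the part I'd show separately):

# stop the instance that already has working drivers
gcloud compute instances stop INSTANCE_NAME --zone=ZONE
# create a reusable image from its boot disk
gcloud compute images create my-nvidia-image \
  --source-disk=BOOT_DISK_NAME \
  --source-disk-zone=ZONE \
  --family=my-nvidia-family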

@bmabir17-asj

@dacbd I can make my custom VM image if you can show me how to use that custom VM image with the cml runner.

@dacbd
Contributor

dacbd commented Jun 14, 2022

@bmabir17-asj can do, I'll plan to build an example for you (my tomorrow morning).

@bmabir17-asj

@dacbd thanks for the effort 😄

@casperdcl
Contributor

Seems like discussion in NVIDIA/nvidia-container-toolkit#257 is also still active 😞

@0x2b3bfa0
Member

@casperdcl, don't we track the issue you mention through NVIDIA/nvidia-docker#1001? 🤔

@bmabir17-asj

@DavidGOrtega @dacbd, should I use the iterative Terraform provider instead of CML to provision the GCP instance? This error is still causing my workflow to break.

@danieljimeneznz
Contributor

danieljimeneznz commented Jun 19, 2022

@bmabir17-asj - this might be fixed in iterative/terraform-provider-iterative#607 - @DavidGOrtega should be able to confirm if this is the case.

@dacbd
Contributor

dacbd commented Jun 22, 2022

@bmabir17-asj sorry for the delay; this should be fixed for you, and the driver setup is now more stable without any workarounds. If you run into anything else, let us know.

@casperdcl
Contributor

Is this resolved?

@dacbd
Contributor

dacbd commented Jul 18, 2022

I'm not up to speed on the original Docker/NVIDIA issue, but the subsequently discussed GCP/NVIDIA issue has been resolved.

@casperdcl
Contributor

Closing for now then :)

@bmabir17-asj

bmabir17-asj commented Oct 11, 2022 via email

@0x2b3bfa0
Member

0x2b3bfa0 commented Aug 15, 2023

Valuable information:

  • NVIDIA drivers or nvidia-docker issues #860 (comment)
  • NVIDIA drivers or nvidia-docker issues #860 (comment)
    #!/bin/bash
    driver_version="515.48.07"
    nvidia_dl_path="https://us.download.nvidia.com/XFree86/Linux-x86_64/"
    
    __temp=$(mktemp -d)
    pushd "$__temp" || exit
    apt-get update -y
    apt-get install -y git build-essential
    wget "$nvidia_dl_path$driver_version/NVIDIA-Linux-x86_64-$driver_version.run" -O web-installer
    git clone --depth 1 --branch "$driver_version" https://github.com/NVIDIA/open-gpu-kernel-modules.git
    pushd open-gpu-kernel-modules || exit
    make modules -j"$(nproc)"
    make modules_install -j"$(nproc)"
    popd || exit
    # [--silent] is [--ui=none --no-questions]
    sh ./web-installer --silent --no-kernel-modules
    # uninstall and reinstall because reasons?
    sh ./web-installer --silent --uninstall
    sh ./web-installer --silent --no-kernel-modules
    popd || exit
    sleep 5
    nvidia-smi
    lsmod
    
    if lsmod | grep -q nvidia ; then
      echo "Install Complete"
    else
      echo "Install Failed"
    fi
    # No reboot required
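
If the script above is used with cml-runner via --cloud-startup-script, it presumably needs to be base64-encoded first, following the pattern used earlier in this thread (install-drivers.sh is a hypothetical local copy of the script above):

script=$(base64 --wrap 0 install-drivers.sh)
cml-runner \
  --cloud=gcp \
  --cloud-region=us-central1-a \
  --cloud-type=n1-standard-4+nvidia-tesla-t4*1 \
  --cloud-startup-script="$script"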
