NVIDIA drivers or nvidia-docker issues #860
Looks like host drivers aren't installed correctly, cf. NVIDIA/nvidia-docker#1393? What cloud provider is this on? |
Exactly! We came to the same conclusion. It's also a funny riddle... it's intermittent. |
Hello from the future! Google Cloud provides an elegant solution on their official
|
(Discovered on iterative/terraform-provider-iterative#533) |
@0x2b3bfa0 I am facing this error once every week, for about a day, and then it gets fixed automatically. What do you suggest to solve this? I can't seem to run a driver update before the docker image is created. |
Hello, @bmabir17-asj! Are you using |
@0x2b3bfa0 yes. Below is my workflow
|
@bmabir17-asj, can you please connect through SSH to the Google Cloud instance and paste the output of the following command for a failed run?
|
@0x2b3bfa0 here is the output of the above command |
@0x2b3bfa0 Is there any update regarding this issue? My workflow has been failing for the last week. |
Sorry for the late reply, @bmabir17-asj. The last lines from the attached log file belong to the driver installation part of the script. 🤔 It looks like the log is not complete, or the script was terminated before finishing the install. |
@0x2b3bfa0 It was very difficult to capture the log, as the instance was shutting down as soon as the error occurred. I have captured another log; it has more lines now. Please take a look. |
Line 3613 of the log shows this |
@bmabir17-asj when debugging the GCP instance I have found it helpful to:
|
@dacbd @0x2b3bfa0 after trying a lot of things I can say that this problem is happening because the driver is not installed properly.
$ lspci | grep -i nvidia
00:04.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
The above command proves that the GPU is attached. Is there any way I can choose which machine image is being deployed with cml-runner?
Or, if that is not possible, can I pass metadata like the following:
UPDATE: I tried the second solution, but it just creates a label for the GCP instance. |
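For reference, a minimal way to verify the driver state on a running instance (a sketch using standard Linux/NVIDIA commands; nothing here is specific to cml) could be:

# Confirm the GPU is visible on the PCI bus (as above)
lspci | grep -i nvidia
# Check whether the kernel module is actually loaded; an attached GPU with no
# loaded module is exactly what produces "driver not loaded" errors later on
lsmod | grep nvidia || echo "nvidia kernel module not loaded"
# If the userspace driver is installed and matches the module, this lists the GPU
nvidia-smi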
@dacbd @0x2b3bfa0 I also tried the following from this issue (#590 (comment)), but the error is the same:
script=$(echo 'sudo apt-get update && sudo apt-get upgrade && sudo apt-get install -y nvidia-driver-460' | base64 --wrap 0)
cml-runner \
  --cloud=gcp \
  --cloud-region=us-central1-a \
  --cloud-type=n1-standard-4+nvidia-tesla-k80*1 \
  --cloud-gpu=k80 \
  --cloud-hdd-size=100 \
  --labels=cml-runner \
  --idle-timeout=3000 \
  --single \
  --cloud-metadata="install-nvidia-driver=true" \
  --cloud-startup-script $script
|
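As an aside, the startup script above is passed base64-encoded on a single line; a quick sanity check (a sketch, assuming a POSIX shell on the machine that launches the runner) is to decode it back before launching:

script=$(echo 'sudo apt-get update && sudo apt-get upgrade && sudo apt-get install -y nvidia-driver-460' | base64 --wrap 0)
# Decode to confirm the exact command the instance will execute at boot
echo "$script" | base64 --decode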
I have been working with alternatives and should have something for you to try soon.
|
@bmabir17-asj this has been working well as part of the startup script: https://gist.github.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f
|
After running the script via the startup script, the following error is thrown by the cml runner. This happens sometimes, not always:

ujjnwwvpsm cml.sh[44041]: 334400K .......... .......... .......... .......... .......... 95% 159M 0s
May 21 18:13:52 cml-ujjnwwvpsm c","stack":"Error: terraform -chdir='/home/runner/.cml/cml-ujjnwwvpsm' apply -auto-approve

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # iterative_cml_runner.runner will be created
  + resource "iterative_cml_runner" "runner" {
      + cloud                   = "gcp"
      + cml_version             = "0.15.2"
      + docker_volumes          = []
      + driver                  = "github"
      + id                      = (known after apply)
      + idle_timeout            = 3000
      + instance_hdd_size       = 100
      + instance_ip             = (known after apply)
      + instance_launch_time    = (known after apply)
      + instance_permission_set = "dvc-334@anomaly-detection-engine.iam.gserviceaccount.com,scopes=storage-rw"
      + instance_type           = "n1-standard-4+nvidia-tesla-t4*1"
      + labels                  = "cml-runner"
      + metadata                = {
          + "install-nvidia-driver" = "true"
        }
      + name                    = "cml-ujjnwwvpsm"
      + region                  = "us-central1-a"
      + repo                    = "https://github.com/chowagiken/anomaly_detection_engine"
      + single                  = true
      + spot                    = false
      + spot_price              = -1
      + ssh_public              = (known after apply)
      + startup_script          = (sensitive value)
      + token                   = (sensitive value)
    }

Plan: 1 to add, 0 to change, 0 to destroy.
iterative_cml_runner.runner: Creating...
iterative_cml_runner.runner: Still creating... [10s
...
ujjnwwvpsm cml.sh[44041]: 334400K .......... .......... .......... .......... .......... 95% 159M 0s
May 21 18:13:52 cml-ujjnwwvpsm c
    at /usr/local/lib/node_modules/@dvcorg/cml/src/utils.js:20:27
    at ChildProcess.exithandler (node:child_process:406:5)
    at ChildProcess.emit (node:events:527:28)
    at maybeClose (node:internal/child_process:1092:16)
    at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5)
"status":"terminated"}

This is how I have run your startup script:

steps:
  - uses: ***@***.***
  - uses: ***@***.***
  - name: deploy
    env:
      REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
      GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS_DATA }}
    run: |
      script=$(echo 'curl https://gist.githubusercontent.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f/raw/7130adc8e44501534d9aa2c25aca6a61896a8d64/nvidia-src-setup-apt.sh | bash' | base64 --wrap 0)
      cml-runner \
        --cloud=gcp \
        --cloud-region=us-central1-a \
        --cloud-type=n1-standard-4+nvidia-tesla-t4*1 \
        --cloud-hdd-size=100 \
        --labels=cml-runner \
        --idle-timeout=3000 \
        --single \
        --cloud-metadata="install-nvidia-driver=true" \
        ***@***.***,scopes=storage-rw \
        --cloud-startup-script $script
|
I'll take another look at it, but it looks like the startup script took too long, so I would remove it from the startup script and run it as the first step on the instance; add a sudo so it runs as root.
|
@dacbd are you suggesting something like this?

run:
  needs: deploy-runner
  runs-on: [self-hosted, cml-runner]
  container:
    image: docker://iterativeai/cml:0-dvc2-base1-gpu
    options: --gpus all --shm-size=15gb
  steps:
    - name: cml
      env:
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        TB_CREDENTIALS: ${{ secrets.TB_CREDENTIALS }}
        BRANCH: ${{ steps.branch.outputs.branch }}
        SHA_SHORT: ${{ steps.branch.outputs.sha_short }}
        SHA: ${{ steps.branch.outputs.sha }}
      run: |
        curl https://gist.githubusercontent.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f/raw/7130adc8e44501534d9aa2c25aca6a61896a8d64/nvidia-src-setup-apt.sh | bash
        dvc pull
        dvc repro

If so, won't it try to install the nvidia-driver from inside the docker container? |
I'll give a more complete example when I have some time in front of a computer, as well as actually test it with GCP 😅
|
@bmabir17-asj correct, I wouldn't use the container in this case:

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v3
      - name: run cml
        env:
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.TEMP_KEY }}
          REPO_TOKEN: ${{ secrets.DACBD_PAT }}
        run: |
          cml runner \
            --cloud=gcp \
            --cloud-region=us-central1-a \
            --cloud-type=n1-standard-4+nvidia-tesla-t4*1 \
            --cloud-hdd-size=100 \
            --labels=cml-runner \
            --idle-timeout=3000 \
            --single
  test:
    needs: [deploy]
    runs-on: [cml-runner]
    steps:
      - uses: actions/checkout@v3
      - uses: iterative/setup-dvc@v1
      - run: sudo systemd-run --pipe --service-type=exec bash -c 'curl https://gist.githubusercontent.com/dacbd/c527d1a214f7118e6d66e52a6abb4c4f/raw/db3cba14dcc4a23fb1b7c7a115563942d4164aaf/nvidia-src-setup.sh | bash'
      - run: |
          dvc doctor
          nvidia-smi

Worked for me, you can use something like |
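One thing that may help with the intermittent timing problem is to wait until the freshly installed driver actually responds before starting the pipeline; a minimal sketch (the retry count and sleep interval are arbitrary choices, not part of the workflow above):

# Poll nvidia-smi until the driver answers, then run the pipeline
for i in $(seq 1 30); do
  nvidia-smi && break
  echo "driver not ready yet ($i/30)"; sleep 10
done
dvc pull
dvc repro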
@dacbd And I am also having a similar error when using it with containers.
FYI, your driver installation script was working for a few days. |
@bmabir17-asj they released a new version 17 days ago; I updated the script to try that one. https://github.com/NVIDIA/open-gpu-kernel-modules/tags |
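For context, the linked script builds NVIDIA's open GPU kernel modules from source; a rough sketch of that approach is below. The tag is only illustrative, and note these open modules support Turing-and-newer GPUs such as the T4, not the K80:

# Build and load the open kernel modules for the running kernel
git clone --depth 1 --branch 515.48.07 https://github.com/NVIDIA/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules
make modules -j"$(nproc)"
sudo make modules_install -j"$(nproc)"
sudo modprobe nvidia   # the matching userspace driver still has to be installed separately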
@dacbd I am still getting the same error with the updated install script.
|
@bmabir17-asj without access to your workflow/GCP to try a couple of things out, I think your best bet would be to create your own custom VM image where the drivers you need are correctly set up. If you get that, I can show you how to tweak cml to use that image. |
@dacbd I can make my custom VM image if you can show me how to use that custom VM image with the cml runner. |
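For the custom image route, the usual GCP pattern is to boot an instance, install and verify the driver on it, stop it, and capture its boot disk as an image; a sketch with standard gcloud commands (the instance, image, and family names are hypothetical):

# Stop the prepared instance and turn its boot disk into a reusable image
gcloud compute instances stop driver-base-instance --zone=us-central1-a
gcloud compute images create cml-gpu-image \
  --source-disk=driver-base-instance \
  --source-disk-zone=us-central1-a \
  --family=cml-gpu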
@bmabir17-asj can do, I'll plan to build an example for you (my tomorrow morning). |
@dacbd thanks for the effort 😄 |
Seems like discussion in NVIDIA/nvidia-container-toolkit#257 is also still active 😞 |
@casperdcl, don't we track the issue you mention through NVIDIA/nvidia-docker#1001? 🤔 |
@DavidGOrtega @dacbd, should I use the terraform-provider-iterative instead of CML to provision the GCP instance, given that this error is still causing my workflow to break? |
@bmabir17-asj - might be potentially fixed in iterative/terraform-provider-iterative#607 - @DavidGOrtega should be able to confirm if this is the case. |
@bmabir17-asj sorry for the delay; this should be fixed for you, and the driver setup is now more stable without any workarounds. If you run into anything else, let us know. |
Is this resolved? |
I'm not up to speed on the original docker/NVIDIA issue, but the subsequently discussed GCP/NVIDIA issue has been resolved. |
Closing for now then :) |
@dacbd thanks for the help
|
Valuable information:
|
Coming from Discord:
Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
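That nvml "driver not loaded" message means the NVIDIA container runtime hook could not reach the host kernel module when Docker started the container; a few checks that usually narrow it down (a sketch using standard commands; the CUDA image tag is only illustrative):

# Is the kernel module loaded on the host? Load it if it was only just installed.
lsmod | grep nvidia || sudo modprobe nvidia
# Does the host driver answer at all?
nvidia-smi
# Confirm GPU passthrough with a throwaway CUDA container
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi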