Minimal GCP GPU example #1291

evamaxfield · 2022-12-20T18:13:20Z

Hello!

I have tried to get CML working on GCP with a GPU runner, unfortunately with limited success.

I think I have read most of the GitHub issues related to GCP + GPU configurations:

And I have tried many configurations:

- Seemingly as minimal as possible
- Loading a specific Docker container once the runner is ready
- Different GPUs and different machine types
- Different regions / zones

Through various GitHub issues I have arrived at this current "minimal testing action". It spins up a GCP instance, and the runner gets connected as a self-hosted runner in the next job: https://github.com/evamaxfield/gcloud-whisper-testing/blob/main/.github/workflows/runner.yml#L56. I can print the working directory, but as soon as I run nvidia-smi the entire runner is stopped: https://github.com/evamaxfield/gcloud-whisper-testing/actions/runs/3737891152/jobs/6343531827

It would really help to have a minimal GCP example in the documentation. Unless I have configured something incorrectly, I just can't get it to work.

evamaxfield · 2022-12-20T18:14:35Z

I am happy to make a PR with this documentation / example as well. Just trying to figure out what is going wrong right now.

dacbd · 2022-12-20T18:25:25Z

I'll see if I can replicate the error you are having; nothing is jumping out at me from what you shared.

For another example, you can check some of our public but not really polished e2e tests: https://github.com/iterative/cml-playground/blob/main/.github/workflows/cml-1049.yml

evamaxfield · 2022-12-21T21:36:36Z

I couldn't get the exact example you provided (or a couple of variants) working, I assume due to different permissions on tokens: https://github.com/evamaxfield/gcloud-whisper-testing/actions/runs/3752543832/jobs/6374822001

However, I was able to get something working: https://github.com/evamaxfield/gcloud-whisper-testing/actions/runs/3752736669/jobs/6375333529

I have tried a few configurations to debug what was going wrong, but I couldn't find an obvious cause such as "low memory machine -> runs out of mem and crashes" or "low storage -> runs out of storage while downloading drivers". All of them seem to work today.

I will keep testing and let you know if anything comes up, but I think this can be closed.

dacbd · 2022-12-21T21:49:03Z

@evamaxfield let us know if you encounter an issue. BTW, when you add the JSON key to the GitHub Actions secret store, delete the newlines before and after the { }. That way you don't get annoying *** (JSON masking) in your action's logs.
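For anyone who would rather not edit the key by hand, here is a minimal Python sketch of that tip. It goes a step further than the advice above and re-serializes the whole key onto a single line, which also removes the newlines around the braces. The function name and the fake key contents are hypothetical and only for illustration; this is not part of CML.

```python
import json


def compact_service_account_key(raw_key: str) -> str:
    """Return the JSON key re-serialized as a single line.

    Hypothetical helper: parses the key and dumps it again without
    indentation, so no raw newlines remain around or inside the { }.
    """
    parsed = json.loads(raw_key)  # raises json.JSONDecodeError if the key is malformed
    # separators=(",", ":") drops the spaces json.dumps would otherwise add
    return json.dumps(parsed, separators=(",", ":"))


if __name__ == "__main__":
    # Fake key for illustration only; a real key has more fields
    example = '\n{\n  "type": "service_account",\n  "project_id": "my-project"\n}\n'
    print(compact_service_account_key(example))
```

Escaped "\n" sequences inside string values (for example in a private key field) are kept as escapes, so the compacted key still parses to the same data. The printed line is what you would paste into the secret's value.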

evamaxfield closed this as completed Dec 21, 2022

dacbd added the question (User requesting support), cml-runner (Subcommand), and cloud-gcp (Google Cloud) labels Dec 21, 2022

evamaxfield mentioned this issue Jan 8, 2023: Feature: Install GCP Ops Agent Automatically #1296 (Open)

casperdcl added the external-request (You asked, we did) label Jan 12, 2023

casperdcl assigned dacbd and evamaxfield Jan 12, 2023
