Minimal GCP GPU example #1291

Closed
evamaxfield opened this issue Dec 20, 2022 · 4 comments
Labels
cloud-gcp Google Cloud cml-runner Subcommand external-request You asked, we did question User requesting support

Comments

@evamaxfield

evamaxfield commented Dec 20, 2022

Hello!

I have tried to get CML working on GCP with a GPU runner, unfortunately with limited success.

I think I have read most of the GitHub issues related to GCP + GPU configurations:

And I have tried many configurations of:

  • Seemingly as minimal as possible
  • Loading a specific docker container once the runner is ready
  • Different GPUs and different machine types
  • Different Regions / zones

Through various GitHub issues I have arrived at this current "minimal testing action" that spins up a GCP instance, and the runner gets connected as a self-hosted runner in the next job: https://github.com/evamaxfield/gcloud-whisper-testing/blob/main/.github/workflows/runner.yml#L56. I can print the working directory, but as soon as I try nvidia-smi the entire runner is stopped: https://github.com/evamaxfield/gcloud-whisper-testing/actions/runs/3737891152/jobs/6343531827

It would really help to have a minimal GCP example in the documentation. Unless I have configured something incorrectly, I just can't get it to work.

@evamaxfield
Author

I am happy to make a PR with this documentation / example as well. Just trying to figure out what is going wrong right now.

@dacbd
Contributor

dacbd commented Dec 20, 2022

I'll see if I can replicate the error you are having; nothing is jumping out at me from what you shared.

You can check some of our public but not really polished e2e tests for another example: https://github.com/iterative/cml-playground/blob/main/.github/workflows/cml-1049.yml

@evamaxfield
Author

While I couldn't get the exact example you provided (or a couple of variants) working (I assume due to different permissions on tokens): https://github.com/evamaxfield/gcloud-whisper-testing/actions/runs/3752543832/jobs/6374822001

I was able to get something working: https://github.com/evamaxfield/gcloud-whisper-testing/actions/runs/3752736669/jobs/6375333529

I have tried a few configurations to debug what was going wrong, but I couldn't find an obvious cause such as "low memory machine -> runs out of mem and crashes" or "low storage -> runs out of storage while downloading drivers". All seem to work today.

I will keep testing stuff and let you know but I think this can be closed.

@dacbd dacbd added question User requesting support cml-runner Subcommand cloud-gcp Google Cloud labels Dec 21, 2022
@dacbd
Contributor

dacbd commented Dec 21, 2022

@evamaxfield let us know if you encounter an issue. BTW, when you add the JSON key to the GitHub Actions secret store, delete the newlines before and after the { } so that you don't get annoying *** (JSON masking) in your action's logs.
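
One way to apply the tip about newlines in the key is sketched below. It assumes the service account key was downloaded as a pretty-printed JSON file and re-serializes it onto a single line before it is pasted into the secret store. The function name `compact_json_key` and the script name in the usage note are made up for illustration and are not part of CML or the GitHub tooling.

```python
import json
import sys


def compact_json_key(raw: str) -> str:
    """Return the JSON key re-serialized on one line, with no surrounding whitespace."""
    # json.loads validates the key and ignores whitespace around the braces.
    parsed = json.loads(raw)
    # Compact separators avoid padding; dumps without indent emits no literal newlines.
    # Newlines inside values (e.g. the private key) stay escaped as \n.
    return json.dumps(parsed, separators=(",", ":"))


if __name__ == "__main__":
    # Read the key from stdin and write it back without a trailing newline.
    sys.stdout.write(compact_json_key(sys.stdin.read()))
```

Saved as a hypothetical `compact_key.py`, this could be run as `python compact_key.py < key.json`, and the single-line output pasted as the secret value.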
