Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gcp quota optimization #2187

Merged

Conversation

shethhriday29
Copy link
Contributor

@shethhriday29 shethhriday29 commented Jul 7, 2023

Description:

  • Drastically reduce failover time when we attempt to provision accelerators in an GCP region where the region's quota for those accelerators is zero
  • This involved directly querying every region's accelerator quota immediately before provision, and if the query returns a 0 value, skipping the provision attempt (and potentially moving onto the next region as a result)
  • Used GCP's CLI commands for query
  • Reduces the failover in these zero-quota cases from 30 seconds (failed provisioning attempt) to <1 second (time for CLI command to run)
  • Had to add a file to the catalog (skypilot-catalog, me/@shethhriday29 opened a corresponding PR there) to map accelerators to their quota keywords, which are needed in the CLI commands

Tested (run the relevant ones):

  • Created new smoke test, test_gcp_zero_quota_failover, that ensures that the zero quota checker works
  • Manually tested functionality by attempting to (across multiple types of accelerators):
  1. Provision resources in regions with zero quota, and ensuring that the automatic failover happens
  2. Provision resources in non-zero but still not-adequate quota, and ensuring that the provision is still attempted
  3. Provision resources in a region with adequate quota and ensuring the provision is successful

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@shethhriday29 shethhriday29 marked this pull request as ready for review July 11, 2023 04:34
@romilbhardwaj romilbhardwaj self-requested a review July 12, 2023 15:05
Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shethhriday29! I'm thinking of refactoring check_quota_available() a bit. What do you think?

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @shethhriday29! Took a look at the code. Haven't tested yet.

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/clouds/gcp.py Show resolved Hide resolved
sky/clouds/gcp.py Outdated Show resolved Hide resolved
sky/clouds/gcp.py Outdated Show resolved Hide resolved
sky/clouds/gcp.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/constants.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/gcp_catalog.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/gcp_catalog.py Outdated Show resolved Hide resolved
tests/backward_compatibility_tests.sh Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @shethhriday29! Works nicely. Found a bug in smoke tests (it was passing silently) - should be good to go after that's fixed.

sky/clouds/gcp.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @shethhriday29! This will save a ton of time during failover on GCP 🚀

@romilbhardwaj romilbhardwaj merged commit c1fd7a9 into skypilot-org:master Jul 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants