[AWS] Failover Optimization from Quota #1953

Merged

Conversation

shethhriday29
Contributor

@shethhriday29 shethhriday29 commented May 12, 2023

Description:

  • Drastically reduce failover time when we try to provision accelerators in an AWS region whose quota for those accelerators is zero
  • Immediately before each provisioning attempt, query the region's quota for the requested accelerator directly; if the query returns 0, skip the attempt and fail over to the next candidate region, if there is one
  • The query uses boto3, the AWS SDK for Python
  • In these zero-quota cases, failover time drops from 30 seconds (a failed provisioning attempt) to <1 second (the time for the boto3 query to return)
  • Added a file to the catalog (skypilot-catalog; I, @shethhriday29, opened a corresponding PR there) that maps accelerators to their quota codes, which the boto3 queries require
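
As a rough illustration of the check described above (not the code merged in this PR), the sketch below asks the AWS Service Quotas API for one quota value and reports whether it is nonzero. The function name `has_nonzero_quota`, the `QUOTA_CODES` table, and the two quota codes in it are assumptions for illustration; the real accelerator-to-quota-code mapping lives in the catalog file mentioned above. The boto3 client is created lazily so that a stand-in client can be passed in for testing.

```python
from typing import Any, Optional

# Illustrative mapping from EC2 instance family to Service Quotas codes.
# These two codes are assumptions to verify in the Service Quotas console;
# the PR keeps its real mapping in the SkyPilot catalog.
QUOTA_CODES = {
    'p': {'on_demand': 'L-417A185B', 'spot': 'L-7212CCBC'},
}


def has_nonzero_quota(region: str,
                      instance_family: str,
                      use_spot: bool = False,
                      client: Optional[Any] = None) -> bool:
    """Return False only when the region's quota is known to be zero."""
    codes = QUOTA_CODES.get(instance_family)
    if codes is None:
        # No known quota code: do not block the provisioning attempt.
        return True
    quota_code = codes['spot' if use_spot else 'on_demand']
    if client is None:
        # Imported lazily so the sketch can be exercised without boto3.
        import boto3
        client = boto3.client('service-quotas', region_name=region)
    try:
        response = client.get_service_quota(ServiceCode='ec2',
                                            QuotaCode=quota_code)
    except Exception:
        # Fail open: a failed query should not prevent provisioning.
        return True
    return response['Quota']['Value'] > 0
```

Failing open on unknown quota codes and on query errors is a design choice of this sketch: the check exists only to skip attempts that are certain to fail, so any uncertainty falls back to the normal provisioning path.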

Tested (run the relevant ones):

  • Ran all AWS smoke tests
  • Created a new smoke test, test_aws_zero_quota_failover, which checks that the zero-quota check works
  • Worked with Romil to test backward compatibility with new catalog changes
  • Manually tested, across multiple types of accelerators, by attempting to:
  1. Provision resources in regions with zero quota, confirming that automatic failover happens
  2. Provision resources in a region with nonzero but insufficient quota, confirming that provisioning is still attempted
  3. Provision resources in a region with adequate quota, confirming that provisioning succeeds
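
The three manual cases above can be pictured with a small failover loop. This is a sketch under assumptions, not SkyPilot's backend code: `provision_with_failover`, `quota_is_nonzero`, and `try_provision` are hypothetical names, and the two callables stand in for the quota query and the real provisioning call.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def provision_with_failover(
        regions: Iterable[str],
        quota_is_nonzero: Callable[[str], bool],
        try_provision: Callable[[str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Try regions in order, skipping those whose quota is zero.

    Returns the region that was provisioned (or None if every region
    failed) together with the list of regions skipped for zero quota.
    """
    skipped: List[str] = []
    for region in regions:
        if not quota_is_nonzero(region):
            # Case 1: zero quota, so skip the slow provisioning attempt.
            skipped.append(region)
            continue
        # Cases 2 and 3: the quota is nonzero, so the attempt is still
        # made; it may fail when the quota is too small for the request.
        if try_provision(region):
            return region, skipped
    return None, skipped
```

With one region at zero quota, one with too little quota, and one with enough, the loop skips the first, attempts and fails in the second, and succeeds in the third.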

@romilbhardwaj romilbhardwaj self-requested a review May 12, 2023 01:04
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Awesome work @shethhriday29, and welcome to SkyPilot! Took a quick glance and left some high-level comments. Will take a closer look later. Also recommend running ./format.sh to automatically format your code to keep the style checker happy.

sky/clouds/service_catalog/constants.py Outdated
sky/clouds/aws.py
sky/clouds/cloud.py
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Awesome work @shethhriday29! Left some comments.

sky/clouds/aws.py Outdated
sky/clouds/cloud.py Outdated
sky/backends/cloud_vm_ray_backend.py Outdated
sky/backends/cloud_vm_ray_backend.py Outdated
sky/backends/cloud_vm_ray_backend.py Outdated
sky/backends/cloud_vm_ray_backend.py Outdated
sky/backends/cloud_vm_ray_backend.py
tests/test_smoke.py Outdated
tests/test_smoke.py Outdated
Collaborator

@romilbhardwaj romilbhardwaj left a comment

This looks ready to go! Thanks for the awesome work @shethhriday29 - this will make failover much faster for AWS! 🚀

@Michaelvll Michaelvll self-requested a review June 21, 2023 20:25
Collaborator

@Michaelvll Michaelvll left a comment

Thanks for adding this feature @shethhriday29! It is exciting to see the optimization. Left several nits about code style and robustness. The PR should be ready to go once the comments are addressed and the tests pass.

sky/backends/cloud_vm_ray_backend.py Outdated
sky/backends/cloud_vm_ray_backend.py Outdated
sky/backends/cloud_vm_ray_backend.py Outdated
region: str,
instance_type: str,
use_spot: bool = False) -> bool:
"""
Collaborator

Style nit: put a single-line description of this function on the first line.

Suggested change
"""
"""Check if the quota is available for the requested instance_type

Member

Google styleguide: First line should be on the same line as the triple quotes. https://google.github.io/styleguide/pyguide.html#383-functions-and-methods
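
For illustration, a docstring laid out the way the two comments above ask for could look like the sketch below. The function name `check_quota_available` and the argument descriptions are assumptions; only the parameter list and the summary sentence come from this thread, and the body is a placeholder.

```python
def check_quota_available(region: str,
                          instance_type: str,
                          use_spot: bool = False) -> bool:
    """Check if the quota is available for the requested instance_type.

    The one-line summary starts on the same line as the opening triple
    quotes; further detail follows after a blank line.

    Args:
        region: AWS region to check.
        instance_type: Instance type to be provisioned.
        use_spot: Whether to check the spot quota instead of on-demand.

    Returns:
        False if the quota is known to be zero, True otherwise.
    """
    # Placeholder body: this sketch only demonstrates docstring layout.
    raise NotImplementedError('Illustration of docstring layout only.')
```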

sky/clouds/aws.py Outdated
sky/clouds/aws.py Outdated
tests/test_smoke.py Outdated
Sheth and others added 4 commits June 21, 2023 21:16
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
@shethhriday29 shethhriday29 changed the title from "Aws quota optimization" to "[AWS] Failover Optimization from Quota" on Jun 22, 2023
Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the fix @shethhriday29! Left a final comment about the imports. Otherwise, it looks good to me.

sky/clouds/aws.py Outdated
sky/clouds/aws.py Outdated
sky/clouds/aws.py Outdated
tests/test_smoke.py Outdated
shethhriday29 and others added 5 commits June 22, 2023 23:29
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the update @shethhriday29! LGTM.

@concretevitamin
Member

This is awesome @shethhriday29. A few UX notes:

  1. sky launch --gpus A100:8 --use-spot --down --cloud aws prints quota warnings like
W 06-23 09:09:45 cloud_vm_ray_backend.py:1947] sky.exceptions.ResourcesUnavailableError: Found no quota for p4d.24xlarge in region ap-northeast-2. To request quotas, check the instruction: https://skypilot.readthedocs.io/en/latest/cloud-setup/quota.html.

However, after removing the --use-spot flag, I could provision successfully.

A few ideas to improve this logging:

  • Say that it is the spot quota that is missing, not the on-demand quota
  • Even better, would it be possible to print the precise quota name used in the AWS quota console, such as "All P Spot Instance Requests" vs. "Running On-Demand P instances"?
  2. In the quota warnings above, it may be better to make the warning yellow. Otherwise it is a bit hard to spot.
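
One way the suggestions above could be combined is sketched below: the message says whether the spot or the on-demand quota is missing, optionally quotes the quota name from the AWS console, and is wrapped in ANSI yellow. The helper `format_quota_warning` and its parameters are hypothetical, not SkyPilot's logging API.

```python
from typing import Optional

# Standard ANSI escape codes for yellow foreground and reset.
YELLOW = '\033[33m'
RESET = '\033[0m'


def format_quota_warning(instance_type: str,
                         region: str,
                         use_spot: bool,
                         quota_name: Optional[str] = None,
                         color: bool = True) -> str:
    """Build a quota warning that names the kind of quota that is missing."""
    kind = 'spot' if use_spot else 'on-demand'
    message = f'Found no {kind} quota for {instance_type} in region {region}'
    if quota_name is not None:
        # Quote the console name so users can find the quota to request.
        message += f' (AWS quota: "{quota_name}")'
    message += '.'
    return f'{YELLOW}{message}{RESET}' if color else message
```

For the case reported above, the call would be `format_quota_warning('p4d.24xlarge', 'ap-northeast-2', use_spot=True, quota_name='All P Spot Instance Requests')`.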

Member

@concretevitamin concretevitamin left a comment

Awesome @shethhriday29! LGTM.

sky/clouds/aws.py
instance_type: str,
use_spot: bool = False) -> bool:
"""
Checks to ensure that a particular accelerator has a nonzero quota
Member

Ditto (Google style guide: the summary should start on the same line as the opening triple quotes).

tests/test_smoke.py Outdated
@romilbhardwaj
Collaborator

Thanks @shethhriday29! Looks like this is ready to go in. Merging now!

@romilbhardwaj romilbhardwaj merged commit 26e30ce into skypilot-org:master Jun 26, 2023
4 participants