-
Notifications
You must be signed in to change notification settings - Fork 510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GCP: allow stop/autostop for spot VMs. #2877
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the support of stopping a spot instance on GCP @concretevitamin! The code looks mostly good to me! Left two nits.
sky/core.py
Outdated
if handle.launched_resources.use_spot and not cloud.is_same_cloud( | ||
clouds.GCP()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we make this as a CloudImplementationFeatures.STOP_SPOT_INSTANCE
, so we don't have to do a cloud specific check here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, done.
sky/core.py
Outdated
elif (handle.launched_resources.use_spot and not down and not is_cancel and | ||
not handle.launched_resources.cloud.is_same_cloud(clouds.GCP())): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, should we make this one of the CloudImplementationFeatures
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @concretevitamin! It looks mostly good to me with a comment for a refactoring of the check_features_are_supported
function.
sky/core.py
Outdated
cloud.check_features_are_supported( | ||
{clouds.CloudImplementationFeatures.STOP}) | ||
if handle.launched_resources.use_spot: | ||
# Check cloud supports stopping spot instances | ||
supports_stop_spot = True | ||
try: | ||
cloud.check_features_are_supported( | ||
{clouds.CloudImplementationFeatures.STOP_SPOT_INSTANCE}) | ||
except exceptions.NotSupportedError: | ||
supports_stop_spot = False | ||
# Allow GCP spot to be stopped since it preserves disk: | ||
# https://cloud.google.com/compute/docs/instances/preemptible#preemption-process # pylint: disable=line-too-long | ||
if handle.launched_resources.use_spot and not supports_stop_spot: | ||
# Disable spot instances to be stopped. | ||
# TODO(suquark): enable GCP+spot to be stopped in the future. | ||
raise exceptions.NotSupportedError( | ||
f'{colorama.Fore.YELLOW}Stopping cluster ' | ||
f'{cluster_name!r}... skipped.{colorama.Style.RESET_ALL}\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about the following for simplicity?
features = {clouds.CloudImplementationFeatures.STOP}
if handle.launched_resources.use_spot:
features.insert(clouds.CloudImplementationFeatures.STOP_SPOT_INSTANCE)
cloud.check_features_are_supported(features)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since check_features_are_supported() raises an error I found it harder to read. For example, if a cloud doesn't support stopping spot and here the cluster is an on-demand one, we do not want to perform a check on STOP_SPOT_INSTANCE or raise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made some changes. PTAL.
sky/core.py
Outdated
cloud.check_features_are_supported( | ||
{clouds.CloudImplementationFeatures.STOP}) | ||
if handle.launched_resources.use_spot: | ||
# Check cloud supports stopping spot instances | ||
supports_stop_spot = True | ||
try: | ||
cloud.check_features_are_supported( | ||
{clouds.CloudImplementationFeatures.STOP_SPOT_INSTANCE}) | ||
except exceptions.NotSupportedError: | ||
supports_stop_spot = False | ||
# Allow GCP spot to be stopped since it preserves disk: | ||
# https://cloud.google.com/compute/docs/instances/preemptible#preemption-process # pylint: disable=line-too-long | ||
if handle.launched_resources.use_spot and not supports_stop_spot: | ||
# Disable spot instances to be stopped. | ||
# TODO(suquark): enable GCP+spot to be stopped in the future. | ||
raise exceptions.NotSupportedError( | ||
f'{colorama.Fore.YELLOW}Stopping cluster ' | ||
f'{cluster_name!r}... skipped.{colorama.Style.RESET_ALL}\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since check_features_are_supported() raises an error I found it harder to read. For example, if a cloud doesn't support stopping spot and here the cluster is an on-demand one, we do not want to perform a check on STOP_SPOT_INSTANCE or raise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the fix @concretevitamin! The code looks mostly good to me, except for when we raise an error for the STOP_SPOT_INSTANCE being unsupported.
@concretevitamin I refactored the handling for checking supported features. Please take a look : ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll for the refactoring! Failover works correctly now.
I pushed some minor changes in messages. PTAL.
sky/clouds/gcp.py
Outdated
@@ -809,8 +817,6 @@ def need_cleanup_after_preemption(self, | |||
# you must delete it and create a new one ..." | |||
# See: https://cloud.google.com/tpu/docs/preemptible#tpu-vm | |||
|
|||
# pylint: disable=import-outside-toplevel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: Was there a reason this was imported inlined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it was for avoiding circular import, but since we have updated a lot of places, it seems fine to import it at the top level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. :)
Fixes #2837.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_stop_gcp_spot
bash tests/backward_comaptibility_tests.sh