[Core] Support choosing cloud for Spot controller #3363

Merged: cblmemo merged 16 commits into master from fix-controller-cloud on Apr 23, 2024

Collaborator

@cblmemo cblmemo commented Mar 25, 2024

Addresses the new comment under #3231.

Tested (run the relevant ones):

service:
  readiness_probe: /
  replicas: 1

resources:
  ports: 8000
  any_of:
    - cloud: aws
    - cloud: gcp
    - cloud: kubernetes

run: python -m http.server 8000

resources:
  any_of:
    - cloud: aws
    - cloud: gcp
run: echo "Hello, World!"
---
resources:
  any_of:
    - cloud: aws
    - cloud: runpod
run: echo "Hello, World!"
---
resources:
  any_of:
    - cloud: azure
    - cloud: gcp
    - cloud: kubernetes
run: echo "Hello, World!"

  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for updating the behavior for the controller resources selection @cblmemo! Left several comments.

sky/serve/core.py (two outdated, resolved threads)
Comment on lines 315 to 332
    # If the controller and replicas are from the same cloud, it should
    # provide better connectivity. We will let the controller choose from
    # the clouds of the resources if the controller does not exist.
    # TODO(tian): Consider respecting the regions/zones specified for the
    # resources as well.
    requested_clouds: Set['clouds.Cloud'] = set()
    for res in task_resources:
        # cloud is an object and will not be able to be distinguished by set.
        # Here we manually check if the cloud is in the set.
        if res.cloud is not None and not clouds.cloud_in_iterable(
                res.cloud, requested_clouds):
            requested_clouds.add(res.cloud)
    if not requested_clouds:
        return {controller_resources_to_use}
    return {
        controller_resources_to_use.copy(cloud=controller_cloud)
        for controller_cloud in requested_clouds
    }
Collaborator

If the service is on FluidStack, what will happen? Does that mean we will have a controller with a GPU? This is not expected. Maybe we should blacklist clouds that only have GPU instances.

Collaborator Author

Sounds good! Could you help double-check whether the list is comprehensive? So far I only know of RunPod, FluidStack, and Lambda.
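
To illustrate the selection logic discussed in this thread, here is a minimal sketch. It is not the code from this PR: `Resources` is a hypothetical stand-in with only the fields needed here, cloud names are plain strings, the result is a list rather than a set so the order is easy to read, and `GPU_ONLY_CLOUDS` simply contains the three clouds named in the comment above.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in for a resources object (not SkyPilot's class)."""
    cloud: Optional[str] = None
    cpus: str = '4+'  # Placeholder value, for illustration only.


# Assumption: the clouds named in this thread as offering only GPU instances.
GPU_ONLY_CLOUDS = frozenset({'runpod', 'fluidstack', 'lambda'})


def controller_candidates(controller: Resources,
                          task_resources: Iterable[Resources]
                         ) -> List[Resources]:
    """Return one controller candidate per distinct, suitable task cloud."""
    requested: List[str] = []
    for res in task_resources:
        # Skip resources without a cloud and clouds unsuitable for controllers.
        if res.cloud is None or res.cloud in GPU_ONLY_CLOUDS:
            continue
        if res.cloud not in requested:
            requested.append(res.cloud)
    if not requested:
        # Nothing usable was requested: keep the controller unconstrained.
        return [controller]
    return [replace(controller, cloud=cloud) for cloud in requested]


candidates = controller_candidates(
    Resources(),
    [Resources(cloud='aws'), Resources(cloud='runpod'), Resources(cloud='aws')],
)
```

With these inputs only `aws` survives the filter, so a service that also lists a GPU-only cloud would still get a CPU controller on `aws`.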

Comment on lines 310 to 313
    controller_exist = (global_user_state.get_cluster_from_name(controller_name)
                        is not None)
    if controller_exist or controller_resources_to_use.cloud is not None:
        return {controller_resources_to_use}
Collaborator

Why don't we have such a check in the previous function? Does that mean a custom resource will cause an existing controller to fail?

Collaborator Author

We do have such a check. Here I did a refactor...

skypilot/sky/serve/core.py

Lines 177 to 181 in 99d0aff

    if (not controller_exist and
            controller_resources_in_config.cloud is None):
        controller_clouds = requested_clouds
    else:
        controller_clouds = {controller_resources_in_config.cloud}
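
The precedence in the snippet above can be restated as a small standalone function. This is only a sketch under stated assumptions: cloud names are plain strings, `None` means "no cloud constraint", and `pick_controller_clouds` is a hypothetical name that does not exist in the codebase.

```python
from typing import Optional, Set


def pick_controller_clouds(controller_exists: bool,
                           configured_cloud: Optional[str],
                           requested_clouds: Set[str]
                          ) -> Set[Optional[str]]:
    """Decide which clouds the controller may be launched on."""
    # An existing controller or an explicitly configured cloud always wins,
    # so the task's clouds never override either of them.
    if controller_exists or configured_cloud is not None:
        return {configured_cloud}
    # No controller yet and no configured cloud: follow the task's clouds,
    # or stay unconstrained if the task did not name any.
    if not requested_clouds:
        return {None}
    return set(requested_clouds)
```

Under this reading, a custom task resource cannot move or break an existing controller, because the first branch returns before the task's clouds are consulted.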

@cblmemo cblmemo requested a review from Michaelvll March 31, 2024 04:33
        if res.cloud is not None and not clouds.cloud_in_iterable(
                res.cloud, requested_clouds):
            if not clouds.cloud_in_iterable(res.cloud,
                                            _CONTROLLER_CLOUD_BLACKLIST):
Collaborator

In #3377, we add the HOST_CONTROLLERS implementation feature to Cloud to filter out the clouds that are not suitable for hosting controllers. We might want to do something similar here instead of keeping a separate blacklist:

HOST_CONTROLLERS = 'host_controllers'

unsupported_features[
    clouds.CloudImplementationFeatures.HOST_CONTROLLERS] = message

Collaborator Author

Sounds good! It seems #3377 is still under review. Should we extract that part into a separate PR and merge it first? cc @romilbhardwaj for a look here

Collaborator

sounds good @cblmemo - I'll create a separate PR for CloudImplementationFeatures.HOST_CONTROLLERS
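
As a rough illustration of the feature-flag approach suggested in this thread (as opposed to a separate blacklist), the sketch below lets each cloud declare what it cannot do and filters on that. The names `CloudFeature`, `UNSUPPORTED_FEATURES`, and `clouds_supporting` are hypothetical, and the per-cloud entries and message text are placeholders; only the `'host_controllers'` value is taken from the snippet quoted above.

```python
import enum
from typing import Dict, Iterable, List


class CloudFeature(enum.Enum):
    """Hypothetical subset of per-cloud implementation features."""
    HOST_CONTROLLERS = 'host_controllers'


# Placeholder declarations: each cloud maps unsupported features to a reason.
UNSUPPORTED_FEATURES: Dict[str, Dict[CloudFeature, str]] = {
    'aws': {},
    'gcp': {},
    'runpod': {
        CloudFeature.HOST_CONTROLLERS: 'Placeholder reason for illustration.',
    },
}


def clouds_supporting(feature: CloudFeature,
                      candidates: Iterable[str]) -> List[str]:
    """Keep only the candidate clouds that do not mark `feature` unsupported."""
    return [
        cloud for cloud in candidates
        if feature not in UNSUPPORTED_FEATURES.get(cloud, {})
    ]


usable = clouds_supporting(CloudFeature.HOST_CONTROLLERS,
                           ['aws', 'runpod', 'gcp'])
```

The design benefit is that the knowledge lives with each cloud's own declaration, so adding a new GPU-only cloud does not require touching a central list.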

@cblmemo cblmemo requested a review from Michaelvll April 7, 2024 02:51
Collaborator Author

cblmemo commented Apr 17, 2024

@Michaelvll bump for review when you get time 👀

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for updating this @cblmemo! Left several comments : )

sky/serve/core.py (outdated, resolved)
sky/utils/controller_utils.py (two outdated, resolved threads)
tests/unit_tests/test_controller_utils.py (outdated, resolved)
Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the update @cblmemo! It seems the PR is ready to go. Could you test the case where neither the spot controller nor the serve controller exists?

Comment on lines +333 to +339
    controller_resources_to_use: resources.Resources = list(
        controller_resources)[0]

    controller_exist = (global_user_state.get_cluster_from_name(
        controller.value.name) is not None)
    if controller_exist or controller_resources_to_use.cloud is not None:
        return {controller_resources_to_use}
Collaborator

If we are returning a set of resources anyway, why do we only return the first one but not the whole controller_resources, i.e. allowing a user to specify multiple resources for the controller? (If needed, we can add a TODO here)

Also, just wondering, do we have some places depending on the results of the resources returned by this function to decide how many services we can run on the controller? In that case, will changing this to a set cause failure?

Collaborator Author

why not allowing a user to specify multiple resources for the controller?

That is a good point! I think it is at least feasible to support a set of resources (i.e. any_of). One thing we need to discuss is the behaviour when the controller config specifies ordered resources without a cloud and we want to override the cloud with the clouds from the task resources. Consider the following:

# ~/.sky/config.yaml
serve:
  controller:
    resources:
      ordered:
        - accelerators: L4
        - accelerators: T4

# service.yaml
resources:
  any_of:
    - cloud: aws
    - cloud: gcp

Should the controller resources be a list?

  • If it is a list, what is the order of aws and gcp?
  • If it is a set, does it break the semantics of ordered in the controller resources?
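
The ambiguity in these two questions can be made concrete with a small sketch. It does not reflect what SkyPilot does; it only enumerates two plausible ways to combine the `ordered` accelerators from the config with the `any_of` clouds from the service YAML above, using plain strings.

```python
from itertools import product

ordered_accelerators = ['L4', 'T4']  # From the `ordered` controller config.
task_clouds = ['aws', 'gcp']         # From `any_of` in the service YAML.

# Option 1: accelerator preference dominates; clouds vary within each tier.
accelerator_major = list(product(ordered_accelerators, task_clouds))

# Option 2: cloud dominates; the accelerator order is kept within each cloud.
cloud_major = [(accel, cloud)
               for cloud, accel in product(task_clouds, ordered_accelerators)]

# Both options cover the same pairs; they differ only in ranking, and both
# silently impose an order on aws vs. gcp that `any_of` never expressed.
same_pairs = set(accelerator_major) == set(cloud_major)
```

Either list forces a ranking between `aws` and `gcp`, while collapsing to a set discards the L4-before-T4 preference, which is the trade-off being asked about.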

do we have some places depending on the results of the resources returned by this function to decide how many services we can run on the controller?

No, this is automatically calculated from system memory. Reference here:

_SYSTEM_MEMORY_GB = psutil.virtual_memory().total // (1024**3)
NUM_SERVICE_THRESHOLD = (_SYSTEM_MEMORY_GB //
                         constants.CONTROLLER_MEMORY_USAGE_GB)
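
For a feel of how this threshold behaves, here is the same arithmetic as a pure function. It is a sketch only: the total memory is passed in instead of being read with psutil, `num_service_threshold` is a hypothetical name, and the per-service figure used in the example is an assumed value, not the real `CONTROLLER_MEMORY_USAGE_GB`.

```python
def num_service_threshold(total_memory_bytes: int,
                          controller_memory_usage_gb: int) -> int:
    """Number of services that fit, using the integer math quoted above."""
    # Whole GiB available on the controller; any fraction is dropped.
    system_memory_gb = total_memory_bytes // (1024**3)
    # Whole services that fit at the assumed per-service memory cost.
    return system_memory_gb // controller_memory_usage_gb


# Assumed per-service cost of 2 GB, for illustration only.
threshold_16g = num_service_threshold(16 * 1024**3, 2)
```

Because both divisions round down, a machine reporting slightly less than a whole number of GiB loses capacity at each step, and the result depends only on the controller's memory, not on the resources object returned by the function under review.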

Collaborator Author

Added a TODO first. Thanks!

Comment on lines 37 to 39
def test_get_controller_resources(controller_type,
                                  custom_controller_resources_config, expected,
                                  enable_all_clouds, monkeypatch):
Collaborator

Could we add type hints?

Collaborator Author

Added type hints for everything except enable_all_clouds and monkeypatch. Do we need to add type annotations for those fixtures as well?

('spot', spot_constants.CONTROLLER_RESOURCES),
('serve', serve_constants.CONTROLLER_RESOURCES),
])
def test_get_controller_resources_with_task_resources(
Collaborator

Nice!

Collaborator Author

cblmemo commented Apr 23, 2024

Thanks for the update @cblmemo! It seems the PR is ready to go. Could you test the case where neither the spot controller nor the serve controller exists?

Re-tested the YAMLs in the PR description with no existing controller, and it works well 🫡

@cblmemo cblmemo merged commit cc1c58b into master Apr 23, 2024
20 checks passed
@cblmemo cblmemo deleted the fix-controller-cloud branch April 23, 2024 03:21
Successfully merging this pull request may close these issues.

[Serve] Fail to launch controller when no resources is specified for a service when firstly used
3 participants