-
Notifications
You must be signed in to change notification settings - Fork 501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Support choosing cloud for Spot controller #3363
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the behavior for the controller resources selection @cblmemo! Left several comments.
sky/utils/controller_utils.py
Outdated
# If the controller and replicas are from the same cloud, it should | ||
# provide better connectivity. We will let the controller choose from | ||
# the clouds of the resources if the controller does not exist. | ||
# TODO(tian): Consider respecting the regions/zones specified for the | ||
# resources as well. | ||
requested_clouds: Set['clouds.Cloud'] = set() | ||
for res in task_resources: | ||
# cloud is an object and will not be able to be distinguished by set. | ||
# Here we manually check if the cloud is in the set. | ||
if res.cloud is not None and not clouds.cloud_in_iterable( | ||
res.cloud, requested_clouds): | ||
requested_clouds.add(res.cloud) | ||
if not requested_clouds: | ||
return {controller_resources_to_use} | ||
return { | ||
controller_resources_to_use.copy(cloud=controller_cloud) | ||
for controller_cloud in requested_clouds | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the service is on Fluidstack, what will happen? Does that mean we will have a controller with GPU? This is not expected. Maybe we should blacklist some clouds that only has GPU instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! Could you help double-check if the list is comprehensive? Currently I only know RunPod, FluidStack and Lambda
sky/utils/controller_utils.py
Outdated
controller_exist = (global_user_state.get_cluster_from_name(controller_name) | ||
is not None) | ||
if controller_exist or controller_resources_to_use.cloud is not None: | ||
return {controller_resources_to_use} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we don't have such check in this previous function? Does that mean a custom resource will cause an existing controller fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have such a check. Here I did a refactor...
Lines 177 to 181 in 99d0aff
if (not controller_exist and | |
controller_resources_in_config.cloud is None): | |
controller_clouds = requested_clouds | |
else: | |
controller_clouds = {controller_resources_in_config.cloud} |
sky/utils/controller_utils.py
Outdated
if res.cloud is not None and not clouds.cloud_in_iterable( | ||
res.cloud, requested_clouds): | ||
if not clouds.cloud_in_iterable(res.cloud, | ||
_CONTROLLER_CLOUD_BLACKLIST): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #3377, we adds the HOST_CONTROLLER
implementation feature in the Cloud
to filter out the clouds that do not suitable for hosting controllers. We might want to do similar thing, instead of having another list for the black list:
Line 44 in 87330bd
HOST_CONTROLLERS = 'host_controllers' |
skypilot/sky/clouds/kubernetes.py
Lines 105 to 106 in 87330bd
unsupported_features[ | |
clouds.CloudImplementationFeatures.HOST_CONTROLLERS] = message |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! Seems like #3377 is still under reviewing. Should we extract those part into a separate PR and merge it first? cc @romilbhardwaj for a look here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good @cblmemo - I'll create a separate PR for CloudImplementationFeatures.HOST_CONTROLLERS
@Michaelvll bump for review when you got time 👀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating this @cblmemo! Left several comments : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @cblmemo! It seems the PR is ready to go. Could you test the case where both spot and serve controller do not exist?
controller_resources_to_use: resources.Resources = list( | ||
controller_resources)[0] | ||
|
||
controller_exist = (global_user_state.get_cluster_from_name( | ||
controller.value.name) is not None) | ||
if controller_exist or controller_resources_to_use.cloud is not None: | ||
return {controller_resources_to_use} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are returning a set of resources anyway, why do we only return the first one but not the whole controller_resources
, i.e. allowing a user to specify multiple resources for the controller? (If needed, we can add a TODO here)
Also, just wondering, do we have some places depending on the results of the resources returned by this function to decide how many services we can run on the controller? In that case, will changing this to a set cause failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not allowing a user to specify multiple resources for the controller?
That is a good point! I think it is at least feasible to support set of resources (i.e. any_of
). One thing we need to discuss is what is the behaviour if the controller specified ordered
resources w/o cloud
specified and we want to override it with task resources cloud? Consider the following:
# ~/.sky/config.yaml
serve:
controller:
resources:
ordered:
- accelerators: L4
- accelerators: T4
# service.yaml
resources:
any_of:
- cloud: aws
- cloud: gcp
Should the controller resoruces be a list?
- If it is a list, what is the order of
aws
andgcp
? - If it is a set, does it break the semantic of
ordered
in controller resources?
do we have some places depending on the results of the resources returned by this function to decide how many services we can run on the controller?
No, this is automatically calculated from system memory. Reference here:
skypilot/sky/serve/serve_utils.py
Lines 43 to 45 in 48a5c63
_SYSTEM_MEMORY_GB = psutil.virtual_memory().total // (1024**3) | |
NUM_SERVICE_THRESHOLD = (_SYSTEM_MEMORY_GB // | |
constants.CONTROLLER_MEMORY_USAGE_GB) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a TODO first. Thanks!
def test_get_controller_resources(controller_type, | ||
custom_controller_resources_config, expected, | ||
enable_all_clouds, monkeypatch): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add type hints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added hint except enable_all_clouds
and monkeypatch
. Do we need to add type annotations for those fixture as well?
('spot', spot_constants.CONTROLLER_RESOURCES), | ||
('serve', serve_constants.CONTROLLER_RESOURCES), | ||
]) | ||
def test_get_controller_resources_with_task_resources( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Re-tested the YAMLs in the PR description with no controller exists and it works well 🫡 |
Addresses the new comment under #3231.
Tested (run the relevant ones):
bash format.sh
pytest tests/unit_tests/test_controller_utils.py
with [Core][BugFix] Remove extra field_require_fuse
inResources.to_yaml_config()
#3457{AWS, GCP}
/{AWS, GCP, Azure}
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh