-
Notifications
You must be signed in to change notification settings - Fork 510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spot] Make the controller resources configurable #2040
Conversation
sky/execution.py
Outdated
# Backward compatibility: if the user changed the spot-controller.yaml.j2 | ||
# to customize the controller resources, we should use it. | ||
controller_task_resources = list(controller_task.resources)[0] | ||
if not controller_task_resources.is_same_resources(sky.Resources()): | ||
controller_resources = controller_task_resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These may not be necessary, but some of our users should have changed the spot-controller.yaml.j2
to make their jobs work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll, mostly looks good to me.
Q: what's the error shown (if any) if a user has a stopped controller on cloud X, and now spot launch
with a different resources in the config?
@@ -12,3 +12,5 @@ | |||
SPOT_FM_FILE_ONLY_BUCKET_NAME = 'skypilot-filemounts-files-{username}-{id}' | |||
SPOT_FM_LOCAL_TMP_DIR = 'skypilot-filemounts-files-{id}' | |||
SPOT_FM_REMOTE_TMP_DIR = '/tmp/sky-spot-filemounts-files' | |||
|
|||
CONTROLLER_RESOURCES = {'disk_size': 50} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc?
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
…into config-controller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @concretevitamin!
If the user manually change the cloud for the spot controller, when the controller exists on another cloud. The following error will be shown:
> sky spot launch -n test2 echo hi
Task from command: echo hi
Launching a new spot task 'test2'. Proceed? [Y/n]:
Launching managed spot job test2 from spot controller...
Launching spot controller...
sky.exceptions.ResourcesMismatchError: Requested resources do not match the existing cluster.
Requested: 1x AWS(cpus=4+, disk_size=50)
Existing: 1x GCP(n2-standard-4, disk_size=50)
To fix: specify a new cluster name, or down the existing cluster first: sky down sky-spot-controller-9ce1ce58
I think it should be fine to show that. Wdyt?
LGTM, we can see if users have any feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @Michaelvll.
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Closing #2007, #1974, #1078.
It also mitigates #1094, as the user now can specify more CPUs for the controller.
With this PR, the user can specify resources spec for the spot controller so as to change the default cloud, customize the disk size or change the number of CPU cores (the maximum number of spot jobs == 2x #CPUs).
To customize the spot controller resources, the user can have a
~/.sky/config.yaml
with the following fields:Tested (run the relevant ones):
sky spot launch -n test echo hi
with the~/.sky/skypilot_config.yaml
above.sky spot launch -n test2 echo hi
run again to test the availability to launch a second job.pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh