[Spot] Make the controller resources configurable #2040

Merged on Jun 7, 2023 (11 commits)

Collaborator

@Michaelvll Michaelvll commented Jun 6, 2023

Closing #2007, #1974, #1078.

It also mitigates #1094, as the user can now specify more CPUs for the controller.

With this PR, the user can specify a resources spec for the spot controller to change the default cloud, customize the disk size, or change the number of CPU cores (the maximum number of spot jobs is 2x the number of CPUs).
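
As a rough illustration of the 2x-CPUs rule above, here is a minimal sketch. The helper names `min_cpus` and `max_spot_jobs` are hypothetical and are not part of SkyPilot. It assumes a CPU spec is either a plain count or a count followed by `+` (meaning "at least"), in which case the result is only a lower bound.

```python
import re


def min_cpus(spec) -> int:
    """Return the minimum CPU count implied by a spec such as 8, '8', or '8+'."""
    match = re.fullmatch(r'(\d+)\+?', str(spec).strip())
    if match is None:
        raise ValueError(f'Unrecognized cpus spec: {spec!r}')
    return int(match.group(1))


def max_spot_jobs(cpus_spec) -> int:
    """Apply the 2x-CPUs rule stated in the PR description (hypothetical helper)."""
    return 2 * min_cpus(cpus_spec)


print(max_spot_jobs('20+'))  # 40
print(max_spot_jobs(4))      # 8
```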

To customize the spot controller resources, the user can have a ~/.sky/config.yaml with the following fields:

spot:
  controller:
    resources:
      cloud: gcp
      cpus: 20+
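
One way to picture how such a config could be combined with defaults is the sketch below. It is not the SkyPilot implementation: `get_nested` and `controller_resources` are hypothetical helpers, the config is given as the plain dict a YAML loader would produce, and the only default assumed is the `disk_size` of 50 that appears in this PR's diff.

```python
import copy

# Default taken from the CONTROLLER_RESOURCES constant in this PR's diff.
DEFAULT_CONTROLLER_RESOURCES = {'disk_size': 50}


def get_nested(config: dict, keys: tuple, default=None):
    """Walk nested dicts, returning `default` if any key is missing."""
    current = config
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def controller_resources(config: dict) -> dict:
    """Overlay user overrides (if any) on top of the defaults."""
    merged = copy.deepcopy(DEFAULT_CONTROLLER_RESOURCES)
    overrides = get_nested(config, ('spot', 'controller', 'resources'), {})
    # An empty `resources:` key loads as None, so fall back to {}.
    merged.update(overrides or {})
    return merged


# The dict a YAML loader would produce for the config shown above.
user_config = {
    'spot': {'controller': {'resources': {'cloud': 'gcp', 'cpus': '20+'}}}
}
print(controller_resources(user_config))
print(controller_resources({}))
```

With no config at all, only the default disk size applies; user-supplied keys are layered on top and win on conflict.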

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
    • sky spot launch -n test echo hi with the ~/.sky/config.yaml above.
    • sky spot launch -n test2 echo hi, run again to verify that a second job can be launched.
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll changed the title from "Make the controller resources configurable" to "[Spot] Make the controller resources configurable" on Jun 6, 2023
sky/execution.py Outdated
Comment on lines 689 to 693
# Backward compatibility: if the user changed the spot-controller.yaml.j2
# to customize the controller resources, we should use it.
controller_task_resources = list(controller_task.resources)[0]
if not controller_task_resources.is_same_resources(sky.Resources()):
    controller_resources = controller_task_resources
Collaborator Author

These lines may not be necessary, but some of our users have probably changed spot-controller.yaml.j2 to make their jobs work.
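
The backward-compatibility rule in the snippet above can be sketched in isolation as follows. `ResourcesSpec` is a hypothetical stand-in and not the `sky.Resources` class, and plain dataclass equality stands in for `is_same_resources`. The idea is only that resources found in a hand-edited template take precedence when they differ from an empty default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourcesSpec:
    """Hypothetical stand-in for a resources object; not the SkyPilot class."""
    cloud: Optional[str] = None
    cpus: Optional[str] = None
    disk_size: Optional[int] = None


def choose_controller_resources(template: ResourcesSpec,
                                configured: ResourcesSpec) -> ResourcesSpec:
    """Prefer the template's resources when they differ from the empty default."""
    if template != ResourcesSpec():
        # The template was customized by hand, so keep honoring it.
        return template
    return configured


from_config = ResourcesSpec(cloud='gcp', cpus='20+', disk_size=50)
untouched_template = ResourcesSpec()
edited_template = ResourcesSpec(cloud='aws', disk_size=200)

print(choose_controller_resources(untouched_template, from_config))
print(choose_controller_resources(edited_template, from_config))
```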

Member

@concretevitamin concretevitamin left a comment

Thanks @Michaelvll, mostly looks good to me.

Q: What error is shown (if any) if a user has a stopped controller on cloud X and now runs spot launch with different resources in the config?

@@ -12,3 +12,5 @@
SPOT_FM_FILE_ONLY_BUCKET_NAME = 'skypilot-filemounts-files-{username}-{id}'
SPOT_FM_LOCAL_TMP_DIR = 'skypilot-filemounts-files-{id}'
SPOT_FM_REMOTE_TMP_DIR = '/tmp/sky-spot-filemounts-files'

CONTROLLER_RESOURCES = {'disk_size': 50}
Member

doc?

Michaelvll and others added 5 commits June 6, 2023 13:55
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Collaborator Author

@Michaelvll Michaelvll left a comment

Thanks for the review @concretevitamin!

If the user manually changes the cloud for the spot controller while the controller already exists on another cloud, the following error is shown:

> sky spot launch -n test2 echo hi                                          

Task from command: echo hi
Launching a new spot task 'test2'. Proceed? [Y/n]:
Launching managed spot job test2 from spot controller...
Launching spot controller...

sky.exceptions.ResourcesMismatchError: Requested resources do not match the existing cluster.
  Requested:    1x AWS(cpus=4+, disk_size=50) 
  Existing:     1x GCP(n2-standard-4, disk_size=50)
To fix: specify a new cluster name, or down the existing cluster first: sky down sky-spot-controller-9ce1ce58

I think it should be fine to show that. Wdyt?
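
For readers unfamiliar with this kind of check, a simplified sketch of the idea follows. Everything here is an assumption made for illustration: `ClusterResources`, `check_reusable`, and the local `ResourcesMismatchError` class are stand-ins, and the sketch compares only the cloud, whereas the real check in SkyPilot considers more than that.

```python
from dataclasses import dataclass


class ResourcesMismatchError(Exception):
    """Local stand-in for the exception named in the log above."""


@dataclass(frozen=True)
class ClusterResources:
    """Hypothetical, heavily reduced description of a cluster's resources."""
    cloud: str
    disk_size: int


def check_reusable(cluster_name: str, requested: ClusterResources,
                   existing: ClusterResources) -> None:
    """Raise if the requested cloud differs from the existing cluster's cloud."""
    if requested.cloud.lower() != existing.cloud.lower():
        raise ResourcesMismatchError(
            'Requested resources do not match the existing cluster.\n'
            f'  Requested: {requested}\n'
            f'  Existing:  {existing}\n'
            'To fix: specify a new cluster name, or down the existing '
            f'cluster first: sky down {cluster_name}')


try:
    check_reusable('sky-spot-controller-9ce1ce58',
                   ClusterResources(cloud='aws', disk_size=50),
                   ClusterResources(cloud='gcp', disk_size=50))
except ResourcesMismatchError as e:
    print(e)
```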

@concretevitamin
Member

I think it should be fine to show that. Wdyt?

LGTM, we can see if users have any feedback.

Member

@concretevitamin concretevitamin left a comment

LGTM, thanks @Michaelvll.

Michaelvll and others added 3 commits June 6, 2023 17:27
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
@Michaelvll Michaelvll merged commit d6a6808 into master Jun 7, 2023
@Michaelvll Michaelvll deleted the config-controller branch June 7, 2023 01:42