[Spot] Make the controller resources configurable #2040

Merged
merged 11 commits into from
Jun 7, 2023
11 changes: 4 additions & 7 deletions docs/source/examples/spot-jobs.rst
@@ -242,17 +242,16 @@ you can still tear it down manually with
.. note::
Tearing down the spot controller will lose all logs and status information for the spot jobs and can cause resource leakage when there are still in-progress spot jobs.

Customize spot controller resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Customizing spot controller resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although the default setting works for most of people, you may want to customize the resources of the spot controller for the following reasons:
You may customize the resources of the spot controller for the following reasons:

1. Enforcing the spot controller to run on a specific location. (Default: cheapest location)
2. Changing the maximum number of spot jobs that can be run concurrently. (Default: 16)
3. Changing the disk_size of the spot controller to store more logs. (Default: 50GB)
4. Using a specific instance type for the spot controller. (Default: choose automatically)

To achieve the above goals, you can specify the configs in the :code:`~/.sky/skypilot_config.yaml` with the following fields (the :code:`resources` field has the same spec as a normal SkyPilot job, see `here <https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html>`_):
To achieve the above, you can specify custom configs in :code:`~/.sky/config.yaml` with the following fields (the :code:`resources` field has the same spec as a normal SkyPilot job; see `here <https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html>`_):

.. code-block:: yaml

@@ -267,8 +266,6 @@ To achieve the above goals, you can specify the configs in the :code:`~/.sky/sky
cpus: 4+ # number of vCPUs, max # spot jobs = 2 * cpus
# 3. Specify the disk_size of the spot controller.
disk_size: 100
# 4. Specify the instance type of the spot controller.
instance_type: n1-standard-4
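
The YAML comment above states that the maximum number of concurrent spot jobs is ``2 * cpus``. A minimal sketch of that arithmetic follows; the helper names ``min_cpus`` and ``max_concurrent_spot_jobs`` are hypothetical and are not part of SkyPilot. It assumes a trailing ``+`` in the ``cpus`` field means "at least this many vCPUs", so the result is a lower bound when the provisioned instance ends up larger.

```python
def min_cpus(cpus_spec) -> int:
    """Return the minimum vCPU count from a spec such as 4, '4', or '4+'.

    Hypothetical helper for illustration; it assumes a trailing '+'
    means "at least this many vCPUs".
    """
    text = str(cpus_spec).strip()
    if text.endswith('+'):
        text = text[:-1]
    count = int(text)
    if count <= 0:
        raise ValueError(f'cpus must be positive, got {cpus_spec!r}')
    return count


def max_concurrent_spot_jobs(cpus_spec) -> int:
    """Apply the rule from the docs comment: max # spot jobs = 2 * cpus."""
    return 2 * min_cpus(cpus_spec)


if __name__ == '__main__':
    # 8 vCPUs gives 16 jobs, matching the documented default of 16.
    print(max_concurrent_spot_jobs(8))
    # '4+' guarantees at least 4 vCPUs, hence at least 8 concurrent jobs.
    print(max_concurrent_spot_jobs('4+'))
```

Under this rule, the documented default of 16 concurrent jobs corresponds to an 8-vCPU controller, which agrees with the default instance types listed in ``sky/spot/constants.py``.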



19 changes: 11 additions & 8 deletions sky/execution.py
@@ -673,11 +673,13 @@ def spot_launch(
controller_resources = sky.Resources.from_yaml_config(
controller_resources_config)
except ValueError as e:
raise ValueError(
'Spot controller resources is not valid, please check '
'~/.sky/skypilot_config.yaml file. Details:\n'
f' {common_utils.format_exception(e, use_bracket=True)}'
) from e
with ux_utils.print_exception_no_traceback():
raise ValueError(
'Spot controller resources is not valid, please check '
'~/.sky/config.yaml file and make sure it\'s a '
'valid resources spec. Details:\n'
f' {common_utils.format_exception(e, use_bracket=True)}'
) from e

yaml_path = os.path.join(spot.SPOT_CONTROLLER_YAML_PREFIX,
f'{name}-{task_uuid}.yaml')
@@ -686,10 +688,11 @@ def spot_launch(
output_path=yaml_path)
controller_task = task_lib.Task.from_yaml(yaml_path)
assert len(controller_task.resources) == 1, controller_task
# Backward compatibility: if the user changed the spot-controller.yaml.j2
# to customize the controller resources, we should use it.
# Backward compatibility: if the user changed the
# spot-controller.yaml.j2 to customize the controller resources,
# we should use it.
controller_task_resources = list(controller_task.resources)[0]
if not controller_task_resources.is_same_resources(sky.Resources()):
if not controller_task_resources.is_empty():
controller_resources = controller_task_resources
controller_task.set_resources(controller_resources)
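
The precedence in this hunk can be sketched in isolation. The example below is not the SkyPilot implementation: ``StandInResources`` is a hypothetical stand-in for ``sky.Resources``, its field list is abbreviated, and the value of ``ASSUMED_DEFAULT_DISK_SIZE_GB`` is an assumption made for illustration. It shows resources set in the controller template overriding those from the config file, and why ``is_empty()`` needs to check ``disk_size`` and ``image_id`` for that test to work.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption for illustration only; the real default lives in sky/resources.py.
ASSUMED_DEFAULT_DISK_SIZE_GB = 256


@dataclass(frozen=True)
class StandInResources:
    """Hypothetical, abbreviated stand-in for sky.Resources."""
    instance_type: Optional[str] = None
    accelerators: Optional[str] = None
    disk_size: int = ASSUMED_DEFAULT_DISK_SIZE_GB
    image_id: Optional[str] = None

    def is_empty(self) -> bool:
        # True only when no field differs from its default, mirroring the
        # checks added to is_empty() in this PR (disk_size and image_id).
        return all([
            self.instance_type is None,
            self.accelerators is None,
            self.disk_size == ASSUMED_DEFAULT_DISK_SIZE_GB,
            self.image_id is None,
        ])


def choose_controller_resources(from_config: StandInResources,
                                from_template: StandInResources
                               ) -> StandInResources:
    """Prefer resources set in the template (backward compatibility)."""
    if not from_template.is_empty():
        return from_template
    return from_config


if __name__ == '__main__':
    config = StandInResources(disk_size=100)
    untouched_template = StandInResources()
    edited_template = StandInResources(disk_size=50)
    # An untouched template falls through to the config file's resources.
    print(choose_controller_resources(config, untouched_template))
    # A template that only changes disk_size still takes precedence.
    print(choose_controller_resources(config, edited_template))
```

If ``is_empty()`` ignored ``disk_size``, a template that only changed the disk size would be treated as untouched and silently replaced by the config-file resources.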

5 changes: 2 additions & 3 deletions sky/resources.py
@@ -799,9 +799,6 @@ def less_demanding_than(self,
# self <= other
return True

def is_same_resources(self, other: 'Resources') -> bool:
return self.to_yaml_config() == other.to_yaml_config()

def should_be_blocked_by(self, blocked: 'Resources') -> bool:
"""Whether this Resources matches the blocked Resources.

@@ -834,6 +831,8 @@ def is_empty(self) -> bool:
self.accelerators is None,
self.accelerator_args is None,
not self._use_spot_specified,
self.disk_size == _DEFAULT_DISK_SIZE_GB,
self._image_id is None,
])

def copy(self, **override) -> 'Resources':
4 changes: 4 additions & 0 deletions sky/spot/constants.py
@@ -13,4 +13,8 @@
SPOT_FM_LOCAL_TMP_DIR = 'skypilot-filemounts-files-{id}'
SPOT_FM_REMOTE_TMP_DIR = '/tmp/sky-spot-filemounts-files'

# It is now using default CPU instance type for spot controller, i.e.
# m6i.2xlarge (8vCPUs, 32 GB) for AWS, Standard_D8s_v4 (8vCPUs, 32 GB)
# for Azure, and n1-standard-8 (8 vCPUs, 32 GB) for GCP.
# We use 50 GB disk size to reduce the cost.
CONTROLLER_RESOURCES = {'disk_size': 50}
doc?