
[Core] Upgrade ray to 2.9.3 to support python 3.11 #3248

Merged Feb 29, 2024 (35 commits)

Commits
4d5679a
Update ray to 2.9.3
Michaelvll Feb 28, 2024
d285f33
disable logging for ray
Michaelvll Feb 28, 2024
9b7d8c9
add comment
Michaelvll Feb 28, 2024
7fe5756
Merge branch 'master' of github.com:skypilot-org/skypilot into update…
Michaelvll Feb 28, 2024
2830311
Fix azure setup
Michaelvll Feb 28, 2024
d5e422b
Fix pydantic
Michaelvll Feb 28, 2024
cc0cb54
remove unused design doc
Michaelvll Feb 28, 2024
8acc37a
add num-gpus
Michaelvll Feb 28, 2024
79be961
fix typo
Michaelvll Feb 28, 2024
fa8f043
refactor ray installation
Michaelvll Feb 28, 2024
1a61cc2
Add backward compatibility
Michaelvll Feb 28, 2024
965fe30
add test for different docker images
Michaelvll Feb 28, 2024
ab3e1dd
Add ray status check back
Michaelvll Feb 28, 2024
44737fb
Add test for not restarting ray cluster
Michaelvll Feb 28, 2024
ecdd54f
Fix
Michaelvll Feb 28, 2024
b98915f
Fix smoke
Michaelvll Feb 29, 2024
729a794
Fix backward compat
Michaelvll Feb 29, 2024
01fee94
further fix of backward compat test
Michaelvll Feb 29, 2024
aa950c3
fix kubernetes memory check
Michaelvll Feb 29, 2024
c928a12
Fix backward compat
Michaelvll Feb 29, 2024
df002e5
fix source bashrc
Michaelvll Feb 29, 2024
0b64d68
fix fluidstack
Michaelvll Feb 29, 2024
b2356bc
revert
Michaelvll Feb 29, 2024
a60be81
backward compat fix
Michaelvll Feb 29, 2024
de44ba7
fix test for dynamic fallback
Michaelvll Feb 29, 2024
fdae819
fix
Michaelvll Feb 29, 2024
f13c84d
Address comments
Michaelvll Feb 29, 2024
f2c036f
fix
Michaelvll Feb 29, 2024
b4326f7
format
Michaelvll Feb 29, 2024
d30cfcf
address comments
Michaelvll Feb 29, 2024
264b3ed
fix
Michaelvll Feb 29, 2024
012b879
fix
Michaelvll Feb 29, 2024
af824c3
longer wait time for spot
Michaelvll Feb 29, 2024
ec2d8bc
cancel
Michaelvll Feb 29, 2024
d6d6e09
fix spot cancel
Michaelvll Feb 29, 2024
7 changes: 7 additions & 0 deletions sky/backends/backend_utils.py
@@ -887,6 +887,13 @@ def write_cluster_config(
# Conda setup
'conda_installation_commands':
constants.CONDA_INSTALLATION_COMMANDS,
# We should not use `.format`, as it contains '{}' as the bash
# syntax.
'ray_skypilot_installation_commands':
(constants.RAY_SKYPILOT_INSTALLATION_COMMANDS.replace(
'{sky_wheel_hash}',
wheel_hash).replace('{cloud}',
str(cloud).lower())),

# Port of Ray (GCS server).
# Ray's default port 6379 is conflicted with Redis.
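The inline comment above deserves a concrete illustration: the installation commands contain bash brace groups like `{ ...; }`, which `str.format` would try to parse as replacement fields. A minimal sketch of the pitfall (template contents are illustrative, not the real commands):

```python
# Bash brace groups clash with str.format's replacement-field syntax.
template = ('pip3 list | grep "ray " || { pip3 install "ray"; }; '
            'echo {sky_wheel_hash}')

try:
    template.format(sky_wheel_hash='abc123')
except (KeyError, ValueError, IndexError) as exc:
    print(f'str.format fails on the bash braces: {exc!r}')

# Targeted str.replace substitutes only the intended placeholder and
# leaves the bash braces untouched:
print(template.replace('{sky_wheel_hash}', 'abc123'))
```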
8 changes: 2 additions & 6 deletions sky/backends/cloud_vm_ray_backend.py
@@ -285,7 +285,6 @@ def get_or_fail(futures, pg) -> List[int]:
# next job can be scheduled on the released resources immediately.
ray_util.remove_placement_group(pg)
sys.stdout.flush()
sys.stderr.flush()
return returncodes

run_fn = None
@@ -372,14 +371,12 @@ def add_gang_scheduling_placement_group_and_setup(
message = {_CTRL_C_TIP_MESSAGE!r} + '\\n'
message += f'INFO: Waiting for task resources on {{node_str}}. This will block if the cluster is full.'
print(message,
file=sys.stderr,
flush=True)
# FIXME: This will print the error message from autoscaler if
# it is waiting for other task to finish. We should hide the
# error message.
ray.get(pg.ready())
print('INFO: All task resources reserved.',
file=sys.stderr,
flush=True)
""")
]
@@ -427,7 +424,6 @@ def add_gang_scheduling_placement_group_and_setup(
print('ERROR: {colorama.Fore.RED}Job {self.job_id}\\'s setup failed with '
'return code list:{colorama.Style.RESET_ALL}',
setup_returncodes,
file=sys.stderr,
flush=True)
# Need this to set the job status in ray job to be FAILED.
sys.exit(1)
@@ -623,7 +619,6 @@ def add_epilogue(self) -> None:
'return code list:{colorama.Style.RESET_ALL}',
returncodes,
reason,
file=sys.stderr,
Member:

Q: What's the reason we previously printed these error logs to stderr but now print them to stdout?

Collaborator Author (Michaelvll):

The reason we redirect the output to stdout is that the latest ray prints a bunch of useless output to stderr when the job is being cancelled, as shown in the PR description; we now redirect stderr to /dev/null to get rid of those outputs.

I don't think we had a particular reason to print these to stderr; it was just the convention of printing errors to stderr.

flush=True)
# Need this to set the job status in ray job to be FAILED.
sys.exit(1)
@@ -3139,7 +3134,8 @@ def _exec_code_on_head(
f'{cd} && ray job submit '
'--address=http://127.0.0.1:$RAY_DASHBOARD_PORT '
f'--submission-id {job_id}-$(whoami) --no-wait '
f'"{executable} -u {script_path} > {remote_log_path} 2>&1"')
# Redirect stderr to /dev/null to avoid distracting error from ray.
Member:

Do you mean

(sky-cmd, pid=4965) hi
(raylet) WARNING: 8 PYTHON worker processes have been started on node: 22029db2cf6bf02eadaf84cdd402beaef9a1795321880974f43b18ca with address: 172.31.84.137. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
*** SIGTERM received at time=1709110183 on cpu 0 ***
PC: @     0x7f22688eead8  (unknown)  pthread_cond_timedwait@@GLIBC_2.3.2
    @     0x7f22688f3140  (unknown)  (unknown)
    @ ... and at least 1 more frames
[2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361: *** SIGTERM received at time=1709110183 on cpu 0 ***
[2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361: PC: @     0x7f22688eead8  (unknown)  pthread_cond_timedwait@@GLIBC_2.3.2
[2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361:     @     0x7f22688f3140  (unknown)  (unknown)
[2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361:     @ ... and at least 1 more frames
Connection to localhost closed.

?

What if there are genuine errors being printed to stderr?

Collaborator Author (Michaelvll):

Yes, we are redirecting stderr to get rid of those messages.

f'"{executable} -u {script_path} > {remote_log_path} 2> /dev/null"')

mkdir_code = (f'{cd} && mkdir -p {remote_log_dir} && '
f'touch {remote_log_path}')
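For concreteness, a sketch of the assembled submission command (the paths and IDs below are hypothetical): the job's stdout is captured to the SkyPilot log file, while stderr is dropped to suppress ray's cancellation noise.

```python
# Hypothetical values, mirroring the diff above.
executable = 'python3'
script_path = '~/.sky/sky_app/sky_job_1.py'
remote_log_path = '~/sky_logs/sky-2024-02-29-12-00-00/run.log'
job_id = 1

submit_cmd = (
    'ray job submit --address=http://127.0.0.1:$RAY_DASHBOARD_PORT '
    f'--submission-id {job_id}-$(whoami) --no-wait '
    # stdout -> SkyPilot log; stderr -> /dev/null (drops ray's SIGTERM spam).
    f'"{executable} -u {script_path} > {remote_log_path} 2> /dev/null"')
print(submit_cmd)
```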
11 changes: 9 additions & 2 deletions sky/clouds/kubernetes.py
@@ -146,8 +146,15 @@ def get_default_instance_type(
# exactly the requested resources.
instance_cpus = float(
cpus.strip('+')) if cpus is not None else cls._DEFAULT_NUM_VCPUS
instance_mem = float(memory.strip('+')) if memory is not None else \
instance_cpus * cls._DEFAULT_MEMORY_CPU_RATIO
if memory is not None:
if memory.endswith('+'):
instance_mem = float(memory[:-1])
elif memory.endswith('x'):
instance_mem = float(memory[:-1]) * instance_cpus
else:
instance_mem = float(memory)
else:
instance_mem = instance_cpus * cls._DEFAULT_MEMORY_CPU_RATIO
virtual_instance_type = kubernetes_utils.KubernetesInstanceType(
instance_cpus, instance_mem).name
return virtual_instance_type
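The new branches distinguish three memory-string forms. A standalone sketch of the parsing, using a hypothetical stand-in value for `_DEFAULT_MEMORY_CPU_RATIO`:

```python
_DEFAULT_MEMORY_CPU_RATIO = 1  # hypothetical value, for illustration only

def parse_memory(memory, instance_cpus):
    """Mirrors the memory parsing in get_default_instance_type above."""
    if memory is None:
        return instance_cpus * _DEFAULT_MEMORY_CPU_RATIO
    if memory.endswith('+'):   # '8+': at least 8 GB
        return float(memory[:-1])
    if memory.endswith('x'):   # '4x': 4 GB per vCPU
        return float(memory[:-1]) * instance_cpus
    return float(memory)       # '8': exactly 8 GB

assert parse_memory('8', 4) == 8.0
assert parse_memory('8+', 4) == 8.0
assert parse_memory('4x', 4) == 16.0
assert parse_memory(None, 4) == 4.0
```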
2 changes: 1 addition & 1 deletion sky/clouds/oci.py
@@ -27,7 +27,7 @@

logger = logging.getLogger(__name__)

_tenancy_prefix = None
_tenancy_prefix: Optional[str] = None


@clouds.CLOUD_REGISTRY.register
45 changes: 0 additions & 45 deletions sky/design_docs/onprem-design.md

This file was deleted.

23 changes: 21 additions & 2 deletions sky/provision/instance_setup.py
@@ -2,6 +2,7 @@
from concurrent import futures
import functools
import hashlib
import json
import os
import resource
import time
@@ -13,6 +14,7 @@
from sky.provision import logging as provision_logging
from sky.provision import metadata_utils
from sky.skylet import constants
from sky.utils import accelerator_registry
from sky.utils import command_runner
from sky.utils import common_utils
from sky.utils import subprocess_utils
@@ -51,8 +53,7 @@
# Command that waits for the ray status to be initialized. Otherwise, a later
# `sky status -r` may fail due to the ray cluster not being ready.
RAY_HEAD_WAIT_INITIALIZED_COMMAND = (
f'while `RAY_ADDRESS=127.0.0.1:{constants.SKY_REMOTE_RAY_PORT} '
'ray status | grep -q "No cluster status."`; do '
f'while `{constants.RAY_STATUS} | grep -q "No cluster status."`; do '
'sleep 0.5; '
'echo "Waiting ray cluster to be initialized"; '
'done;')
@@ -214,6 +215,22 @@ def _setup_node(runner: command_runner.SSHCommandRunner,
ssh_credentials=ssh_credentials)


def _ray_gpu_options(custom_resource: str) -> str:
"""Return the GPU options for the ray start command.

For some cases (e.g., within docker container), we need to explicitly set
--num-gpus to have ray clusters recognize the schedulable GPUs.
"""
acc_dict = json.loads(custom_resource)
assert len(acc_dict) == 1, acc_dict
acc_name, acc_count = list(acc_dict.items())[0]
if accelerator_registry.is_schedulable_non_gpu_accelerator(acc_name):
return ''
# We need to manually set the number of GPUs, as it may not automatically
# detect the GPUs within the container.
return f' --num-gpus={acc_count}'


@_log_start_end
@_auto_retry
def start_ray_on_head_node(cluster_name: str, custom_resource: Optional[str],
@@ -239,6 +256,7 @@ def start_ray_on_head_node(cluster_name: str, custom_resource: Optional[str],
f'--temp-dir={constants.SKY_REMOTE_RAY_TEMPDIR}')
if custom_resource:
ray_options += f' --resources=\'{custom_resource}\''
ray_options += _ray_gpu_options(custom_resource)

if cluster_info.custom_ray_options:
if 'use_external_ip' in cluster_info.custom_ray_options:
@@ -313,6 +331,7 @@ def start_ray_on_worker_nodes(cluster_name: str, no_restart: bool,

if custom_resource:
ray_options += f' --resources=\'{custom_resource}\''
ray_options += _ray_gpu_options(custom_resource)

if cluster_info.custom_ray_options:
for key, value in cluster_info.custom_ray_options.items():
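Illustrative behavior of the new helper (the accelerator names below are examples, and whether a name counts as a schedulable non-GPU accelerator depends on `accelerator_registry`):

```python
# GPU accelerator: emit an explicit --num-gpus so ray inside a docker
# container can schedule the GPUs it may not auto-detect.
print(_ray_gpu_options('{"V100": 4}'))      # -> ' --num-gpus=4'

# A schedulable non-GPU accelerator (e.g., a TPU type) yields no flag:
print(_ray_gpu_options('{"tpu-v2-8": 1}'))  # -> '' (assumed non-GPU)
```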
12 changes: 5 additions & 7 deletions sky/setup_files/setup.py
@@ -169,7 +169,7 @@ def parse_readme(readme: str) -> str:
# click/grpcio/protobuf.
# Excluded 2.6.0 as it has a bug in the cluster launcher:
# https://github.com/ray-project/ray/releases/tag/ray-2.6.1
'ray[default] >= 2.2.0, <= 2.6.3, != 2.6.0',
'ray[default] >= 2.2.0, <= 2.9.3, != 2.6.0',
]

remote = [
@@ -183,13 +183,11 @@ def parse_readme(readme: str) -> str:
"grpcio >= 1.32.0, <= 1.51.3, != 1.48.0; python_version < '3.10' and sys_platform != 'darwin'", # noqa:E501
"grpcio >= 1.42.0, <= 1.51.3, != 1.48.0; python_version >= '3.10' and sys_platform != 'darwin'", # noqa:E501
# Adopted from ray's setup.py:
# https://github.com/ray-project/ray/blob/86fab1764e618215d8131e8e5068f0d493c77023/python/setup.py#L326
# https://github.com/ray-project/ray/blob/ray-2.9.3/python/setup.py#L343
'protobuf >= 3.15.3, != 3.19.5',
# Ray job has an issue with pydantic>2.0.0, due to API changes of pydantic. See
# https://github.com/ray-project/ray/issues/36990
# >=1.10.8 is needed for ray>=2.6. See
# https://github.com/ray-project/ray/issues/35661
'pydantic <2.0, >=1.10.8',
# Some pydantic versions are not compatible with ray. Adopted from ray's
# setup.py: https://github.com/ray-project/ray/blob/ray-2.9.3/python/setup.py#L254
'pydantic!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,<3',
]

# NOTE: Change the templates/spot-controller.yaml.j2 file if any of the
40 changes: 36 additions & 4 deletions sky/skylet/constants.py
@@ -1,4 +1,7 @@
"""Constants for SkyPilot."""
from packaging import version

import sky

SKY_LOGS_DIRECTORY = '~/sky_logs'
SKY_REMOTE_WORKDIR = '~/sky_workdir'
@@ -18,7 +21,7 @@
# i.e. the PORT_DICT_STR above.
SKY_REMOTE_RAY_PORT_FILE = '~/.sky/ray_port.json'
SKY_REMOTE_RAY_TEMPDIR = '/tmp/ray_skypilot'
SKY_REMOTE_RAY_VERSION = '2.4.0'
SKY_REMOTE_RAY_VERSION = '2.9.3'

# The name for the environment variable that stores the unique ID of the
# current task. This will stay the same across multiple recoveries of the
@@ -66,19 +69,48 @@
}

# Install conda on the remote cluster if it is not already installed.
# We do not install the latest conda with python 3.11 because ray has not
# officially supported it yet.
# We use conda with python 3.10 to be consistent across multiple clouds with
# best effort.
# https://github.com/ray-project/ray/issues/31606
# We use python 3.10 to be consistent with the python version of the
# AWS's Deep Learning AMI's default conda environment.
CONDA_INSTALLATION_COMMANDS = (
'which conda > /dev/null 2>&1 || '
'(wget -nc https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh -O Miniconda3-Linux-x86_64.sh && ' # pylint: disable=line-too-long
'(wget -nc https://repo.anaconda.com/miniconda/Miniconda3-py310_23.11.0-2-Linux-x86_64.sh -O Miniconda3-Linux-x86_64.sh && ' # pylint: disable=line-too-long
'bash Miniconda3-Linux-x86_64.sh -b && '
'eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && '
'conda config --set auto_activate_base true); '
'grep "# >>> conda initialize >>>" ~/.bashrc || conda init;')

_sky_version = str(version.parse(sky.__version__))
RAY_STATUS = f'RAY_ADDRESS=127.0.0.1:{SKY_REMOTE_RAY_PORT} ray status'
# Install ray and skypilot on the remote cluster if they are not already
# installed. {var} will be replaced with the actual value in
# backend_utils.write_cluster_config.
RAY_SKYPILOT_INSTALLATION_COMMANDS = (
'(type -a python | grep -q python3) || '
'echo \'alias python=python3\' >> ~/.bashrc;'
'(type -a pip | grep -q pip3) || echo \'alias pip=pip3\' >> ~/.bashrc;'
'mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app;'
'source ~/.bashrc;'
# Backward compatibility for ray upgrade (#3248): do not upgrade ray if the
# ray cluster is already running, to avoid the ray cluster being restarted.
# NOTE: this will only work for the cluster with ray cluster on our latest
# ray port 6380, but those existing cluster launched before #1790 that has
# ray cluster on the default port 6379 will be upgraded and restarted.
f'{RAY_STATUS} || {{ pip3 list | grep "ray " | '
Member:

Why do we need to do this conditional update now, but not before?

Looking at the previous .j2, it seems like we always pip installed the new ray -- but those new module files are not necessarily picked up by existing, live ray cluster processes? (My understanding of this previous behavior is fuzzy; please correct me if wrong.)

Collaborator Author (Michaelvll):

Because we kept ray==2.4.0 for a long time, and when we updated ray to 2.4.0 we assumed all users would be able to restart their clusters to adopt the new ray version.

Now, since some users may have existing clusters running ray 2.4.0, we have to avoid upgrading ray to stay backward compatible with those clusters, and to ensure sky exec and sky launch on them still work.

Member:

We need some comments here like, "We do this guard to avoid any Ray client-server version mismatch. Specifically: If existing ray cluster is an older version say 2.4, and we pip install new version say 2.9 wheels here, then subsequent sky exec (ray job submit) will have v2.9 vs. 2.4 mismatch, similarly this problem exists for sky status -r (ray status)."

Collaborator Author (Michaelvll):

Added the comment! Thanks!

f'grep {SKY_REMOTE_RAY_VERSION} 2>&1 > /dev/null || '
f'pip3 install --exists-action w -U ray[default]=={SKY_REMOTE_RAY_VERSION}; }};' # pylint: disable=line-too-long
'{ pip3 list | grep "skypilot " && '
'[ "$(cat ~/.sky/wheels/current_sky_wheel_hash)" == "{sky_wheel_hash}" ]; } || ' # pylint: disable=line-too-long
'{ pip3 uninstall skypilot -y; '
'pip3 install "$(echo ~/.sky/wheels/{sky_wheel_hash}/'
f'skypilot-{_sky_version}*.whl)[{{cloud}}, remote]" && '
'echo "{sky_wheel_hash}" > ~/.sky/wheels/current_sky_wheel_hash || '
'exit 1; }; '
f'{RAY_STATUS} || '
Member:

Do L93-97 apply to this conditional check as well?

Same question as above on why the conditional check is needed now.

Member:

Add a pointer to those comments too?

'python3 -c "from sky.skylet.ray_patches import patch; patch()" || exit 1;')

# The name for the environment variable that stores SkyPilot user hash, which
# is mainly used to make sure sky commands runs on a VM launched by SkyPilot
# will be recognized as the same user (e.g., spot controller or sky serve
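To make the guard discussed in the thread above concrete, here is a rough Python re-expression of the shell logic in `RAY_SKYPILOT_INSTALLATION_COMMANDS` (a sketch only; the authoritative logic is the one-liner in the diff, and the port below assumes `SKY_REMOTE_RAY_PORT` is 6380):

```python
import subprocess

SKY_REMOTE_RAY_VERSION = '2.9.3'
RAY_STATUS = 'RAY_ADDRESS=127.0.0.1:6380 ray status'

def maybe_install_ray() -> None:
    # If a ray cluster is already running, keep its ray version: upgrading
    # would create a client/server version mismatch for later
    # `ray job submit` (sky exec) and `ray status` (sky status -r) calls.
    if subprocess.run(RAY_STATUS, shell=True).returncode == 0:
        return
    installed = subprocess.run('pip3 list | grep "ray "', shell=True,
                               capture_output=True, text=True)
    if SKY_REMOTE_RAY_VERSION not in installed.stdout:
        subprocess.run('pip3 install --exists-action w -U '
                       f'ray[default]=={SKY_REMOTE_RAY_VERSION}',
                       shell=True, check=True)
```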
5 changes: 0 additions & 5 deletions sky/skylet/ray_patches/__init__.py
@@ -27,8 +27,6 @@
import os
import subprocess

import pkg_resources

from sky.skylet import constants


@@ -81,6 +79,3 @@ def patch() -> None:

from ray.autoscaler._private import updater
_run_patch(updater.__file__, _to_absolute('updater.py.patch'))

from ray.dashboard.modules.job import job_head
_run_patch(job_head.__file__, _to_absolute('job_head.py.patch'))
7 changes: 3 additions & 4 deletions sky/skylet/ray_patches/autoscaler.py.patch
@@ -1,9 +1,8 @@
0a1,4
> # From https://github.com/ray-project/ray/blob/ray-2.4.0/python/ray/autoscaler/_private/autoscaler.py
0a1,3
> # From https://github.com/ray-project/ray/blob/ray-2.9.3/python/ray/autoscaler/_private/autoscaler.py
> # Sky patch changes:
> # - enable upscaling_speed to be 0.0
>
1068c1072
1074c1077
< if upscaling_speed:
---
> if upscaling_speed is not None: # NOTE(sky): enable 0.0
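The patched condition works around a classic truthiness pitfall: `0.0` is falsy in Python, so `if upscaling_speed:` silently skips a legitimate upscaling speed of zero. A one-line illustration:

```python
upscaling_speed = 0.0
assert not upscaling_speed          # 0.0 is falsy: `if upscaling_speed:` skips it
assert upscaling_speed is not None  # the patched check still enters the branch
```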
6 changes: 2 additions & 4 deletions sky/skylet/ray_patches/cli.py.patch
@@ -1,11 +1,9 @@
0a1,4
> # Adapted from https://github.com/ray-project/ray/blob/ray-2.4.0/dashboard/modules/job/cli.py
> # Adapted from https://github.com/ray-project/ray/blob/ray-2.9.3/dashboard/modules/job/cli.py
> # Fixed the problem in ray's issue https://github.com/ray-project/ray/issues/26514
> # Otherwise, the output redirection ">" will not work.
>
4d7
< from subprocess import list2cmdline
212c215
273c277
< entrypoint=list2cmdline(entrypoint),
---
> entrypoint=" ".join(entrypoint),
2 changes: 1 addition & 1 deletion sky/skylet/ray_patches/command_runner.py.patch
@@ -1,5 +1,5 @@
0a1,2
> # From https://github.com/ray-project/ray/blob/ray-2.4.0/python/ray/autoscaler/_private/command_runner.py
> # From https://github.com/ray-project/ray/blob/ray-2.9.3/python/ray/autoscaler/_private/command_runner.py
>
140c142
< "ControlPersist": "10s",
8 changes: 0 additions & 8 deletions sky/skylet/ray_patches/job_head.py.patch

This file was deleted.

6 changes: 3 additions & 3 deletions sky/skylet/ray_patches/log_monitor.py.patch
@@ -1,10 +1,10 @@
0a1,4
> # Original file https://github.com/ray-project/ray/blob/ray-2.4.0/python/ray/_private/log_monitor.py
> # Original file https://github.com/ray-project/ray/blob/ray-2.9.3/python/ray/_private/log_monitor.py
> # Fixed the problem for progress bar, as the latest version does not preserve \r for progress bar.
> # We change the newline handling back to https://github.com/ray-project/ray/blob/ray-1.10.0/python/ray/_private/log_monitor.py#L299-L300
>
354c358,359
377c381,382
< next_line = next_line.rstrip("\r\n")
---
> if next_line[-1] == "\n":
> if next_line.endswith("\n"):
> next_line = next_line[:-1]
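Why this patch matters: `rstrip("\r\n")` also removes carriage returns, which in-place progress bars rely on to redraw a line, and `endswith` is safe on empty strings where indexing with `[-1]` would raise. An illustrative check:

```python
progress = '40%|####      | 40/100\r'  # in-place progress-bar update

# Upstream rstrip eats the carriage return, breaking the redraw:
assert progress.rstrip('\r\n') == '40%|####      | 40/100'

# The patched logic strips only a trailing newline, preserving '\r':
patched = progress[:-1] if progress.endswith('\n') else progress
assert patched == progress

# And endswith() is safe on empty lines, unlike indexing with [-1]:
assert not ''.endswith('\n')
```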
6 changes: 3 additions & 3 deletions sky/skylet/ray_patches/resource_demand_scheduler.py.patch
@@ -1,17 +1,17 @@
0a1,5
> # From https://github.com/ray-project/ray/blob/ray-2.4.0/python/ray/autoscaler/_private/resource_demand_scheduler.py
> # From https://github.com/ray-project/ray/blob/ray-2.9.3/python/ray/autoscaler/_private/resource_demand_scheduler.py
> # Sky patch changes:
> # - no new nodes are allowed to be launched when the upscaling_speed is 0
> # - comment out "assert not unfulfilled": this seems a buggy assert
>
450c455,458
451c456,459
< if upper_bound > 0:
---
> # NOTE(sky): do not autoscale when upscaling speed is 0.
> if self.upscaling_speed == 0:
> upper_bound = 0
> if upper_bound >= 0:
594c602
595c603
< assert not unfulfilled
---
> # assert not unfulfilled # NOTE(sky): buggy assert.