[Core][Autoscaler] Configure idleTimeoutSeconds per node type #48813
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
TODO: @ryanaoleary I'll update this PR with doc/API changes and comments containing my manual testing process.
Manual testing process
KubeRay:
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org> Signed-off-by: ryanaoleary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org> Signed-off-by: ryanaoleary <113500783+ryanaoleary@users.noreply.github.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Before merging this PR, would you mind:
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Autoscaler logs show available_node_types:
worker group with
worker group without
There was a CI error for a Ray Serve test but I think it's unrelated to this PR.
cc @rickyyx this PR looks good to me. Would you mind taking a look? Thanks!
LGTM with a nit.
@@ -128,6 +128,8 @@ class NodeTypeConfig:
    min_worker_nodes: int
    # The maximal number of worker nodes can be launched for this node type.
    max_worker_nodes: int
    # Idle timeout seconds for worker nodes of this node type.
    idle_timeout_s: Optional[float] = None
nit: should we enforce it as an integer with a cast when we add this? I see it is an int in the schema.
Or we could make this a float in the schema too. No preference either way.
I'll change it to a `number` type in the schema and then add a cast to float when we call `idle_timeout_s = group_spec.get(IDLE_SECONDS_KEY)`, since I implemented it as an int in the RayCluster CRD for consistency with the other field: https://github.com/ray-project/kuberay/blob/925effe34022c72c41691c0b79d8d3051d4a1b77/ray-operator/apis/ray/v1/raycluster_types.go#L94
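
For readers following the thread, the cast described above might look roughly like the sketch below. It is not the code from the PR: `get_idle_timeout_s` is a hypothetical helper, and the value of `IDLE_SECONDS_KEY` is assumed to match the `idleTimeoutSeconds` CRD field name; only `group_spec.get(IDLE_SECONDS_KEY)` comes from the comment itself.

```python
from typing import Any, Dict, Optional

# Assumption: the constant holds the RayCluster CRD field name discussed in this PR.
IDLE_SECONDS_KEY = "idleTimeoutSeconds"


def get_idle_timeout_s(group_spec: Dict[str, Any]) -> Optional[float]:
    """Hypothetical helper: read the per-group idle timeout and cast it to float.

    The CRD stores the value as an int, while the autoscaler config field is
    Optional[float], so the cast happens where the value is read.
    """
    idle_timeout_s = group_spec.get(IDLE_SECONDS_KEY)
    if idle_timeout_s is not None:
        idle_timeout_s = float(idle_timeout_s)
    return idle_timeout_s


# A worker group that sets the field, and one that leaves it unset.
with_timeout = get_idle_timeout_s({"groupName": "gpu-group", "idleTimeoutSeconds": 30})
without_timeout = get_idle_timeout_s({"groupName": "cpu-group"})
```

Leaving the value as `None` when the field is absent keeps "not configured" distinguishable from an explicit timeout of `0`.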
I ran the tests again and implemented this change in: 1bd8afb
Awesome, thanks for the great work!
@@ -1434,6 +1434,82 @@ def test_idle_termination_with_min_worker(min_workers):
    assert len(to_terminate) == 0


@pytest.mark.parametrize("node_type_idle_timeout_s", [1, 2, 10])
def test_idle_termination_with_node_type_idle_timeout(node_type_idle_timeout_s):
Nice!
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Why are these changes needed?
Adds `idle_timeout_s` as a field to `node_type_configs`, enabling the v2 autoscaler to configure idle termination per worker type.

This PR depends on a change in KubeRay to the RayCluster CRD, since we want to support passing `idleTimeoutSeconds` to individual worker groups such that they can specify a custom idle duration: ray-project/kuberay#2558

Related issue number
Closes #36888
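
As a rough illustration of the behaviour described under "Why are these changes needed?", the sketch below models a per-node-type idle timeout with a cluster-wide fallback. The `NodeTypeConfig` here is a simplified stand-in rather than the autoscaler's real class, the `name` field and `effective_idle_timeout_s` helper are hypothetical, and the fallback to a cluster-wide value when the field is unset is an assumption, not something this PR text confirms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeTypeConfig:
    # Simplified stand-in: only the fields relevant to idle termination.
    name: str
    min_worker_nodes: int
    max_worker_nodes: int
    # Idle timeout seconds for worker nodes of this node type.
    idle_timeout_s: Optional[float] = None


def effective_idle_timeout_s(
    node_type: NodeTypeConfig, cluster_idle_timeout_s: float
) -> float:
    """Hypothetical helper: prefer the per-type value, else the cluster-wide one.

    Assumption: an unset (None) per-type timeout falls back to the
    cluster-wide setting; an explicit 0 is honoured as-is.
    """
    if node_type.idle_timeout_s is not None:
        return node_type.idle_timeout_s
    return cluster_idle_timeout_s


# One worker group with a custom idle duration, one without.
gpu_group = NodeTypeConfig("gpu-group", 0, 4, idle_timeout_s=10.0)
cpu_group = NodeTypeConfig("cpu-group", 1, 8)

gpu_timeout = effective_idle_timeout_s(gpu_group, cluster_idle_timeout_s=60.0)
cpu_timeout = effective_idle_timeout_s(cpu_group, cluster_idle_timeout_s=60.0)
```

Under these assumptions, the group that sets `idle_timeout_s` is governed by its own value, while the group that omits it uses the cluster-wide timeout.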
Checks
- … (`git commit -s`) in this PR.
- … `scripts/format.sh` to lint the changes in this PR.
- … method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.