
[Core][Autoscaler] Configure idleTimeoutSeconds per node type #48813

Merged

Conversation

ryanaoleary (Contributor) commented Nov 20, 2024

Why are these changes needed?

Adds idle_timeout_s as a field to node_type_configs, enabling the v2 autoscaler to configure idle termination per worker type.

This PR depends on a change to the RayCluster CRD in KubeRay, since we want to support passing idleTimeoutSeconds to individual worker groups so that each group can specify a custom idle duration: ray-project/kuberay#2558
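As a rough illustration of the shape this adds, here is a hypothetical per-node-type config with the new field (keys mirror the available_node_types dump shown later in this thread; this is illustrative only, not the autoscaler's internal representation):

# Illustrative only: one worker type overrides the cluster-wide idle timeout,
# the other inherits the autoscaler-level idleTimeoutSeconds.
node_type_configs = {
    "small-group": {
        "resources": {"CPU": 1},
        "min_workers": 0,
        "max_workers": 10,
        "idle_timeout_s": 30,  # new field added by this PR
    },
    "small-group-no-timeout": {
        "resources": {"CPU": 1},
        "min_workers": 0,
        "max_workers": 10,
        # no idle_timeout_s -> falls back to the global idle timeout
    },
}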

Related issue number

Closes #36888

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
ryanaoleary (Contributor, Author) commented:

TODO (@ryanaoleary): I'll update this PR with doc/API changes and add comments describing my manual testing process.

@ryanaoleary ryanaoleary changed the title Add idleTimeoutSeconds initial commit [Core][Autoscaler] Add idleTimeoutSeconds initial commit Nov 20, 2024
@ryanaoleary ryanaoleary changed the title [Core][Autoscaler] Add idleTimeoutSeconds initial commit [Core][Autoscaler] Add idleTimeoutSeconds to node_type_configs Nov 20, 2024
@ryanaoleary ryanaoleary changed the title [Core][Autoscaler] Add idleTimeoutSeconds to node_type_configs [Core][Autoscaler] Configure idleTimeoutSeconds per node type Nov 20, 2024
@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Nov 20, 2024
ryanaoleary (Contributor, Author) commented:

Manual testing process

KubeRay:

  1. Built a Docker image of Ray with these changes.
  2. Followed https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md to build a custom KubeRay image and install the CRDs with the workerGroupSpec.idleTimeoutSeconds field.
  3. Deployed a KubeRay autoscaling cluster with the Ray dev image:
...
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default
    idleTimeoutSeconds: 60
...
workerGroupSpecs:
  - replicas: 1
    minReplicas: 0
    maxReplicas: 10
    idleTimeoutSeconds: 30 # Only used by v2 Autoscaler, overrides value in autoscaler config
    groupName: small-group
...
  - replicas: 1
    minReplicas: 0
    maxReplicas: 10
    groupName: small-group-no-timeout
  4. The worker group with the custom timeout set is terminated after 30 seconds:
2024-11-20 15:43:59,532 INFO cloud_provider.py:448 -- Listing pods for RayCluster raycluster-autoscaler in namespace default at pods resource version >= 15657367.
2024-11-20 15:43:59,540 - INFO - Fetched pod data at resource version 15657731.
2024-11-20 15:43:59,540 INFO cloud_provider.py:466 -- Fetched pod data at resource version 15657731.
2024-11-20 15:43:59,542 - INFO - Removing 1 nodes of type small-group (idle).
2024-11-20 15:43:59,542 INFO event_logger.py:76 -- Removing 1 nodes of type small-group (idle).
2024-11-20 15:43:59,543 - INFO - Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=b9107340-b4d1-4422-91cd-f143444df506, type=small-group, cloud_instance_id=raycluster-autoscaler-small-group-worker-r4rcd, ray_id=0e76fc9048dcfbb487a7253e9c8db2daf9356293072a244f2b134a0c): draining ray: idle for 32.488 secs > timeout=30 secs
2024-11-20 15:43:59,543 INFO instance_manager.py:262 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=b9107340-b4d1-4422-91cd-f143444df506, type=small-group, cloud_instance_id=raycluster-autoscaler-small-group-worker-r4rcd, ray_id=0e76fc9048dcfbb487a7253e9c8db2daf9356293072a244f2b134a0c): draining ray: idle for 32.488 secs > timeout=30 secs
2024-11-20 15:43:59,545 - INFO - Drained ray on 0e76fc9048dcfbb487a7253e9c8db2daf9356293072a244f2b134a0c(success=True, msg=)
2024-11-20 15:43:59,545 INFO ray_stopper.py:116 -- Drained ray on 0e76fc9048dcfbb487a7253e9c8db2daf9356293072a244f2b134a0c(success=True, msg=)
  5. The worker group with the default idle timeout is terminated after 60 seconds:
2024-11-20 15:45:15,457 INFO cloud_provider.py:448 -- Listing pods for RayCluster raycluster-autoscaler in namespace default at pods resource version >= 15657367.
2024-11-20 15:45:15,466 - INFO - Fetched pod data at resource version 15658700.
2024-11-20 15:45:15,466 INFO cloud_provider.py:466 -- Fetched pod data at resource version 15658700.
2024-11-20 15:45:15,467 - INFO - Removing 1 nodes of type small-group-no-timeout (idle).
2024-11-20 15:45:15,467 INFO event_logger.py:76 -- Removing 1 nodes of type small-group-no-timeout (idle).
2024-11-20 15:45:15,468 - INFO - Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=0bf76730-5c8c-4b1e-bda5-09e1e109102a, type=small-group-no-timeout, cloud_instance_id=raycluster-autoscaler-small-group-no-timeout-worker-fkq5s, ray_id=269217f3353d079851578780c65ac1f5478c16ab46ce5a9afd0d1089): draining ray: idle for 61.197 secs > timeout=60.0 secs
2024-11-20 15:45:15,468 INFO instance_manager.py:262 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=0bf76730-5c8c-4b1e-bda5-09e1e109102a, type=small-group-no-timeout, cloud_instance_id=raycluster-autoscaler-small-group-no-timeout-worker-fkq5s, ray_id=269217f3353d079851578780c65ac1f5478c16ab46ce5a9afd0d1089): draining ray: idle for 61.197 secs > timeout=60.0 secs
2024-11-20 15:45:15,470 - INFO - Drained ray on 269217f3353d079851578780c65ac1f5478c16ab46ce5a9afd0d1089(success=True, msg=)
2024-11-20 15:45:15,470 INFO ray_stopper.py:116 -- Drained ray on 269217f3353d079851578780c65ac1f5478c16ab46ce5a9afd0d1089(success=True, msg=)
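To make the decision visible in these logs explicit, here is a minimal, self-contained sketch of the timeout resolution the v2 autoscaler effectively applies (function names are invented for illustration and are not taken from the Ray source):

from typing import Optional

def resolve_idle_timeout_s(
    node_type_idle_timeout_s: Optional[float], global_idle_timeout_s: float
) -> float:
    # Per-node-type idle_timeout_s wins; otherwise fall back to the
    # autoscaler-wide idleTimeoutSeconds.
    if node_type_idle_timeout_s is not None:
        return node_type_idle_timeout_s
    return global_idle_timeout_s

def should_terminate_idle(idle_duration_s: float, timeout_s: float) -> bool:
    # Matches the "idle for X secs > timeout=Y secs" condition in the logs above.
    return idle_duration_s > timeout_s

# small-group: idle 32.488 s > per-type timeout 30 s -> terminated
assert should_terminate_idle(32.488, resolve_idle_timeout_s(30.0, 60.0))
# small-group-no-timeout: idle 61.197 s > default 60 s -> terminated
assert should_terminate_idle(61.197, resolve_idle_timeout_s(None, 60.0))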

ryanaoleary and others added 3 commits November 21, 2024 02:29
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: ryanaoleary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: ryanaoleary <113500783+ryanaoleary@users.noreply.github.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
kevin85421 (Member) left a comment:


Nice!

nice

@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Nov 21, 2024
kevin85421 (Member) commented:

Before merging this PR, would you mind:

  1. manually testing it again,
  2. opening an issue in the KubeRay repo to add a test for this PR once the image is available, and
  3. adding tests in test_scheduler.py?

@kevin85421 kevin85421 removed the go add ONLY when ready to merge, run all tests label Nov 21, 2024
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
ryanaoleary (Contributor, Author) commented:

Before merging this PR, would you mind:

  1. manually testing it again,
  2. opening an issue in the KubeRay repo to add a test for this PR once the image is available, and
  3. adding tests in test_scheduler.py?
  1. Manual test after the changes (using the same YAML as above):

[Screenshot: autoscaler-test-pic]

The autoscaler logs show the parsed available_node_types:

available_node_types:
  head-group:
    max_workers: 0
    min_workers: 0
    node_config: {}
    resources:
      CPU: 1
      memory: 2000000000
  small-group:
    idle_timeout_s: 30
    max_workers: 10
    min_workers: 0
    node_config: {}
    resources:
      CPU: 1
      memory: 1000000000
  small-group-no-timeout:
    max_workers: 10
    min_workers: 0
    node_config: {}
    resources:
      CPU: 1
      memory: 1000000000

The worker group with idleTimeoutSeconds: 30 is scaled down:

2024-11-21 15:33:25,391 INFO cloud_provider.py:448 -- Listing pods for RayCluster raycluster-autoscaler in namespace default at pods resource version >= 16650820.
2024-11-21 15:33:25,399 - INFO - Fetched pod data at resource version 16651184.
2024-11-21 15:33:25,399 INFO cloud_provider.py:466 -- Fetched pod data at resource version 16651184.
2024-11-21 15:33:30,426 - INFO - Calculating hashes for file mounts and ray commands.
2024-11-21 15:33:30,426 INFO config.py:184 -- Calculating hashes for file mounts and ray commands.
2024-11-21 15:33:30,443 - INFO - Listing pods for RayCluster raycluster-autoscaler in namespace default at pods resource version >= 16650820.
2024-11-21 15:33:30,443 INFO cloud_provider.py:448 -- Listing pods for RayCluster raycluster-autoscaler in namespace default at pods resource version >= 16650820.
2024-11-21 15:33:30,451 - INFO - Fetched pod data at resource version 16651245.
2024-11-21 15:33:30,451 INFO cloud_provider.py:466 -- Fetched pod data at resource version 16651245.
2024-11-21 15:33:30,453 - INFO - Removing 1 nodes of type small-group (idle).
2024-11-21 15:33:30,453 INFO event_logger.py:76 -- Removing 1 nodes of type small-group (idle).
2024-11-21 15:33:30,453 - INFO - Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=9b435c68-a46c-451b-a28b-657e6835bda7, type=small-group, cloud_instance_id=raycluster-autoscaler-small-group-worker-mz85t, ray_id=c14f440c7029e1f788a133ef1d5b01b258211b6afb939d80689f6e71): draining ray: idle for 32.438 secs > timeout=30 secs
2024-11-21 15:33:30,453 INFO instance_manager.py:262 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=9b435c68-a46c-451b-a28b-657e6835bda7, type=small-group, cloud_instance_id=raycluster-autoscaler-small-group-worker-mz85t, ray_id=c14f440c7029e1f788a133ef1d5b01b258211b6afb939d80689f6e71): draining ray: idle for 32.438 secs > timeout=30 secs

The worker group without idleTimeoutSeconds set, defaulting to 60s, is scaled down:

2024-11-21 15:33:55,767 INFO cloud_provider.py:448 -- Listing pods for RayCluster raycluster-autoscaler in namespace default at pods resource version >= 16650820.
2024-11-21 15:33:55,774 - INFO - Fetched pod data at resource version 16651560.
2024-11-21 15:33:55,774 INFO cloud_provider.py:466 -- Fetched pod data at resource version 16651560.
2024-11-21 15:34:00,801 - INFO - Calculating hashes for file mounts and ray commands.
2024-11-21 15:34:00,801 INFO config.py:184 -- Calculating hashes for file mounts and ray commands.
2024-11-21 15:34:00,821 - INFO - Listing pods for RayCluster raycluster-autoscaler in namespace default at pods resource version >= 16650820.
2024-11-21 15:34:00,821 INFO cloud_provider.py:448 -- Listing pods for RayCluster raycluster-autoscaler in namespace default at pods resource version >= 16650820.
2024-11-21 15:34:00,830 - INFO - Fetched pod data at resource version 16651622.
2024-11-21 15:34:00,830 INFO cloud_provider.py:466 -- Fetched pod data at resource version 16651622.
2024-11-21 15:34:00,832 - INFO - Removing 1 nodes of type small-group-no-timeout (idle).
2024-11-21 15:34:00,832 INFO event_logger.py:76 -- Removing 1 nodes of type small-group-no-timeout (idle).
2024-11-21 15:34:00,832 - INFO - Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=77da1ec1-26c9-4ed2-992c-1aa160fe240e, type=small-group-no-timeout, cloud_instance_id=raycluster-autoscaler-small-group-no-timeout-worker-kbnnt, ray_id=f3e158ea294c2342b06893793d71e0e62f28999a96f6232c26ecf0f1): draining ray: idle for 62.951 secs > timeout=60.0 secs
2024-11-21 15:34:00,832 INFO instance_manager.py:262 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=77da1ec1-26c9-4ed2-992c-1aa160fe240e, type=small-group-no-timeout, cloud_instance_id=raycluster-autoscaler-small-group-no-timeout-worker-kbnnt, ray_id=f3e158ea294c2342b06893793d71e0e62f28999a96f6232c26ecf0f1): draining ray: idle for 62.951 secs > timeout=60.0 secs
  2. Opened [Feature] Add e2e Ray v2 Autoscaler Tests with KubeRay: kuberay#2561
  3. Test case added in a3e7210.

There was a CI error for a Ray Serve test, but I think it's unrelated to this PR.

ryanaoleary and others added 3 commits November 22, 2024 01:35
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Nov 22, 2024
kevin85421 (Member) commented:

cc @rickyyx this PR looks good to me. Would you mind taking a look? Thanks!

rickyyx (Contributor) left a comment:


LGTM with a nit.

@@ -128,6 +128,8 @@ class NodeTypeConfig:
min_worker_nodes: int
# The maximal number of worker nodes can be launched for this node type.
max_worker_nodes: int
# Idle timeout seconds for worker nodes of this node type.
idle_timeout_s: Optional[float] = None
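For a self-contained view of the field being added, here is a trimmed-down sketch of the dataclass (only the fields visible in this diff; the real class defines more):

from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeTypeConfig:
    # The minimal number of worker nodes to launch for this node type.
    min_worker_nodes: int
    # The maximal number of worker nodes that can be launched for this node type.
    max_worker_nodes: int
    # Idle timeout seconds for worker nodes of this node type.
    # None means the autoscaler-wide idle timeout applies.
    idle_timeout_s: Optional[float] = None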
rickyyx (Contributor) commented:

nit: should we enforce it as an integer with a cast when we add this? I see it being an int as part of the schema.

Or we could make it a float in the schema too. No preference either way.

ryanaoleary (Contributor, Author) commented:

I'll change it to a number type in the schema and then add a cast to float when we call idle_timeout_s = group_spec.get(IDLE_SECONDS_KEY), since I implemented it as an int in the RayCluster CRD for consistency with the other field: https://github.com/ray-project/kuberay/blob/925effe34022c72c41691c0b79d8d3051d4a1b77/ray-operator/apis/ray/v1/raycluster_types.go#L94
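A sketch of what that parsing step might look like (the key constant's value and the surrounding function are assumptions based on this comment, not the merged code):

from typing import Optional

IDLE_SECONDS_KEY = "idleTimeoutSeconds"  # assumed to match the CRD field name

def parse_idle_timeout_s(group_spec: dict) -> Optional[float]:
    raw = group_spec.get(IDLE_SECONDS_KEY)
    # The CRD field is an int; cast to float so the autoscaler schema and
    # NodeTypeConfig can treat it uniformly as a number.
    return float(raw) if raw is not None else None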

ryanaoleary (Contributor, Author) commented:

I ran the tests again and implemented this change in 1bd8afb.

rickyyx (Contributor) commented:

Awesome, thanks for the great work!

@@ -1434,6 +1434,82 @@ def test_idle_termination_with_min_worker(min_workers):
assert len(to_terminate) == 0


@pytest.mark.parametrize("node_type_idle_timeout_s", [1, 2, 10])
def test_idle_termination_with_node_type_idle_timeout(node_type_idle_timeout_s):
rickyyx (Contributor) commented:

Nice!

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@rickyyx rickyyx enabled auto-merge (squash) November 23, 2024 23:35
@rickyyx rickyyx enabled auto-merge (squash) November 23, 2024 23:37
@rickyyx rickyyx merged commit 4a29571 into ray-project:master Nov 24, 2024
6 checks passed
Labels: core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)

Linked issue: [core] worker-specific idleTimeoutSeconds