
[Core] Enable Scaling Down for Multi-Host TPU Replicas #43470

Merged · 57 commits into ray-project:master · Jul 3, 2024

Conversation

@ryanaoleary (Contributor) commented Feb 27, 2024

Why are these changes needed?

Adds support to the Ray autoscaler and KubeRay NodeProvider for scaling down TPU podslices. TPU podslices are atomic, so all Ray nodes belonging to a TPU podslice must be scaled down together. This PR associates each node with the replica (representing a podslice) of the TPU worker group it belongs to, using a replicaIndex Pod label that is set through a GKE webhook. When a TPU node is deleted, the other nodes in its replica (tracked through a mapping) are scheduled for deletion as well.
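For illustration, the scale-down behavior can be thought of as expanding any set of nodes selected for termination to cover every node sharing the same replicaIndex. The sketch below is hypothetical (the function name and data structures are not the actual BatchingNodeProvider API); it only demonstrates the idea, assuming each node's replicaIndex label value is known:

from collections import defaultdict
from typing import Dict, Set


def expand_to_whole_replicas(
    to_terminate: Set[str],
    node_replica_index: Dict[str, str],  # node id -> replicaIndex label value
) -> Set[str]:
    """Expand a terminate set so that whole TPU podslices scale down together."""
    # Group nodes by the replica (podslice) they belong to.
    replica_members = defaultdict(set)
    for node_id, replica in node_replica_index.items():
        replica_members[replica].add(node_id)

    expanded = set(to_terminate)
    for node_id in to_terminate:
        replica = node_replica_index.get(node_id)
        if replica is not None:
            # Schedule every other node in the same replica for deletion too.
            expanded |= replica_members[replica]
    return expanded

With logic of this shape, a terminate request for one multi-host worker results in all workers of that replica being deleted in the same batch.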

Related PR: #45105

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Release tests
    • This PR is not tested :(

@ryanaoleary force-pushed the autoscaling-changes branch 2 times, most recently from d4756ed to 99c2ff5, on April 5, 2024 22:00
@ryanaoleary marked this pull request as ready for review May 8, 2024
@ryanaoleary changed the title from "Multi-Host Replica Autoscaling Support" to "[Core] Enable Scaling Down for Multi-Host TPU Replicas" May 8, 2024
@anyscalesam added the kuberay ("Issues for the Ray/Kuberay integration that are tracked on the Ray side") and P1.5 ("Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared") labels May 29, 2024
@kevin85421 kevin85421 self-assigned this May 29, 2024
Review comments (now outdated and resolved) were left on:
  • python/ray/autoscaler/_private/autoscaler.py
  • python/ray/autoscaler/_private/kuberay/node_provider.py
  • python/ray/autoscaler/batching_node_provider.py
@kevin85421 (Member) commented:

This is on a critical code path. We should have more testing. Let's discuss it in today's sync.

@ryanaoleary (Contributor, Author) commented Jun 28, 2024

This PR was manually tested as follows:

Prerequisites:

  1. GKE cluster with TPU quota and Node Autoprovisioning enabled, or a v4 2x2x2 TPU node pool already created.
  2. Ray TPU initialization webhook installed in-cluster.
  3. KubeRay operator v1.1.1 installed in-cluster.

Testing:

  1. Build Ray from source and replace the autoscaler image in the RayCluster below with one containing these changes.
  2. Apply the autoscaler template and detached actor scripts, edited to include a TPU worker group and to request resources={"TPU": 4}, respectively:
  • Detached actor:

  import ray
  import sys

  @ray.remote(num_cpus=1, resources={"TPU": 4})
  class Actor:
      pass

  ray.init(namespace="default_namespace")
  Actor.options(name=sys.argv[1], lifetime="detached").remote()
  • TPU worker group:

  - replicas: 0
    minReplicas: 0
    maxReplicas: 2
    numOfHosts: 2
    groupName: tpu-group
    rayStartParams:
      resources: '"{\"TPU\": 4}"'
    ...
              requests:
                cpu: "1"
                ephemeral-storage: 10Gi
                google.com/tpu: "4"
                memory: 40G
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
            cloud.google.com/gke-tpu-topology: 2x2x2
  3. Scale up two TPU workers using detached actors with a resource request of "TPU: 4" each. The autoscaler will scale up 1 replica of the tpu-group worker group to meet this request, which will create 2 workers since numOfHosts: 2:
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor1
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor2

[Screenshot: kubectl get pods output]

  4. kubectl describe the worker Pods to verify they are created with the GKE-set replicaIndex label for multi-host workers (an example check is shown below):

    [Screenshot: replicaIndex label on the worker Pods]
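For example, assuming the webhook sets a Pod label with key replicaIndex and that KubeRay's ray.io/group label identifies the worker group (both assumptions, not shown in this PR), the labels can be listed with:

kubectl get pods -l ray.io/group=tpu-group -L replicaIndex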

  5. Delete one of the detached actors so that its node becomes idle:

kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/terminate_detached_actor.py actor1

[Screenshot: cluster state after one detached actor is terminated]
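The terminate_detached_actor.py script referenced above is not included in this comment; a minimal sketch, assuming it simply looks up the detached actor named by its first argument and kills it, could be:

import ray
import sys

ray.init(namespace="default_namespace")

# Fetch the detached actor by name and terminate it so its node becomes idle.
actor = ray.get_actor(sys.argv[1])
ray.kill(actor)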

  6. Once the node is marked as idle, the autoscaler should terminate it. The BatchingNodeProvider will detect the replicaIndex label on each node and scale down the other worker in the replica at the same time:
  • Both workers are deleted (there is still one detached actor alive requesting TPUs, so a new multi-host group is then scaled back up):

[Screenshot: both TPU worker Pods terminating]

  • Autoscaler logs:

[Screenshot: autoscaler logs showing the TPU scale-down]

ryanaoleary and others added 2 commits July 1, 2024 17:16
@ryanaoleary (Contributor, Author) replied to a reviewer's question:

"Could you share more details about the detached actor and add more details about why you expect the cluster to look like this at each step?"

Sure, I edited the comment to include more detail.

@kevin85421 (Member) commented:

@ryanaoleary could you also rebase your branch to fix the CI error? Thanks!

ryanaoleary and others added 2 commits July 2, 2024 00:47
@kevin85421 added the go label ("add ONLY when ready to merge, run all tests") Jul 2, 2024
@kevin85421 (Member) commented:

@can-anyscale could you retry the failed test? It is unrelated to this PR. Thanks!

@kevin85421 (Member) commented:

The RLlib tests fail after retry, but I don't think that is related to this PR because this PR is only for KubeRay. cc @jjyao @can-anyscale

@ryanaoleary requested a review from jjyao July 2, 2024 22:59
@jjyao merged commit 2abca38 into ray-project:master Jul 3, 2024
6 checks passed
Labels
go ("add ONLY when ready to merge, run all tests"), kuberay ("Issues for the Ray/Kuberay integration that are tracked on the Ray side"), P1.5 ("Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared")
6 participants