
[Core] Enable Scaling Down for Multi-Host TPU Replicas #43470

Merged · 57 commits into ray-project:master · Jul 3, 2024

Conversation

@ryanaoleary (Contributor) commented Feb 27, 2024

Why are these changes needed?

Adds support to the Ray autoscaler and KubeRay NodeProvider for scaling down TPU podslices. TPU podslices are atomic, so all Ray nodes belonging to a TPU podslice must be scaled down together. This PR associates each node with the replica (representing a podslice) of the TPU worker group it belongs to, using a replicaIndex Pod label that is set through a GKE webhook. When a TPU node is deleted, the other nodes in its replica (tracked through a mapping) are scheduled for deletion as well.
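For illustration, the scale-down behavior can be thought of as expanding any set of nodes selected for termination to cover every node sharing the same replicaIndex. The sketch below is hypothetical (the function name and data structures are not the actual BatchingNodeProvider API); it only demonstrates the idea, assuming each node's replicaIndex label value is known:

from collections import defaultdict
from typing import Dict, Set


def expand_to_whole_replicas(
    to_terminate: Set[str],
    node_replica_index: Dict[str, str],  # node id -> replicaIndex label value
) -> Set[str]:
    """Expand a terminate set so that whole TPU podslices scale down together."""
    # Group nodes by the replica (podslice) they belong to.
    replica_members = defaultdict(set)
    for node_id, replica in node_replica_index.items():
        replica_members[replica].add(node_id)

    expanded = set(to_terminate)
    for node_id in to_terminate:
        replica = node_replica_index.get(node_id)
        if replica is not None:
            # Schedule every other node in the same replica for deletion too.
            expanded |= replica_members[replica]
    return expanded

With logic of this shape, a terminate request for one multi-host worker results in all workers of that replica being deleted in the same batch.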

Related PR: #45105

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Release tests
    • This PR is not tested :(

@ryanaoleary force-pushed the autoscaling-changes branch 2 times, most recently from d4756ed to 99c2ff5, on April 5, 2024 22:00
@ryanaoleary marked this pull request as ready for review May 8, 2024
@ryanaoleary changed the title from "Multi-Host Replica Autoscaling Support" to "[Core] Enable Scaling Down for Multi-Host TPU Replicas" May 8, 2024
@anyscalesam added the kuberay ("Issues for the Ray/Kuberay integration that are tracked on the Ray side") and P1.5 ("Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared") labels May 29, 2024
@kevin85421 kevin85421 self-assigned this May 29, 2024
Review comments (now outdated and resolved) were left on:
  • python/ray/autoscaler/_private/autoscaler.py
  • python/ray/autoscaler/_private/kuberay/node_provider.py
  • python/ray/autoscaler/batching_node_provider.py
@kevin85421 (Member) commented:

This is on a critical code path. We should have more testing. Let's discuss it in today's sync.

@ryanaoleary (Contributor, Author) commented Jun 28, 2024

This PR was manually tested as follows:

Prerequisites:

  1. GKE cluster with TPU quota and Node Autoprovisioning enabled, or a v4 2x2x2 TPU node pool already created.
  2. Ray TPU initialization webhook installed in-cluster.
  3. KubeRay operator v1.1.1 installed in-cluster.

Testing:

  1. Build Ray from source and replace the autoscaler image in the RayCluster below with one containing these changes.
  2. Apply the autoscaler template and detached actor scripts, edited to include a TPU worker group and to request resources={"TPU": 4}, respectively:
  • Detached actor:

  import ray
  import sys

  @ray.remote(num_cpus=1, resources={"TPU": 4})
  class Actor:
      pass

  ray.init(namespace="default_namespace")
  Actor.options(name=sys.argv[1], lifetime="detached").remote()
  • TPU worker group:

  - replicas: 0
    minReplicas: 0
    maxReplicas: 2
    numOfHosts: 2
    groupName: tpu-group
    rayStartParams:
      resources: '"{\"TPU\": 4}"'
    ...
              requests:
                cpu: "1"
                ephemeral-storage: 10Gi
                google.com/tpu: "4"
                memory: 40G
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
            cloud.google.com/gke-tpu-topology: 2x2x2
  3. Scale up two TPU workers using detached actors with a resource request of "TPU: 4" each. The autoscaler will scale up 1 replica of the tpu-group worker group to meet this request, which will create 2 workers since numOfHosts: 2:
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor1
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor2

[Screenshot: kubectl get pods output]

  4. kubectl describe the worker Pods to verify they are created with the GKE-set replicaIndex label for multi-host workers (an example check is shown below):

    [Screenshot: replicaIndex label on the worker Pods]
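For example, assuming the webhook sets a Pod label with key replicaIndex and that KubeRay's ray.io/group label identifies the worker group (both assumptions, not shown in this PR), the labels can be listed with:

kubectl get pods -l ray.io/group=tpu-group -L replicaIndex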

  5. Delete one of the detached actors so that its node becomes idle:

kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/terminate_detached_actor.py actor1

[Screenshot: cluster state after one detached actor is terminated]
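The terminate_detached_actor.py script referenced above is not included in this comment; a minimal sketch, assuming it simply looks up the detached actor named by its first argument and kills it, could be:

import ray
import sys

ray.init(namespace="default_namespace")

# Fetch the detached actor by name and terminate it so its node becomes idle.
actor = ray.get_actor(sys.argv[1])
ray.kill(actor)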

  6. Once the node is marked as idle, the autoscaler should terminate it. The BatchingNodeProvider will detect the replicaIndex label on each node and scale down the other worker in the replica at the same time:
  • Both workers are deleted (there is still one detached actor alive requesting TPUs, so a new multi-host group is then scaled back up):

[Screenshot: both TPU worker Pods terminating]

  • Autoscaler logs:

[Screenshot: autoscaler logs showing the TPU scale-down]

ryanaoleary and others added 2 commits July 1, 2024 17:16
@ryanaoleary (Contributor, Author) replied to a reviewer's question:

"Could you share more details about the detached actor and add more details about why you expect the cluster to look like this at each step?"

Sure, I edited the comment to include more detail.

@kevin85421 (Member) commented:

@ryanaoleary could you also rebase your branch to fix the CI error? Thanks!

ryanaoleary and others added 2 commits July 2, 2024 00:47
@kevin85421 added the go label ("add ONLY when ready to merge, run all tests") Jul 2, 2024
@kevin85421 (Member) commented:

@can-anyscale could you retry the failed test? It is unrelated to this PR. Thanks!

@kevin85421 (Member) commented:

The RLlib tests fail after retry, but I don't think that is related to this PR because this PR is only for KubeRay. cc @jjyao @can-anyscale

@ryanaoleary requested a review from jjyao July 2, 2024 22:59
@jjyao merged commit 2abca38 into ray-project:master Jul 3, 2024
6 checks passed
Labels
go ("add ONLY when ready to merge, run all tests"), kuberay ("Issues for the Ray/Kuberay integration that are tracked on the Ray side"), P1.5 ("Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared")
6 participants