Start Ray on the head and the worker nodes #305

Merged

DenisYay merged 27 commits into main from DenisYay-patch-1 on Jan 11, 2024

Conversation

@DenisYay (Contributor) commented Jan 4, 2024

The algorithm:

  • We connect to the existing Ray instance started in _start_server_cmds and initialized inside http_server.py. The instance listens on port 6379 (note: the SkyPilot Ray instance listens on port 6380).
  • We start the required number of Ray workers as part of the restart logic in cluster.py. The workers start on the number of nodes derived from cluster.ips (minus 1 for the head node).
  • We wait for all workers to join the Ray cluster in http_server.py, verified by a call to ray.nodes() (see the sketch after this list).
  • Once the above completes successfully, we proceed with the restart flow; otherwise an error is thrown.
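For illustration, a minimal sketch of the wait-for-workers step (the function name, timeout, and polling interval are illustrative rather than the actual http_server.py code, and it assumes ray.init() has already connected to the head instance):

import time

import ray

def wait_for_workers(expected_nodes: int, timeout_secs: int = 300):
    # Poll ray.nodes() until the expected number of nodes report Alive,
    # or fail if they don't all join within the timeout.
    deadline = time.time() + timeout_secs
    while True:
        alive = [node for node in ray.nodes() if node["Alive"]]
        if len(alive) >= expected_nodes:
            return alive
        if time.time() > deadline:
            raise TimeoutError(
                f"Only {len(alive)} of {expected_nodes} Ray nodes joined within {timeout_secs}s"
            )
        time.sleep(5)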

def _start_ray(self, host, master_host, n_hosts, ray_port):
    if host == master_host:
        # Head node
        if ray.is_initialized():
Contributor:

Checking is_initialized is too coarse, as it will check whether any Ray cluster is connected. Unfortunately, if we accidentally kill SkyPilot's Ray cluster, autostop breaks, and it's tricky to restart. What we ideally want is to check whether Ray has already started on the specified port (and/or maybe with our "runhouse" namespace?), and we may need to use a subprocess.run for that.
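For illustration, one possible shape of such a port-scoped check via subprocess (a sketch only, assuming ray status --address reports against the given GCS address; the helper name is hypothetical and this is not the code in this PR):

import subprocess

def ray_running_on_port(port: int, host: str = "127.0.0.1") -> bool:
    # Ask the Ray CLI whether a cluster answers at this specific GCS address;
    # a zero return code means a Ray cluster is reachable there.
    result = subprocess.run(
        ["ray", "status", "--address", f"{host}:{port}"],
        capture_output=True,
        timeout=30,
    )
    return result.returncode == 0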

Contributor Author:

Removed the current check.

We could run
nmap -sV --reason -p 6381 127.0.0.1

and check that we get

PORT STATE SERVICE REASON VERSION
6381/tcp open redis? syn-ack

Problem: we need to install nmap.

or

nc -vv -z 127.0.0.1 6381

Connection to domain_name 6381 port [tcp/*] succeeded!

But this result is too coarse imho (not Ray-specific).

Please let me know if you have a preference between the two, but I don't think it's a blocker for a first merge.
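For what it's worth, a pure-Python equivalent of the nc check that avoids installing nmap (the helper name is illustrative, and it is just as coarse as nc, i.e. not Ray-specific):

import socket

def port_open(host: str, port: int, timeout_secs: float = 3.0) -> bool:
    # Equivalent to nc -z host port: True if a TCP connection succeeds.
    try:
        with socket.create_connection((host, port), timeout=timeout_secs):
            return True
    except OSError:
        return False

# e.g. port_open("127.0.0.1", 6381)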

logger.info(
    f"There is a Ray cluster already running on the head node {master_host}. Shutting it down."
)
ray.shutdown()
Contributor:

Ditto the above about being too coarse. Also, unfortunately, there is no way to stop Ray on a single port only. That's why we use a pkill command matching the specific port today. It's not ideal, because Ray would otherwise tear down the resources more comprehensively, but it is what it is.
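For illustration, roughly the shape of the port-scoped teardown described here (a sketch only; the exact pkill pattern used in the codebase may differ, and the process match below is an assumption):

import subprocess

def kill_ray_on_port(ray_port: int) -> None:
    # Kill only Ray processes started with --port=<ray_port>, leaving
    # SkyPilot's Ray cluster on port 6380 untouched. The match pattern is
    # illustrative, not the exact command used today.
    subprocess.run(["pkill", "-f", f"ray.*--port={ray_port}"], check=False)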

Contributor Author:

Yep, removed the coarse code.

self._start_ray(host, master_host, n_hosts, self.DEFAULT_RAY_PORT)

# logger.info("🎉 All workers present and accounted for 🎉")
# logger.info(ray.cluster_resources())
Contributor:

Good reminder, we need a better cluster.state API.

Contributor Author:

+1

tests/fixtures/on_demand_cluster_fixtures.py (outdated comment, resolved)
# Worker node
self.run(
    commands=[
        f"sleep 10 && ray start --address={master_host}:{ray_port} --block",
Contributor:

We need to start the workers inside the runhouse namespace

Contributor Author:

Did you mean on the worker nodes? I'm not sure that ray start supports that. ray.init(address="ray://123.45.67.89:6381", namespace="runhouse") does, but I'm not sure that's the recommended way to connect to the cluster: https://docs.ray.io/en/latest/ray-core/configure.html#cluster-resources
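For illustration, the distinction being drawn here (a sketch; connect_as_driver is a hypothetical helper, and the ports shown are Ray's defaults rather than this PR's values):

import ray

def connect_as_driver(head_ip: str, client_port: int = 10001):
    # Driver-side connection through Ray Client, where a namespace can be set.
    # Ray Client listens on its own port (10001 by default), not the GCS port.
    return ray.init(address=f"ray://{head_ip}:{client_port}", namespace="runhouse")

# By contrast, joining a machine to the cluster as a worker node happens at the
# OS level via "ray start --address=<head_ip>:<gcs_port> --block", which has no
# namespace flag; namespaces group jobs and named actors, not nodes.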

Comment on lines 847 to 850
master_host = self.address
n_hosts = len(self.ips)
for host in self.ips:
    self._start_ray(host, master_host, n_hosts, self.DEFAULT_RAY_PORT)
Contributor:

I'm not sure we need to start on the head node; we already start Ray remotely on the head node above (also, that runs remotely, whereas this restart logic would start the head node locally, which I don't think is correct). I think we could just connect each of the workers to that, and it would be less disruptive to the existing server too.

Contributor Author:

Yes, I've updated the code to start remotely on the head node and remotely on the workers.

Without the explicit start on the head node, we have two Ray servers: SkyPilot's on port 6380 and the default one on port 6379.

Do you mean we should use the existing 6379 default one and connect the workers to it?

Contributor Author:

Based on an offline convo, we are reusing the existing Ray instance on port 6379.
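For illustration, what reusing the existing head-node instance looks like from the server process (a sketch; the namespace and flags shown are assumptions, not necessarily what http_server.py ends up doing):

import ray

# Attach to the Ray instance already running on the default port (6379) instead
# of starting a second head; "auto" resolves the local GCS address, and
# ignore_reinit_error makes repeated calls harmless.
ray.init(address="auto", ignore_reinit_error=True, namespace="runhouse")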

runhouse/resources/hardware/cluster.py (comment resolved)
@@ -787,6 +815,15 @@ def restart_server(
self.client.use_https = https_flag
self.client.cert_path = self.cert_config.cert_path

if restart_ray:
    # Restart Ray on the head node and each of the workers
    # TODO: kill ray on all nodes first. Need to think more of the
Contributor:

We should find out if it dies automatically on the workers when we kill it on the head or if we're really borking them when we pkill the head node.

Contributor Author:

Seems like 'RAY_HEAD_IP=127.0.0.1 RAY_HEAD_PORT=6379 ray stop' should stop the entire cluster, including workers, when run on the head node.

Contributor Author:

Nope, it doesn't seem to work.

Back to pkill for now.

The workers do not die automatically; they become orphaned. There seems to be a way to carefully send shutdown commands to them, but we need to experiment further. We can probably decouple it from the current PR?
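If it does get tackled (here or in a follow-up), one possible shape is a sketch like the following, reusing the same self.run pattern as the worker start; the method name and the node= argument are hypothetical, since cluster.run's real signature may differ:

def _stop_ray_on_workers(self):
    # Hypothetical sketch: run "ray stop" on each worker before tearing down
    # the head node, so workers are not left orphaned. The node= argument is
    # a hypothetical way of targeting a specific machine.
    worker_ips = [ip for ip in self.ips if ip != self.address]
    for ip in worker_ips:
        self.run(commands=["ray stop"], node=ip)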

@DenisYay (Contributor Author) commented Jan 10, 2024

Next to try: python3 -c "import ray; ray.init('ray://localhost:6379'); ray.shutdown()"
https://docs.ray.io/en/latest/ray-core/api/doc/ray.shutdown.html

If that doesn't work, maybe this is a more involved and explicit way:
https://stackoverflow.com/questions/69613739/how-to-kill-ray-tasks-when-the-driver-is-dead

runhouse/resources/hardware/cluster_factory.py (outdated comment, resolved)
tests/fixtures/on_demand_cluster_fixtures.py (outdated comment, resolved)
DenisYay merged commit bcf576b into main on Jan 11, 2024
5 of 8 checks passed
jlewitt1 deleted the DenisYay-patch-1 branch on January 13, 2024 at 21:28