
[ray integration] Initial Ray Integration with RayExecutor API #2218

Merged
merged 20 commits into horovod:master on Sep 1, 2020

Conversation

@richardliaw (Collaborator) commented Aug 30, 2020

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Initial PR to introduce a Ray runner for Horovod. The interface is currently specific to using Actors (stateful operators).
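
For readers unfamiliar with Ray, an Actor is a stateful remote worker process. A minimal, hypothetical sketch (not code from this PR) of what a Ray actor looks like:

import ray

ray.init()

@ray.remote
class Counter:
    # A Ray actor: a stateful remote process. The Ray runner introduced in
    # this PR manages a group of such actors, roughly one per Horovod worker.
    def __init__(self):
        self.value = 0

    def step(self):
        self.value += 1
        return self.value

counter = Counter.remote()
print(ray.get(counter.step.remote()))  # -> 1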

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@richardliaw changed the title [ray integration] tests-and-kickoff → [ray integration] Initial Integration PR (Aug 30, 2020)
@tgaddair (Collaborator) commented:

Looks good so far! A few things to help with getting CI to pass:

  1. We should add a horovod[ray] extra in setup.py, similar to what we do for Spark here (see the sketch after this list).
  2. Install using extras [spark,ray] for both Dockerfile.test.cpu and Dockerfile.test.gpu.
  3. Exclude Ray tests from MPI configs in Buildkite (as this feature is Gloo-only) by adding them to the exclude_standalone_test variable.
  4. Make sure these tests run in "standalone" mode for Gloo, meaning we don't launch them with horovodrun, as shown here.
  5. Run BUILDKITE_PIPELINE_SLUG=SLUG BUILDKITE_BRANCH=BRANCH .buildkite/gen-pipeline.sh > test/data/expected_buildkite_pipeline.yaml to update the expected Buildkite pipeline definition.

Hopefully, that should get everything working in CI.
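
As a rough illustration of item 1 above, the extras change in setup.py might look like the sketch below; the dependency lists are assumptions, not the exact pins from this PR:

from setuptools import setup, find_packages

setup(
    name='horovod',
    version='0.0.0',  # placeholder version for this sketch
    packages=find_packages(),
    # Add a "ray" extra alongside the existing "spark" one so that users can
    # `pip install horovod[spark,ray]`; the CI Dockerfiles would then install
    # with both extras enabled.
    extras_require={
        'spark': ['pyspark'],
        'ray': ['ray'],
    },
)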

Comment on lines 9 to 11
import os

import ray
richardliaw (Collaborator, Author) commented:

This file probably needs a little bit of clean-up; it's written for multi-node setups, but we can 1) get the pytest infrastructure up and 2) make it work on single-node setups first.
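
A minimal sketch of the single-node pytest setup, assuming a local Ray instance is started and torn down around each test (the fixture name and CPU count are hypothetical):

import pytest
import ray

@pytest.fixture
def ray_start_2_cpus():
    # Start a local, single-node Ray instance for this test...
    address_info = ray.init(num_cpus=2)
    yield address_info
    # ...and tear it down afterwards so tests stay isolated.
    ray.shutdown()

def test_ray_is_up(ray_start_2_cpus):
    assert ray.is_initialized()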

richardliaw and others added 12 commits September 1, 2020 00:27
@richardliaw marked this pull request as ready for review September 1, 2020 07:28
richardliaw and others added 2 commits September 1, 2020 00:29

All actors will be part of the Horovod ring, so ``RayExecutor`` invocations will be able to support arbitrary Horovod collective operations.

Note that there is an implicit assumption on the cluster being homogeneous in shape (i.e., all machines have the same number of slots available). This is simply
A collaborator commented:

Should be addressed following #2212. We could consider allowing hosts to be a list of host:slot in a follow-up PR.

richardliaw (Collaborator, Author) replied:

Sounds good!
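
For reference, a sketch of how the RayExecutor described in the docs excerpt above might be driven. The create_settings helper and the constructor arguments (num_hosts, num_slots, use_gpu) are assumptions based on this discussion, not necessarily the exact API that landed:

import ray
from horovod.ray import RayExecutor

def training_fn():
    # Runs on every actor in the Horovod ring, so arbitrary Horovod
    # collective operations are available inside this function.
    import horovod.torch as hvd
    hvd.init()
    return hvd.rank()

ray.init()
settings = RayExecutor.create_settings(timeout_s=30)  # assumed helper
executor = RayExecutor(settings, num_hosts=1, num_slots=2, use_gpu=False)
executor.start()
print(executor.run(training_fn))  # e.g. [0, 1]
executor.shutdown()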

examples/tensorflow2_mnist_ray.py (review thread resolved, outdated)
examples/tensorflow2_mnist_ray.py (review thread resolved)
horovod/ray/runner.py (review thread resolved, outdated)
tgaddair and others added 3 commits September 1, 2020 09:53
horovod/ray/runner.py (review thread resolved, outdated)
# colocated workers.
gpu_ids = ray.get_gpu_ids()
for worker, gpu_id in zip(self.workers, gpu_ids):
    worker.update_env_vars.remote({"CUDA_VISIBLE_DEVICES": gpu_id})
A collaborator commented:

gpu_id is only one GPU, is that correct?

For NCCL, we need to avoid setting CUDA_VISIBLE_DEVICES. NCCL needs to have visibility of adjacent devices in order to use CUDA IPC. Typically, we handle GPU isolation at the framework level, for example:

torch.cuda.set_device(hvd.local_rank())

If you need to set CUDA_VISIBLE_DEVICES because multiple Ray applications may be using the same nodes, then we can set "CUDA_VISIBLE_DEVICES": gpu_ids so every worker can see the devices in its local communicator.
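
A hypothetical sketch of that suggestion inside a Ray worker: expose the full list of colocated GPU IDs (rather than a single one) so NCCL can use CUDA IPC, and pin each process to its GPU at the framework level:

import os

import ray
import torch
import horovod.torch as hvd

# Inside a Ray actor/task: all GPU ids assigned to this process. Exporting
# the full list keeps adjacent devices visible so NCCL can use CUDA IPC
# between colocated workers.
gpu_ids = ray.get_gpu_ids()
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)

# Per-process GPU isolation handled at the framework level instead.
hvd.init()
torch.cuda.set_device(hvd.local_rank())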

horovod/ray/runner.py (review thread resolved, outdated)
@tgaddair (Collaborator) left a review:

LGTM! Will land once tests pass.

@tgaddair changed the title [ray integration] Initial Integration PR → [ray integration] Initial Ray Integration with RayExecutor API (Sep 1, 2020)
@tgaddair merged commit eeca2c0 into horovod:master on Sep 1, 2020
@den-run-ai (Contributor) commented:

Super-exciting PR!

self.global_rendezv_port = self.rendezvous.start()
self.rendezvous.init(host_alloc_plan)
# remote_host_names = network.filter_local_addresses()
self.nics = driver_service.get_common_interfaces(
@yangw1234 commented Sep 13, 2020:

Will this driver_service require passwordless ssh access to each Ray node?

richardliaw (Collaborator, Author) replied:

You can pass in an identity file now. Otherwise you need passwordless ssh access.

@yangw1234 replied:

Thank you @richardliaw for your timely responses.

However, is this "get_common_interfaces" strictly necessary? Also asking @tgaddair. And is it possible to set the "HOROVOD_GLOO_IFACE" variable separately on each node, using the NIC associated with the IP returned by ray.service.get_ip_address?

We have tried this at https://github.com/intel-analytics/analytics-zoo to avoid the ssh requirement, and it works in our environment, but I am not sure this approach is strictly correct.
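
A sketch of that workaround: on each node, look up the NIC that owns the IP Ray reports for the node and point Gloo at it via HOROVOD_GLOO_IFACE, avoiding the ssh-based discovery. The psutil lookup and the Ray IP helper used here are assumptions:

import os

import psutil  # assumption: only used to map an IP back to its interface
import ray

def set_gloo_iface():
    # IP that Ray reports for this node (ray.util.get_node_ip_address on
    # recent Ray; older releases expose a similar helper under ray.services).
    node_ip = ray.util.get_node_ip_address()
    for iface, addrs in psutil.net_if_addrs().items():
        if any(addr.address == node_ip for addr in addrs):
            os.environ["HOROVOD_GLOO_IFACE"] = iface
            return iface
    raise RuntimeError("No network interface found for IP %s" % node_ip)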

richardliaw (Collaborator, Author) replied:

@yangw1234 can you make a new issue for this? It is hard to track comments on a closed PR! Thanks!

@yangw1234 replied:

Sure. Added a new issue. #2271
