
[ray integration] Initial Ray Integration with RayExecutor API #2218

Merged
merged 20 commits into horovod:master on Sep 1, 2020

Conversation

@richardliaw (Collaborator) commented Aug 30, 2020

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Initial PR to introduce a Ray runner for Horovod. The interface is currently specific to using Actors (stateful operators).
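
For readers unfamiliar with Ray, an Actor is a stateful remote worker process. A minimal, hypothetical sketch (not code from this PR) of what a Ray actor looks like:

import ray

ray.init()

@ray.remote
class Counter:
    # A Ray actor: a stateful remote process. The Ray runner introduced in
    # this PR manages a group of such actors, roughly one per Horovod worker.
    def __init__(self):
        self.value = 0

    def step(self):
        self.value += 1
        return self.value

counter = Counter.remote()
print(ray.get(counter.step.remote()))  # -> 1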

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@richardliaw changed the title [ray integration] tests-and-kickoff → [ray integration] Initial Integration PR (Aug 30, 2020)
@tgaddair (Collaborator) commented:

Looks good so far! A few things to help with getting CI to pass:

  1. We should add a horovod[ray] extra in setup.py, similar to what we do for Spark here (see the sketch after this list).
  2. Install using extras [spark,ray] for both Dockerfile.test.cpu and Dockerfile.test.gpu.
  3. Exclude Ray tests from MPI configs in Buildkite (as this feature is Gloo-only) by adding them to the exclude_standalone_test variable.
  4. Make sure these tests run in "standalone" mode for Gloo, meaning we don't launch them with horovodrun, as shown here.
  5. Run BUILDKITE_PIPELINE_SLUG=SLUG BUILDKITE_BRANCH=BRANCH .buildkite/gen-pipeline.sh > test/data/expected_buildkite_pipeline.yaml to update the expected Buildkite pipeline definition.

Hopefully, that should get everything working in CI.
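
As a rough illustration of item 1 above, the extras change in setup.py might look like the sketch below; the dependency lists are assumptions, not the exact pins from this PR:

from setuptools import setup, find_packages

setup(
    name='horovod',
    version='0.0.0',  # placeholder version for this sketch
    packages=find_packages(),
    # Add a "ray" extra alongside the existing "spark" one so that users can
    # `pip install horovod[spark,ray]`; the CI Dockerfiles would then install
    # with both extras enabled.
    extras_require={
        'spark': ['pyspark'],
        'ray': ['ray'],
    },
)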

Comment on lines 9 to 11
import os

import ray
richardliaw (Collaborator, Author) commented:

This file probably needs a little bit of clean-up; it's written for multi-node setups, but we can 1) get the pytest infrastructure up and 2) make it work on single-node setups first.
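
A minimal sketch of the single-node pytest setup, assuming a local Ray instance is started and torn down around each test (the fixture name and CPU count are hypothetical):

import pytest
import ray

@pytest.fixture
def ray_start_2_cpus():
    # Start a local, single-node Ray instance for this test...
    address_info = ray.init(num_cpus=2)
    yield address_info
    # ...and tear it down afterwards so tests stay isolated.
    ray.shutdown()

def test_ray_is_up(ray_start_2_cpus):
    assert ray.is_initialized()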

richardliaw and others added 12 commits September 1, 2020 00:27
@richardliaw marked this pull request as ready for review September 1, 2020 07:28
richardliaw and others added 2 commits September 1, 2020 00:29

All actors will be part of the Horovod ring, so ``RayExecutor`` invocations will be able to support arbitrary Horovod collective operations.

Note that there is an implicit assumption on the cluster being homogeneous in shape (i.e., all machines have the same number of slots available). This is simply
A collaborator commented:

Should be addressed following #2212. We could consider allowing hosts to be a list of host:slot in a follow-up PR.

richardliaw (Collaborator, Author) replied:

Sounds good!
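
For reference, a sketch of how the RayExecutor described in the docs excerpt above might be driven. The create_settings helper and the constructor arguments (num_hosts, num_slots, use_gpu) are assumptions based on this discussion, not necessarily the exact API that landed:

import ray
from horovod.ray import RayExecutor

def training_fn():
    # Runs on every actor in the Horovod ring, so arbitrary Horovod
    # collective operations are available inside this function.
    import horovod.torch as hvd
    hvd.init()
    return hvd.rank()

ray.init()
settings = RayExecutor.create_settings(timeout_s=30)  # assumed helper
executor = RayExecutor(settings, num_hosts=1, num_slots=2, use_gpu=False)
executor.start()
print(executor.run(training_fn))  # e.g. [0, 1]
executor.shutdown()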

examples/tensorflow2_mnist_ray.py (review thread resolved, outdated)
examples/tensorflow2_mnist_ray.py (review thread resolved)
horovod/ray/runner.py (review thread resolved, outdated)
tgaddair and others added 3 commits September 1, 2020 09:53
horovod/ray/runner.py (review thread resolved, outdated)
# colocated workers.
gpu_ids = ray.get_gpu_ids()
for worker, gpu_id in zip(self.workers, gpu_ids):
    worker.update_env_vars.remote({"CUDA_VISIBLE_DEVICES": gpu_id})
A collaborator commented:

gpu_id is only one GPU, is that correct?

For NCCL, we need to avoid setting CUDA_VISIBLE_DEVICES. NCCL needs to have visibility of adjacent devices in order to use CUDA IPC. Typically, we handle GPU isolation at the framework level, for example:

torch.cuda.set_device(hvd.local_rank())

If you need to set CUDA_VISIBLE_DEVICES because multiple Ray applications may be using the same nodes, then we can set "CUDA_VISIBLE_DEVICES": gpu_ids so every worker can see the devices in its local communicator.
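
A hypothetical sketch of that suggestion inside a Ray worker: expose the full list of colocated GPU IDs (rather than a single one) so NCCL can use CUDA IPC, and pin each process to its GPU at the framework level:

import os

import ray
import torch
import horovod.torch as hvd

# Inside a Ray actor/task: all GPU ids assigned to this process. Exporting
# the full list keeps adjacent devices visible so NCCL can use CUDA IPC
# between colocated workers.
gpu_ids = ray.get_gpu_ids()
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)

# Per-process GPU isolation handled at the framework level instead.
hvd.init()
torch.cuda.set_device(hvd.local_rank())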

horovod/ray/runner.py (review thread resolved, outdated)
@tgaddair (Collaborator) left a review:

LGTM! Will land once tests pass.

@tgaddair changed the title [ray integration] Initial Integration PR → [ray integration] Initial Ray Integration with RayExecutor API (Sep 1, 2020)
@tgaddair merged commit eeca2c0 into horovod:master on Sep 1, 2020
@den-run-ai (Contributor) commented:

Super-exciting PR!

self.global_rendezv_port = self.rendezvous.start()
self.rendezvous.init(host_alloc_plan)
# remote_host_names = network.filter_local_addresses()
self.nics = driver_service.get_common_interfaces(
@yangw1234 commented Sep 13, 2020:

Will this driver_service require passwordless ssh access to each Ray node?

richardliaw (Collaborator, Author) replied:

You can pass in an identity file now. Otherwise you need passwordless ssh access.

@yangw1234 replied:

Thank you @richardliaw for your timely responses.

However, is this "get_common_interfaces" strictly necessary? Also asking @tgaddair. And is it possible to set the "HOROVOD_GLOO_IFACE" variable separately on each node, using the NIC associated with the IP returned by ray.service.get_ip_address?

We have tried this at https://github.com/intel-analytics/analytics-zoo to avoid the ssh requirement, and it works in our environment, but I am not sure this approach is strictly correct.
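
A sketch of that workaround: on each node, look up the NIC that owns the IP Ray reports for the node and point Gloo at it via HOROVOD_GLOO_IFACE, avoiding the ssh-based discovery. The psutil lookup and the Ray IP helper used here are assumptions:

import os

import psutil  # assumption: only used to map an IP back to its interface
import ray

def set_gloo_iface():
    # IP that Ray reports for this node (ray.util.get_node_ip_address on
    # recent Ray; older releases expose a similar helper under ray.services).
    node_ip = ray.util.get_node_ip_address()
    for iface, addrs in psutil.net_if_addrs().items():
        if any(addr.address == node_ip for addr in addrs):
            os.environ["HOROVOD_GLOO_IFACE"] = iface
            return iface
    raise RuntimeError("No network interface found for IP %s" % node_ip)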

richardliaw (Collaborator, Author) replied:

@yangw1234 can you make a new issue for this? It is hard to track comments on a closed PR! Thanks!

@yangw1234 replied:

Sure. Added a new issue. #2271
