[ray integration] Initial Ray Integration with RayExecutor API #2218
Conversation
Force-pushed from b1b4e44 to 98a9ee7.
Looks good so far! A few things to help with getting CI to pass:
Hopefully, that should get everything working in CI.
test/test_cluster_ray.py (Outdated)
import os

import ray
This file probably needs a little bit of clean-up; it's written for multi-node setups, but we can (1) get the pytest infrastructure up and (2) make it work on single-node setups first.
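For reference, a minimal sketch of what a single-node pytest setup could look like; the `RayExecutor` import path, `create_settings`, constructor arguments, and the fixture name are assumptions based on this PR's proposed API, not final code.

```python
# Hypothetical single-node pytest sketch; the RayExecutor API shown here
# (create_settings, num_hosts/num_slots, run) is assumed from this PR.
import pytest
import ray

from horovod.ray import RayExecutor


@pytest.fixture
def ray_single_node():
    # Start a throwaway local Ray instance for each test.
    ray.init(num_cpus=4)
    yield
    ray.shutdown()


def test_single_node_smoke(ray_single_node):
    settings = RayExecutor.create_settings(timeout_s=30)
    executor = RayExecutor(settings, num_hosts=1, num_slots=2, use_gpu=False)
    executor.start()

    def rank_fn():
        import horovod.torch as hvd
        hvd.init()
        return hvd.rank()

    # Each worker returns its rank; one node with two slots should give 0 and 1.
    assert sorted(executor.run(rank_fn)) == [0, 1]
    executor.shutdown()
```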
Force-pushed from c67e51d to 649b174.
All actors will be part of the Horovod ring, so ``RayExecutor`` invocations will be able to support arbitrary Horovod collective operations.

Note that there is an implicit assumption that the cluster is homogeneous in shape (i.e., all machines have the same number of slots available). This is simply
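For illustration, a hedged usage sketch of how a collective might be invoked through ``RayExecutor``; the constructor arguments and `run` semantics are assumptions taken from this PR's description, not the merged interface.

```python
# Illustrative sketch only; argument names mirror this PR's proposed API
# and may differ from the final interface.
import ray
from horovod.ray import RayExecutor

ray.init(address="auto")

settings = RayExecutor.create_settings(timeout_s=30)
executor = RayExecutor(settings, num_hosts=2, num_slots=4, use_gpu=True)
executor.start()


def allreduce_fn():
    import torch
    import horovod.torch as hvd
    hvd.init()
    # Any Horovod collective can run inside the function executed on each actor.
    return hvd.allreduce(torch.ones(3), name="demo").tolist()


results = executor.run(allreduce_fn)  # one result per worker in the ring
executor.shutdown()
```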
Should be addressed following #2212. We could consider allowing hosts to be a list of host:slot in a follow-up PR.
Sounds good!
Force-pushed from 132ca8d to b9cf8bc.
Force-pushed from 1eba355 to c72db0b.
horovod/ray/runner.py (Outdated)
# colocated workers.
gpu_ids = ray.get_gpu_ids()
for worker, gpu_id in zip(self.workers, gpu_ids):
    worker.update_env_vars.remote({"CUDA_VISIBLE_DEVICES": gpu_id})
gpu_id is only one GPU, is that correct?
For NCCL, we need to avoid setting CUDA_VISIBLE_DEVICES. NCCL needs to have visibility of adjacent devices in order to use CUDA IPC. Typically, we handle GPU isolation at the framework level, for example:
torch.cuda.set_device(hvd.local_rank())
If you need to set CUDA_VISIBLE_DEVICES because multiple Ray applications may be using the same nodes, then we can set "CUDA_VISIBLE_DEVICES": gpu_ids so every worker can see the devices in its local communicator.
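To make the two options concrete, a rough sketch (not the PR's code; `update_env_vars` is the remote method from runner.py above, everything else is assumed):

```python
import horovod.torch as hvd
import ray
import torch


def setup_gpu_framework_level():
    # Option A: leave CUDA_VISIBLE_DEVICES untouched so NCCL can see adjacent
    # devices for CUDA IPC, and bind each rank to its GPU at the framework level.
    hvd.init()
    torch.cuda.set_device(hvd.local_rank())


def expose_local_gpus(workers):
    # Option B: if CUDA_VISIBLE_DEVICES must be set (e.g. several Ray apps share
    # the nodes), expose every GPU of the local communicator to each colocated
    # worker rather than a single device id.
    gpu_ids = ray.get_gpu_ids()
    visible = ",".join(str(i) for i in gpu_ids)
    for worker in workers:
        worker.update_env_vars.remote({"CUDA_VISIBLE_DEVICES": visible})
```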
Force-pushed from 5bc7d77 to 413185e.
LGTM! Will land once tests pass.
Super-exciting PR!
self.global_rendezv_port = self.rendezvous.start()
self.rendezvous.init(host_alloc_plan)
# remote_host_names = network.filter_local_addresses()
self.nics = driver_service.get_common_interfaces(
Will this driver_service require SSH access to each Ray node without a prompt?
You can pass in an identity file now. Otherwise you need passwordless ssh access.
Thank you @richardliaw for your timely responses.
However, is this "get_common_interfaces" strictly necessary? Also asking @tgaddair. And is it possible to set the "HOROVOD_GLOO_IFACE" variable separately on each node, using the NIC associated with the IP returned by ray.service.get_ip_address?
We have tried this at https://github.com/intel-analytics/analytics-zoo to avoid the SSH requirements and it works in our environment, but I am not sure this approach is strictly correct.
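A rough sketch of that alternative (the psutil-based NIC lookup and the Ray IP helper are assumptions for illustration, not code from this PR or analytics-zoo):

```python
import os

import psutil
import ray


def nic_for_ip(ip):
    # Find the network interface whose address matches the node's Ray IP.
    for nic, addrs in psutil.net_if_addrs().items():
        if any(addr.address == ip for addr in addrs):
            return nic
    return None


def set_gloo_iface():
    # Run on each node (e.g. inside a Ray actor) to point Gloo at the right NIC
    # without any SSH-based interface discovery.
    ip = ray.util.get_node_ip_address()
    nic = nic_for_ip(ip)
    if nic is not None:
        os.environ["HOROVOD_GLOO_IFACE"] = nic
```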
@yangw1234 can you make a new issue for this? It is hard to track comments on a closed PR! Thanks!
Sure. Added a new issue. #2271
Checklist before submitting
Description
Initial PR to introduce a Ray runner for Horovod. The interface is currently specific to using Actors (stateful operators).
Review process to land