forked from ray-project/ray
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[spark] Fix Gloo detecting incorrect Interfaces on DBR (ray-project#4…
…2202)

When running distributed PyTorch without GPUs, PyTorch selects a localhost interface for Gloo (i.e. 127.0.0.1:XXX), which breaks distributed training. PyTorch's interface selection can yield the wrong interface when a) the hostname resolves locally to the loopback address or b) hostname lookups fail.

This fix is scoped to DBR specifically because eth0 is guaranteed to exist there. PyTorch+Gloo does not support deny-listing like NCCL (as we do in ray-project#31824) because PyTorch directly uses the environment variable GLOO_SOCKET_IFNAME as the interface to use: https://github.com/pytorch/pytorch/blob/7956ca16e649d86cbf11b6e122090fa05678fac3/torch/csrc/distributed/c10d/init.cpp#L2243

Signed-off-by: Ian Rodney <ian.rodney@gmail.com>
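The diff itself is not shown here, so the following is only a minimal sketch of the approach the message describes, not the code from this commit. It assumes DBR can be detected through the `DATABRICKS_RUNTIME_VERSION` environment variable. The helper names `is_loopback`, `hostname_resolves_to_loopback`, and `gloo_ifname_env` are hypothetical. The sketch does not check for GPUs and leaves any `GLOO_SOCKET_IFNAME` the user already set untouched.

```python
import ipaddress
import os
import socket


def is_loopback(addr: str) -> bool:
    """Return True if addr is a loopback address such as 127.0.0.1."""
    return ipaddress.ip_address(addr).is_loopback


def hostname_resolves_to_loopback() -> bool:
    """Check for the two failure cases named in the commit message:
    the hostname resolves to loopback, or the lookup fails outright."""
    try:
        addr = socket.gethostbyname(socket.gethostname())
    except OSError:
        # Lookup failure (case b) is treated as a problem too.
        return True
    return is_loopback(addr)


def gloo_ifname_env(environ, ifname: str = "eth0") -> dict:
    """Return the environment updates to apply before Gloo starts.

    Assumption: DATABRICKS_RUNTIME_VERSION is set on DBR. Outside DBR
    nothing is changed, because eth0 is only guaranteed to exist there.
    """
    if "DATABRICKS_RUNTIME_VERSION" not in environ:
        return {}
    if "GLOO_SOCKET_IFNAME" in environ:
        # Respect an interface the user picked explicitly.
        return {}
    return {"GLOO_SOCKET_IFNAME": ifname}


if __name__ == "__main__":
    # Must run before the Gloo process group is initialized.
    os.environ.update(gloo_ifname_env(os.environ))
    print("hostname resolves to loopback:", hostname_resolves_to_loopback())
    print("GLOO_SOCKET_IFNAME:", os.environ.get("GLOO_SOCKET_IFNAME"))
```

Because Gloo treats GLOO_SOCKET_IFNAME as the interface to use rather than as a deny-list, the sketch pins a single known-good interface instead of excluding the loopback one.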
1 parent 0f435d9, commit e243ed2
Showing 3 changed files with 20 additions and 0 deletions.