[Train] Ray Train Constants on Multinode causes NCCL Error #30333
Comments
Running with torchrun as described in https://leimao.github.io/blog/PyTorch-Distributed-Training/ works.
I found the cause: I was using a wireless interface for distributed training, and wl is not included in the default whitelist. Is there a way to change this constant without directly modifying the installed library file?
The recommended way (according to this) seems to be something like the following (you must do this via Ray runtime environments so it propagates to the training nodes):

```python
runtime_env = {
    "env_vars": {
        "NCCL_SOCKET_IFNAME": "ens5",
        # "NCCL_DEBUG": "TRACE",
    }
}
ray.init(address="auto", runtime_env=runtime_env)
```

(This solution does not seem to work in my case...)
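As a side note (not part of the original comment): the right value for NCCL_SOCKET_IFNAME depends on which interface names actually exist on each node, and names such as ens5 differ per machine. A minimal sketch for inspecting them from the cluster is below; list_interfaces is a hypothetical helper, not something from this thread, and it only relies on the standard-library socket.if_nameindex() (Linux) plus a plain Ray task.

```python
# Sketch: probe worker nodes for their network interface names so you know
# what to put in NCCL_SOCKET_IFNAME. Not from the original thread.
import socket

import ray

@ray.remote
def list_interfaces():
    # Returns (hostname, [interface names]) for whichever node this task lands on.
    return socket.gethostname(), [name for _, name in socket.if_nameindex()]

ray.init(address="auto")
# Fire a handful of tasks; with enough of them, every node gets probed at least once.
results = ray.get([list_interfaces.remote() for _ in range(16)])
for host, names in {(h, tuple(n)) for h, n in results}:
    print(host, names)
```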
@0asa Thanks! Maybe you have a different network interface on each worker?
Hi @darwinharianto! Happy to help, you're welcome. And you helped me as well in solving my issue. It was indeed a matter of network interface names. I'm a bit surprised that Ray overrides the default NCCL_SOCKET_IFNAME.
@amogkam seems like the current sensible default may not be entirely sensible? cc @cadedaniel |
Closes #30333. Previously, we would set a default NCCL interface whitelist in Ray Train to prioritize Ethernet. This was to avoid this issue: anyscale/product#8310. However, this default whitelist is not fully exhaustive and prevents users from doing distributed GPU training over wireless: #30333. Instead, we change to a blacklist so that NCCL does not use veth interfaces, which resolves both issues (thanks @cadedaniel for identifying this!).

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
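For illustration only (this is not the actual Ray Train patch): NCCL itself supports a blacklist-style setting, since a leading "^" in NCCL_SOCKET_IFNAME tells NCCL to exclude interfaces with the listed prefixes and choose from whatever remains. A user-side sketch of that approach, applied through a Ray runtime environment, looks like this:

```python
# Sketch: blacklist virtual/loopback interfaces instead of whitelisting Ethernet,
# so NCCL can still pick a wireless interface such as wl* when that is all a node has.
import ray

runtime_env = {
    "env_vars": {
        # Leading "^" = exclude interfaces with these prefixes.
        "NCCL_SOCKET_IFNAME": "^lo,veth",
    }
}
ray.init(address="auto", runtime_env=runtime_env)
```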
What happened + What you expected to happen
Created a cluster from a YAML file for an on-prem cluster using Docker, and ran the PyTorch test script from the website.
Everything works fine without GPUs, but with GPUs an NCCL error occurs.
- GPU only works if I set --num_workers=1.
- Tried export NCCL_IB_DISABLE=1 and export NCCL_P2P_DISABLE=1; neither worked.
- Tried export NCCL_DEBUG=WARN, but I cannot see any logs printed on my console (see the sketch after this list).
- Works fine with multiple GPUs.
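Not part of the original report: when the training workers run on other nodes (and inside Docker), environment variables exported in the driver's shell do not automatically reach them, which may be why no NCCL output shows up. One way to forward these NCCL settings to every worker is a Ray runtime environment, as in this sketch:

```python
# Sketch: forward NCCL debug/tuning variables to all Ray Train workers via a
# runtime environment, since `export` on the head node's shell does not
# propagate to worker processes on other nodes.
import ray

runtime_env = {
    "env_vars": {
        "NCCL_DEBUG": "WARN",      # or "INFO"/"TRACE" for more detail
        "NCCL_IB_DISABLE": "1",
        "NCCL_P2P_DISABLE": "1",
    }
}
ray.init(address="auto", runtime_env=runtime_env)
# NCCL messages then appear in the worker logs/output rather than only on the
# driver console.
```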
Error stack:
The script runs normally until the end, then the NCCL error appears.
Versions / Dependencies
Versions and dependencies are the same as in the Ray latest-gpu Docker image.
OS: Ubuntu
Reproduction script
Made a cluster with the Ray cluster launcher; here is the YAML:
train script from https://docs.ray.io/en/latest/train/examples/torch_fashion_mnist_example.html
run command:
python torch_fashion_mnist_example.py --address=x.x.x.x:6379 --num-workers=5 --use-gpu
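For context, not from the original report: a rough sketch of how those flags map onto the Ray Train API in the linked example, assuming a Ray 2.x layout where TorchTrainer lives in ray.train.torch and ScalingConfig is importable from ray.train (older releases exposed it as ray.air.config.ScalingConfig). The real training loop is in the script at the URL above.

```python
# Sketch of the reproduction setup; the Fashion-MNIST training loop itself
# comes from the linked torch_fashion_mnist_example.py.
import ray
from ray.train import ScalingConfig          # older releases: ray.air.config.ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Fashion-MNIST training loop from the linked example goes here.
    ...

ray.init(address="x.x.x.x:6379")             # --address
trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=5, use_gpu=True),  # --num-workers, --use-gpu
)
result = trainer.fit()
```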
Issue Severity
High: It blocks me from completing my task.