[Train] Ray Train Constants on Multinode causes NCCL Error #30333

Closed

darwinharianto opened this issue Nov 16, 2022 · 6 comments · Fixed by #31824
Assignees: amogkam
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), train (Ray Train Related Issue)

Comments

@darwinharianto

What happened + What you expected to happen

  1. Created a cluster using a YAML file for an on-prem cluster with Docker, then ran the PyTorch test script from the website.
     Everything works fine without GPUs, but with GPUs an NCCL error occurs.
     GPU training only works if I set --num_workers=1.
     Tried export NCCL_IB_DISABLE=1 and export NCCL_P2P_DISABLE=1; neither worked.
     Tried export NCCL_DEBUG=WARN, but I cannot see any logs printed on my console.

  2. Expected training to work fine with multiple GPUs.

  3. Error stack:

| Trial name               |   # failures | error file                                                                                                      |
|--------------------------+--------------+-----------------------------------------------------------------------------------------------------------------|
| TorchTrainer_96bfe_00000 |            1 | /home/ray/ray_results/TorchTrainer_2022-11-15_19-20-15/TorchTrainer_96bfe_00000_0_2022-11-15_19-20-16/error.txt |
+--------------------------+--------------+-----------------------------------------------------------------------------------------------------------------+

2022-11-15 19:26:36,618	ERROR tune.py:773 -- Trials did not complete: [TorchTrainer_96bfe_00000]
2022-11-15 19:26:36,619	INFO tune.py:778 -- Total run time: 381.25 seconds (380.86 seconds for the tuning loop).
Traceback (most recent call last):
  File "torch_fashion_mnist_example.py", line 155, in <module>
    train_fashion_mnist(num_workers=args.num_workers, use_gpu=args.use_gpu)
  File "torch_fashion_mnist_example.py", line 130, in train_fashion_mnist
    result = trainer.fit()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/base_trainer.py", line 360, in fit
    raise result.error
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=1141, ip=172.21.36.155, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable/trainable.py", line 355, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=1174, ip=172.21.36.155, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fb2dd57d050>)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "torch_fashion_mnist_example.py", line 106, in train_func
    model = train.torch.prepare_model(model)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/torch/train_loop_utils.py", line 124, in prepare_model
    parallel_strategy_kwargs=parallel_strategy_kwargs,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/torch/train_loop_utils.py", line 365, in prepare_model
    model = DataParallel(model, **parallel_strategy_kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:845, internal error, NCCL version 2.7.8
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

The script runs normally until the end.

Versions / Dependencies

Versions and dependencies are the same as in the rayproject/ray-ml:latest-gpu image.
OS: Ubuntu

Reproduction script

I made a cluster with the Ray cluster launcher; here is the YAML:

cluster_name: default

docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    container_name: "ray_container"
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536
        - -p 8265:8265

provider:
    type: local
    head_ip: x.x.x.x
    worker_ips: [y.y.y.y, z.z.z.z, w.w.w.w, v.v.v.v]

auth:
    ssh_user: x
    ssh_private_key: ~/.ssh/y

min_workers: 4
max_workers: 4
upscaling_speed: 1.0

idle_timeout_minutes: 5

file_mounts: {
}

cluster_synced_files: []
file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []
# ray_cluster -f /home/doors/environment.yaml"]
setup_commands: []

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379

The training script is from https://docs.ray.io/en/latest/train/examples/torch_fashion_mnist_example.html

Run command: python torch_fashion_mnist_example.py --address=x.x.x.x:6379 --num-workers=5 --use-gpu

Issue Severity

High: It blocks me from completing my task.

@darwinharianto added labels: bug (Something that is supposed to be working; but isn't), triage (Needs triage: e.g. priority, bug/not-bug, and owning component) on Nov 16, 2022
@darwinharianto changed the title from "Ray Train" to "Ray Training on Distributed GPU with torch causes NCCL Error" on Nov 16, 2022
@darwinharianto
Author

Running with torchrun from https://leimao.github.io/blog/PyTorch-Distributed-Training/ works:

torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=x.x.x.x --master_port=1234 resnet_ddp.py --num_epochs=5
torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=x.x.x.x --master_port=1234 resnet_ddp.py --num_epochs=5

@darwinharianto
Author

darwinharianto commented Nov 17, 2022

I found the solution.
The NCCL_SOCKET_IFNAME for Ray Train is hardcoded here.

I was using a wireless interface while trying to do distributed training, and wl is not in the list.
I had to change DEFAULT_NCCL_SOCKET_IFNAME = "en,eth,bond" to DEFAULT_NCCL_SOCKET_IFNAME = "en,eth,bond,wl".

Is there a way to change this constant without directly changing the installed library file?

@darwinharianto changed the title from "Ray Training on Distributed GPU with torch causes NCCL Error" to "Ray Train Constants on Multinode causes NCCL Error" on Nov 17, 2022
@0asa

0asa commented Nov 17, 2022

The recommended way (according to this) seems to be something like the following (you must do this via Ray runtime environments so that it propagates to the training nodes):

import ray

runtime_env = {"env_vars": {
    "NCCL_SOCKET_IFNAME": "ens5",
    # "NCCL_DEBUG": "TRACE",
}}
ray.init(address="auto", runtime_env=runtime_env)

(this solution does not seem to work in my case...)

@darwinharianto
Author

darwinharianto commented Nov 18, 2022

@0asa Thanks!
It works. I can't believe I missed that section...

Maybe you have different network interfaces on each of your workers?
Can you share them?
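
For reference, a quick way to see which interface names are actually present on a node is the snippet below; this is a minimal sketch using only the Python standard library (Linux), run on each worker so the names can be compared:

import socket

# Print the network interface names visible on this node (Linux).
# Whatever NCCL_SOCKET_IFNAME ends up as must match a prefix of one of these names.
for index, name in socket.if_nameindex():
    print(index, name)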

@0asa

0asa commented Nov 18, 2022

Hi @darwinharianto!

Happy to help; you're welcome.

And you helped me as well in solving my issue. It was indeed a matter of network interface names.
I had to rename the interfaces (using nmcli) and everything went smoothly.

I'm a bit surprised that Ray overrides the default NCCL configuration with DEFAULT_NCCL_SOCKET_IFNAME.
Since NCCL_SOCKET_IFNAME can also be set statically (using /etc/nccl.conf or ~/nccl.conf), and those settings can be configured on each node separately if needed (in my case that would have been the ideal setup), I would suggest relying on those first.
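
For anyone taking that route, such a file just contains NCCL environment variables, one per line; a minimal sketch with illustrative values (the interface name will differ per node):

NCCL_SOCKET_IFNAME=ens5
NCCL_DEBUG=WARN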

@hora-anyscale added labels: P1 (Issue that should be fixed within a few weeks), train (Ray Train Related Issue), air on Nov 18, 2022
@justinvyu removed the triage label on Nov 18, 2022
@matthewdeng changed the title from "Ray Train Constants on Multinode causes NCCL Error" to "[Train] Ray Train Constants on Multinode causes NCCL Error" on Nov 21, 2022
@matthewdeng added the P2 (Important issue, but not time-critical) label and removed the P1 (Issue that should be fixed within a few weeks) label on Nov 21, 2022
@matthewdeng assigned amogkam and unassigned matthewdeng on Nov 21, 2022
@matthewdeng
Contributor

@amogkam seems like the current sensible default may not be entirely sensible?

cc @cadedaniel

amogkam added a commit that referenced this issue Jan 24, 2023

Closes #30333.

Previously, we would set a default NCCL interface whitelist in Ray Train to prioritize ethernet. This is to avoid this issue: anyscale/product#8310.

However, this default whitelist is not fully exhaustive, and prevents users from doing distributed GPU training over wireless: #30333.

Instead, we change to a blacklist so that NCCL does not use the veth interfaces, which resolves both issues (thanks @cadedaniel for identifying this!).

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
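
For context on the blacklist approach: NCCL_SOCKET_IFNAME accepts a leading ^ to exclude interface-name prefixes rather than whitelist them, so an override in that style looks roughly like the sketch below (the exact default Ray adopted is in #31824; "^veth" is only illustrative, excluding Docker's virtual ethernet pairs):

import os

# Blacklist style: a leading "^" tells NCCL to use any interface except those
# whose names start with the listed prefixes. Set this before NCCL initializes.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "^veth")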