[Train] Ray Train Constants on Multinode causes NCCL Error #30333

Closed

darwinharianto opened this issue Nov 16, 2022 · 6 comments · Fixed by #31824
Assignees: amogkam
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), train (Ray Train Related Issue)

Comments

@darwinharianto

What happened + What you expected to happen

  1. Created a cluster using a YAML file for an on-prem cluster with Docker, then ran the PyTorch test script from the website.
     Everything works fine without GPUs, but with GPUs an NCCL error occurs.
     GPU training only works if I set --num_workers=1.
     Tried export NCCL_IB_DISABLE=1 and export NCCL_P2P_DISABLE=1; neither worked.
     Tried export NCCL_DEBUG=WARN, but I cannot see any logs printed on my console.

  2. Expected training to work fine with multiple GPUs.

  3. Error stack:

| Trial name               |   # failures | error file                                                                                                      |
|--------------------------+--------------+-----------------------------------------------------------------------------------------------------------------|
| TorchTrainer_96bfe_00000 |            1 | /home/ray/ray_results/TorchTrainer_2022-11-15_19-20-15/TorchTrainer_96bfe_00000_0_2022-11-15_19-20-16/error.txt |
+--------------------------+--------------+-----------------------------------------------------------------------------------------------------------------+

2022-11-15 19:26:36,618	ERROR tune.py:773 -- Trials did not complete: [TorchTrainer_96bfe_00000]
2022-11-15 19:26:36,619	INFO tune.py:778 -- Total run time: 381.25 seconds (380.86 seconds for the tuning loop).
Traceback (most recent call last):
  File "torch_fashion_mnist_example.py", line 155, in <module>
    train_fashion_mnist(num_workers=args.num_workers, use_gpu=args.use_gpu)
  File "torch_fashion_mnist_example.py", line 130, in train_fashion_mnist
    result = trainer.fit()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/base_trainer.py", line 360, in fit
    raise result.error
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=1141, ip=172.21.36.155, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable/trainable.py", line 355, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=1174, ip=172.21.36.155, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fb2dd57d050>)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "torch_fashion_mnist_example.py", line 106, in train_func
    model = train.torch.prepare_model(model)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/torch/train_loop_utils.py", line 124, in prepare_model
    parallel_strategy_kwargs=parallel_strategy_kwargs,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/torch/train_loop_utils.py", line 365, in prepare_model
    model = DataParallel(model, **parallel_strategy_kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:845, internal error, NCCL version 2.7.8
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

The script runs normally until the end.

Versions / Dependencies

Versions and dependencies are the same as in the rayproject/ray-ml:latest-gpu image.
OS: Ubuntu

Reproduction script

I made a cluster with the Ray cluster launcher; here is the YAML:

cluster_name: default

docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    container_name: "ray_container"
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536
        - -p 8265:8265

provider:
    type: local
    head_ip: x.x.x.x
    worker_ips: [y.y.y.y, z.z.z.z, w.w.w.w, v.v.v.v]

auth:
    ssh_user: x
    ssh_private_key: ~/.ssh/y

min_workers: 4
max_workers: 4
upscaling_speed: 1.0

idle_timeout_minutes: 5

file_mounts: {
}

cluster_synced_files: []
file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []
# ray_cluster -f /home/doors/environment.yaml"]
setup_commands: []

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379

The training script is from https://docs.ray.io/en/latest/train/examples/torch_fashion_mnist_example.html

Run command: python torch_fashion_mnist_example.py --address=x.x.x.x:6379 --num-workers=5 --use-gpu

Issue Severity

High: It blocks me from completing my task.

@darwinharianto added labels: bug (Something that is supposed to be working; but isn't), triage (Needs triage: e.g. priority, bug/not-bug, and owning component) on Nov 16, 2022
@darwinharianto changed the title from "Ray Train" to "Ray Training on Distributed GPU with torch causes NCCL Error" on Nov 16, 2022
@darwinharianto
Author

Running with torchrun from https://leimao.github.io/blog/PyTorch-Distributed-Training/ works:

torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=x.x.x.x --master_port=1234 resnet_ddp.py --num_epochs=5
torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=x.x.x.x --master_port=1234 resnet_ddp.py --num_epochs=5

@darwinharianto
Author

darwinharianto commented Nov 17, 2022

I found the solution.
The NCCL_SOCKET_IFNAME for Ray Train is hardcoded here.

I was using a wireless interface while trying to do distributed training, and wl is not in the list.
I had to change DEFAULT_NCCL_SOCKET_IFNAME = "en,eth,bond" to DEFAULT_NCCL_SOCKET_IFNAME = "en,eth,bond,wl".

Is there a way to change this constant without directly changing the installed library file?

@darwinharianto changed the title from "Ray Training on Distributed GPU with torch causes NCCL Error" to "Ray Train Constants on Multinode causes NCCL Error" on Nov 17, 2022
@0asa

0asa commented Nov 17, 2022

The recommended way (according to this) seems to be something like the following (you must do this via Ray runtime environments so that it propagates to the training nodes):

import ray

runtime_env = {"env_vars": {
    "NCCL_SOCKET_IFNAME": "ens5",
    # "NCCL_DEBUG": "TRACE",
}}
ray.init(address="auto", runtime_env=runtime_env)

(this solution does not seem to work in my case...)

@darwinharianto
Author

darwinharianto commented Nov 18, 2022

@0asa Thanks!
It works. I can't believe I missed that section...

Maybe you have different network interfaces on each of your workers?
Can you share them?
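
For reference, a quick way to see which interface names are actually present on a node is the snippet below; this is a minimal sketch using only the Python standard library (Linux), run on each worker so the names can be compared:

import socket

# Print the network interface names visible on this node (Linux).
# Whatever NCCL_SOCKET_IFNAME ends up as must match a prefix of one of these names.
for index, name in socket.if_nameindex():
    print(index, name)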

@0asa

0asa commented Nov 18, 2022

Hi @darwinharianto!

Happy to help; you're welcome.

And you helped me as well in solving my issue. It was indeed a matter of network interface names.
I had to rename the interfaces (using nmcli) and everything went smoothly.

I'm a bit surprised that Ray overrides the default NCCL configuration with DEFAULT_NCCL_SOCKET_IFNAME.
Since NCCL_SOCKET_IFNAME can also be set statically (using /etc/nccl.conf or ~/nccl.conf), and those settings can be configured on each node separately if needed (in my case that would have been the ideal setup), I would suggest relying on those first.
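
For anyone taking that route, such a file just contains NCCL environment variables, one per line; a minimal sketch with illustrative values (the interface name will differ per node):

NCCL_SOCKET_IFNAME=ens5
NCCL_DEBUG=WARN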

@hora-anyscale added labels: P1 (Issue that should be fixed within a few weeks), train (Ray Train Related Issue), air on Nov 18, 2022
@justinvyu removed the triage label on Nov 18, 2022
@matthewdeng changed the title from "Ray Train Constants on Multinode causes NCCL Error" to "[Train] Ray Train Constants on Multinode causes NCCL Error" on Nov 21, 2022
@matthewdeng added the P2 (Important issue, but not time-critical) label and removed the P1 (Issue that should be fixed within a few weeks) label on Nov 21, 2022
@matthewdeng assigned amogkam and unassigned matthewdeng on Nov 21, 2022
@matthewdeng
Contributor

@amogkam seems like the current sensible default may not be entirely sensible?

cc @cadedaniel

amogkam added a commit that referenced this issue Jan 24, 2023

Closes #30333.

Previously, we would set a default NCCL interface whitelist in Ray Train to prioritize ethernet. This is to avoid this issue: anyscale/product#8310.

However, this default whitelist is not fully exhaustive, and prevents users from doing distributed GPU training over wireless: #30333.

Instead, we change to a blacklist so that NCCL does not use the veth interfaces, which resolves both issues (thanks @cadedaniel for identifying this!).

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
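
For context on the blacklist approach: NCCL_SOCKET_IFNAME accepts a leading ^ to exclude interface-name prefixes rather than whitelist them, so an override in that style looks roughly like the sketch below (the exact default Ray adopted is in #31824; "^veth" is only illustrative, excluding Docker's virtual ethernet pairs):

import os

# Blacklist style: a leading "^" tells NCCL to use any interface except those
# whose names start with the listed prefixes. Set this before NCCL initializes.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "^veth")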