This is actually with the demo from here (also here), but the same error also happens with RETURNN.
This is very likely not really RETURNN-related, maybe some hardware or network issue, but I am reporting it here anyway so that we can find the error later for reference.
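The demo script itself is not reproduced in this issue. For context, here is a minimal sketch of a comparable torchrun-launched DDP setup; only the `DistributedDataParallel(model, device_ids=[local_rank])` call (line 53 in the traceback below) is confirmed by this issue, the rest (model, init order) is an assumption:

```python
# Minimal sketch of a torchrun-launched DDP demo. Only the
# DistributedDataParallel(...) call is confirmed by the traceback;
# everything else is a placeholder, not the actual demo script.
import os
import torch
from torch.nn.parallel import DistributedDataParallel

torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).to(local_rank)  # placeholder model
ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
```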
nd20-01:275975:275975 [0] NCCL INFO Bootstrap : Using ib0:fe80::ba59:9f03:fc:765c%ib0<0>
nd20-01:275975:275975 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
nd20-01:275975:275975 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
nd20-01:275975:275975 [0] misc/socket.cc:379 NCCL WARN Call to bind failed : Cannot assign requested address
nd20-01:275975:275975 [0] NCCL INFO bootstrap.cc:176 -> 2
nd20-01:275975:275975 [0] NCCL INFO bootstrap.cc:201 -> 2
Traceback (most recent call last):
  File "/home/az668407/setups/combined/2021-05-31/tools/playground/torch-distributed-demo.py", line 53, in <module>
    ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
  File "/home/az668407/work/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 795, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/az668407/work/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1251, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
Call to bind failed : Cannot assign requested address
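The second traceback below is from rank 1: it fails with "Connection reset by peer" only because rank 0 (above) already crashed during NCCL bootstrap, as the error message itself suggests.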
Traceback (most recent call last):
  File "/home/az668407/setups/combined/2021-05-31/tools/playground/torch-distributed-demo.py", line 53, in <module>
    ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
  File "/home/az668407/work/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 795, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/az668407/work/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
[2023-11-29 18:07:07,192] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 275975) of binary: /home/az668407/work/py-envs/py3.10-torch2.1/bin/python3.10
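Note that the NCCL log above shows the bootstrap binding to a link-local IPv6 address on ib0 (fe80::...%ib0), which is where the bind fails. One possible diagnostic step, purely a sketch and not a verified fix for this cluster, is to force NCCL onto a specific interface before process-group init; `NCCL_SOCKET_IFNAME` and `NCCL_DEBUG` are standard NCCL environment variables, but the interface name `eth0` here is a hypothetical placeholder:

```python
import os

# Must be set before torch.distributed.init_process_group().
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # hypothetical; avoid the link-local ib0 address
os.environ["NCCL_DEBUG"] = "INFO"          # verbose NCCL diagnostics, as the error suggests
```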