[Bug] RuntimeError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout #285
Labels
bug
Describe the bug
The failure is intermittent: a Socket Timeout is raised at group.allreduce([tensor], opts).
if group in _world.pg_coalesce_state.keys():
    # We are in coalescing context, do not issue single operation, just append a collective representation
    coll = _CollOp(all_reduce, tensor, None, op, None)
    _world.pg_coalesce_state[group].append(coll)
    if async_op:
        return _IllegalWork()
    else:
        return None
E RuntimeError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
E Exception raised from doWait at ../torch/csrc/distributed/c10d/TCPStore.cpp:445 (most recent call first):
More information: https://github.com/InternLM/InternEvo/actions/runs/10012995770/job/27889229378
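To make the failure mode in the error message easier to reason about, here is a minimal toy model of it, using only the Python standard library. It is not c10d's TCPStore and does not use NCCL; ToyStore, run, and the fake ID value are hypothetical names made up for illustration. The assumption is simply that one rank publishes an ID under key '0' and another rank blocks on get('0') with a timeout, so the get fails whenever the publisher is later than the timeout.

```python
import threading


class ToyStore:
    """In-memory stand-in for a key-value store with a blocking get.

    Hypothetical illustration only; this is not c10d's TCPStore.
    """

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        # Publish a value and wake up any waiting readers.
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def get(self, key, timeout):
        # Block until the key exists or the timeout (seconds) elapses.
        with self._cond:
            found = self._cond.wait_for(lambda: key in self._data, timeout=timeout)
            if not found:
                raise TimeoutError(f"get({key!r}) timed out after {timeout}s")
            return self._data[key]


def run(rank0_delay, get_timeout):
    """Simulate rank 0 publishing an ID after rank0_delay seconds while
    another rank waits at most get_timeout seconds for key '0'."""
    store = ToyStore()
    publisher = threading.Timer(rank0_delay, store.set, args=("0", b"fake-unique-id"))
    publisher.start()
    try:
        return store.get("0", timeout=get_timeout)
    except TimeoutError as exc:
        return str(exc)
    finally:
        # Stop the simulated rank 0 if it has not published yet, then clean up.
        publisher.cancel()
        publisher.join()


if __name__ == "__main__":
    print(run(rank0_delay=0.05, get_timeout=2.0))  # publisher is on time
    print(run(rank0_delay=0.5, get_timeout=0.05))  # publisher is too late
```

In this toy, the first call returns the published ID and the second returns the timeout message. If the real failure follows the same pattern, the intermittency would depend on how late rank 0 reaches the collective relative to the store timeout, but that is an assumption and has not been confirmed for this job.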
Environment
Python 3.10
PyTorch 2.1
Other information
No response