[E socket.cpp:922] [c10d] The client socket has timed out after 1800s #19257
Unanswered
kfoynt asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment 1 reply
Hi,
I am trying to train a model using 2 GPUs on a single node under SLURM, but I am getting the following error:
```
[E socket.cpp:922] [c10d] The client socket has timed out after 1800s while trying to connect to (watgpu108, 19747).
Traceback (most recent call last):
  File "/u1/kfountou/two_iterations_of_compare_and_swap_seventh_attempt_LIGHTNING.py", line 681, in <module>
    main(args)
  File "/u1/kfountou/two_iterations_of_compare_and_swap_seventh_attempt_LIGHTNING.py", line 668, in main
    trainer.fit(lightning_model, my_dataset)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 947, in _run
    self.strategy.setup_environment()
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 147, in setup_environment
    self.setup_distributed()
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 198, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 290, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1141, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 241, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/u1/kfountou/.conda/envs/jupyter-server/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
    return TCPStore(
TimeoutError: The client socket has timed out after 1800s while trying to connect to (watgpu108, 19747).
srun: error: watgpu108: task 1: Exited with exit code 1
```
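For context, my understanding of the traceback is that the second rank is simply failing to open a TCP connection to the c10d store that rank 0 is supposed to host at MASTER_ADDR:MASTER_PORT (watgpu108:19747 above). Here is a minimal reachability check I can run from the failing task; the fallback hostname and port are just the values from the error message, and MASTER_ADDR / MASTER_PORT are the standard torch.distributed variables:

```python
# Quick reachability check for the c10d rendezvous endpoint.
# MASTER_ADDR / MASTER_PORT are the standard torch.distributed variables;
# the fallbacks below are only the values taken from the error message above.
import os
import socket

addr = os.environ.get("MASTER_ADDR", "watgpu108")
port = int(os.environ.get("MASTER_PORT", "19747"))

try:
    with socket.create_connection((addr, port), timeout=10):
        print(f"connected to {addr}:{port}")
except OSError as exc:
    print(f"could not reach {addr}:{port}: {exc}")
```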
Here is my sbatch file:
```
#SBATCH --nodes=1
#SBATCH --mem=96GB
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2

source activate jupyter-server

export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1

srun python two_iterations_of_compare_and_swap_seventh_attempt_LIGHTNING.py
```
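In case it is relevant, here is a small diagnostic sketch that could go at the very top of the training script to show what each srun task actually sees before Lightning sets up DDP. These are the standard SLURM / torch.distributed environment variables; any of them may be unset depending on the cluster configuration:

```python
# Diagnostic sketch: print the environment each srun task starts with.
# Standard SLURM / torch.distributed variable names; nothing script-specific.
import os

for key in (
    "SLURM_JOB_ID",
    "SLURM_PROCID",
    "SLURM_NTASKS",
    "SLURM_NODELIST",
    "MASTER_ADDR",
    "MASTER_PORT",
    "NODE_RANK",
    "LOCAL_RANK",
):
    print(f"{key}={os.environ.get(key)}", flush=True)
```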
and here is my Trainer:
```
trainer = pl.Trainer(precision="bf16-mixed", accelerator="gpu", devices=2, num_nodes=1, strategy='ddp', max_epochs=100000)
```
I am very stuck on this. I have been googling and trying potential solutions for hours, but I still get the same problem.
I tried changing the backend to gloo, but I get the same issue.
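For reference, one way to select the backend explicitly (and to change the rendezvous timeout) is to pass a DDPStrategy object instead of the 'ddp' string. This is only a sketch of that variant, assuming the unified lightning package that the traceback paths point to; the gloo backend and the 30-minute timeout (the current default, matching the 1800s in the error) are illustrative values, not a known fix:

```python
from datetime import timedelta

import lightning.pytorch as pl
from lightning.pytorch.strategies import DDPStrategy

# Same Trainer as above, but with the process-group backend and the
# rendezvous timeout spelled out explicitly. "gloo" and the 30-minute
# timeout are illustrative; the timeout equals the default 1800s.
trainer = pl.Trainer(
    precision="bf16-mixed",
    accelerator="gpu",
    devices=2,
    num_nodes=1,
    strategy=DDPStrategy(process_group_backend="gloo", timeout=timedelta(minutes=30)),
    max_epochs=100000,
)
```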
Any help would be greatly appreciated.
Comment:
Any news with this?
(1 reply)