I am trying to run two experiments on an 8 GPU machine. Each of them uses 4 GPUs. When I start the second experiment, I get the following error:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/pytorch_lightning/trainer/ddp_mixin.py", line 146, in ddp_train
    model.init_ddp_connection(self.proc_rank, self.world_size)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/pytorch_lightning/root_module/root_module.py", line 153, in init_ddp_connection
    dist.init_process_group('nccl', rank=proc_rank, world_size=world_size)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Address already in use
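The cause seems to be that both runs rendezvous on the same default master port, so the second run's `TCPStore` cannot bind to it. As a workaround for now, here is a minimal sketch of launching the second experiment on its own port and its own half of the GPUs; the port number and GPU split are assumptions for illustration, not values from the original setup:

```python
import os

# Set these before any DDP initialization runs. PyTorch's "env://"
# rendezvous reads MASTER_ADDR/MASTER_PORT from the environment, and
# the linked init_ddp_connection appears to prefer an existing
# MASTER_PORT over its built-in default.
os.environ['MASTER_PORT'] = '12911'             # any free port (assumed); the first run keeps the default
os.environ['CUDA_VISIBLE_DEVICES'] = '4,5,6,7'  # assumed split: second half of the 8 GPUs

# ... then construct the LightningModule and Trainer as usual ...
```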
https://github.com/williamFalcon/pytorch-lightning/blob/master/pytorch_lightning/root_module/root_module.py#L134
I see that I can set the master port manually, but wouldn't it be better to pick the default port randomly from a range?
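As one possible approach (a sketch of the suggestion, not the library's current behavior), the default could be a free ephemeral port assigned by the OS, used only when the user has not set `MASTER_PORT` explicitly:

```python
import os
import socket

def find_free_port() -> int:
    """Bind to port 0 so the OS assigns a free ephemeral port, then return it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

# Hypothetical default-port logic: respect an explicit MASTER_PORT,
# otherwise fall back to a randomly assigned free port.
os.environ.setdefault('MASTER_PORT', str(find_free_port()))
```

One caveat: in multi-node DDP every node has to agree on the same port, so a randomly chosen default would only be safe for single-node runs like this one.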