Random Master Port Number #485

Closed
magic282 opened this issue Nov 9, 2019 · 2 comments
Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on)

magic282 commented Nov 9, 2019

I am trying to run two experiments on an 8-GPU machine, each using 4 GPUs. When I start the second experiment, I get the following error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/pytorch_lightning/trainer/ddp_mixin.py", line 146, in ddp_train
    model.init_ddp_connection(self.proc_rank, self.world_size)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/pytorch_lightning/root_module/root_module.py", line 153, in init_ddp_connection
    dist.init_process_group('nccl', rank=proc_rank, world_size=world_size)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Address already in use

https://github.com/williamFalcon/pytorch-lightning/blob/master/pytorch_lightning/root_module/root_module.py#L134
I see that I can set the master port manually, but wouldn't it be better to pick the default port number at random from a range?
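
As a rough illustration of the suggestion above, here is a minimal sketch, not the library's implementation. It relies on the `MASTER_PORT` environment variable that PyTorch's env rendezvous handler (the one in the traceback) reads. The helper names `pick_master_port` and `set_master_port_if_unset` and the 10000-20000 range are assumptions made up for this example.

```python
import os
import random
import socket


def pick_master_port(low=10000, high=20000, max_tries=100):
    """Return a random port in [low, high] that could be bound when checked.

    Hypothetical helper: the range and retry count are arbitrary choices.
    """
    for _ in range(max_tries):
        port = random.randint(low, high)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                # Bind on all interfaces; failure means the port is taken.
                sock.bind(("", port))
            except OSError:
                continue
            return port
    raise RuntimeError("could not find a free port in the given range")


def set_master_port_if_unset():
    """Keep a user-provided MASTER_PORT; otherwise pick one at random."""
    if "MASTER_PORT" not in os.environ:
        os.environ["MASTER_PORT"] = str(pick_master_port())
    return int(os.environ["MASTER_PORT"])


if __name__ == "__main__":
    print("MASTER_PORT =", set_master_port_if_unset())
```

Two caveats apply. The port has to be chosen once in the parent process before the workers are spawned, so that every rank sees the same value. The check is also only best effort: another process could grab the port between the check and the moment the store binds it, so a collision is less likely but still possible.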

magic282 added the feature and help wanted labels on Nov 9, 2019
magic282 changed the title from "Dynamic Master Port Number" to "Random Master Port Number" on Nov 9, 2019
williamFalcon (Contributor) commented

@magic282 want to submit a PR?

magic282 commented Dec 5, 2019

@williamFalcon Sure.
