
trainer.test not working in ddp #2683

Closed
asrafulashiq opened this issue Jul 24, 2020 · 6 comments · Fixed by #2997
Labels: bug (Something isn't working) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on)

Comments
@asrafulashiq (Contributor)

🐛 Bug

Testing in ddp is not working in the latest master.

To Reproduce

I am using the gpu_template example from basic_examples in the repo: "python gpu_template.py --gpus 2 --distributed_backend ddp", but instead of trainer.fit(model) I am calling trainer.test(model).

I am getting "RuntimeError: connect() timed out".

Environment

  • PyTorch version: 1.3.1
  • OS: Ubuntu 18.04
@asrafulashiq asrafulashiq added bug Something isn't working help wanted Open to be worked on labels Jul 24, 2020
@github-actions (bot)

Hi! Thanks for your contribution, great first issue!

@jmarsil commented Jul 24, 2020

I'm facing a similar issue to @asrafulashiq. I trained a model using ddp across 4 GPUs on our institution's SLURM cluster. When running trainer.test(model), the script seems to freeze. Help on this would be greatly appreciated! Here is a screenshot of the output of my SLURM script for reference.

[Screenshot: SLURM job output, 2020-07-24]

@rakhimovv

Same issue here.

@edenlightning edenlightning added distributed Generic distributed-related topic Important labels Jul 29, 2020
@awaelchli awaelchli self-assigned this Aug 1, 2020
@awaelchli (Contributor)

Can confirm: it freezes, but only in ddp. Everyone facing this issue should use ddp_spawn until it is fixed.
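Why does ddp_spawn sidestep the freeze? With ddp, each rank launches the script independently, so process-level state such as the rendezvous port can diverge between ranks; with ddp_spawn, all workers are children of one parent process and inherit identical environment state. A minimal stdlib sketch of that inheritance property (this is an analogy, not Lightning's actual launch code; the port value 29500 is a hypothetical choice, and "fork" is used here for brevity where Lightning uses "spawn" — both start methods pass the parent's environment to the children):

```python
import multiprocessing as mp
import os

def get_port(rank):
    # Each worker reads the rendezvous port from its own environment.
    return os.environ.get("MASTER_PORT")

# The parent sets the port once; every child inherits the same value,
# so all "ranks" agree on where to rendezvous.
os.environ["MASTER_PORT"] = "29500"  # hypothetical fixed port
ctx = mp.get_context("fork")
with ctx.Pool(2) as pool:
    ports = pool.map(get_port, range(2))
print(ports)  # both workers report the same port
```

With independently launched scripts (plain ddp), nothing forces the two processes to compute the same port, which is consistent with the "connect() timed out" error above.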


@awaelchli (Contributor)

OK, more progress. The issue is caused by the master port being different on each rank. If we set the port manually, the problem is solved.
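As a stopgap along those lines, pinning the port before distributed initialization makes every rank connect to the same rendezvous address. A minimal sketch, assuming the standard PyTorch distributed environment variables (the specific port value here is a hypothetical choice; any free port works):

```python
import os

# Pin the rendezvous address/port before any distributed setup runs,
# so every rank computes the same values instead of diverging.
os.environ["MASTER_PORT"] = "29500"   # hypothetical fixed port
os.environ["MASTER_ADDR"] = "127.0.0.1"  # single-node assumption

print(os.environ["MASTER_PORT"])
```

These environment variables must be set identically in every rank's process (e.g. at the top of the script, before constructing the Trainer) for the workaround to take effect.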
