
trainer.test not working in ddp #2683

Closed
asrafulashiq opened this issue Jul 24, 2020 · 6 comments · Fixed by #2997
Labels: bug (Something isn't working) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on)

Comments
@asrafulashiq (Contributor)

🐛 Bug

Testing in ddp is not working in the latest master.

To Reproduce

I am using the gpu_template example from basic_examples in the repo: "python gpu_template.py --gpus 2 --distributed_backend ddp", but instead of trainer.fit(model) I am calling trainer.test(model).

I am getting "RuntimeError: connect() timed out".

Environment

  • PyTorch version: 1.3.1
  • OS: Ubuntu 18.04
@asrafulashiq asrafulashiq added bug Something isn't working help wanted Open to be worked on labels Jul 24, 2020
@github-actions (bot)

Hi! Thanks for your contribution, great first issue!

@jmarsil commented Jul 24, 2020

I'm facing a similar issue to @asrafulashiq. I trained a model using ddp across 4 GPUs on our institution's SLURM cluster. When running trainer.test(model), the script seems to freeze. Help on this would be greatly appreciated! Here is a screenshot of the output of my SLURM script for reference.

[Screenshot: SLURM job output, 2020-07-24]

@rakhimovv

Same issue here.

@edenlightning edenlightning added distributed Generic distributed-related topic Important labels Jul 29, 2020
@awaelchli awaelchli self-assigned this Aug 1, 2020
@awaelchli (Contributor)

Can confirm: it freezes, but only in ddp. Everyone facing this issue should use ddp_spawn until it is fixed.
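Why does ddp_spawn sidestep the freeze? With ddp, each rank launches the script independently, so process-level state such as the rendezvous port can diverge between ranks; with ddp_spawn, all workers are children of one parent process and inherit identical environment state. A minimal stdlib sketch of that inheritance property (this is an analogy, not Lightning's actual launch code; the port value 29500 is a hypothetical choice, and "fork" is used here for brevity where Lightning uses "spawn" — both start methods pass the parent's environment to the children):

```python
import multiprocessing as mp
import os

def get_port(rank):
    # Each worker reads the rendezvous port from its own environment.
    return os.environ.get("MASTER_PORT")

# The parent sets the port once; every child inherits the same value,
# so all "ranks" agree on where to rendezvous.
os.environ["MASTER_PORT"] = "29500"  # hypothetical fixed port
ctx = mp.get_context("fork")
with ctx.Pool(2) as pool:
    ports = pool.map(get_port, range(2))
print(ports)  # both workers report the same port
```

With independently launched scripts (plain ddp), nothing forces the two processes to compute the same port, which is consistent with the "connect() timed out" error above.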


@awaelchli (Contributor)

OK, more progress. The issue is caused by the master port being different on each rank. If we set the port manually, the problem is solved.
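As a stopgap along those lines, pinning the port before distributed initialization makes every rank connect to the same rendezvous address. A minimal sketch, assuming the standard PyTorch distributed environment variables (the specific port value here is a hypothetical choice; any free port works):

```python
import os

# Pin the rendezvous address/port before any distributed setup runs,
# so every rank computes the same values instead of diverging.
os.environ["MASTER_PORT"] = "29500"   # hypothetical fixed port
os.environ["MASTER_ADDR"] = "127.0.0.1"  # single-node assumption

print(os.environ["MASTER_PORT"])
```

These environment variables must be set identically in every rank's process (e.g. at the top of the script, before constructing the Trainer) for the workaround to take effect.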
