trainer.test not working in ddp #2683
Comments
Hi! Thanks for your contribution, and great first issue!
I'm facing a similar issue to @asrafulashiq. I trained a model using ddp across 4 GPUs on our institution's SLURM cluster. When running trainer.test(model), the script seems to freeze. Help on this would be greatly appreciated! Here is a screenshot of the output of my SLURM script for reference.
Same issue here.
Can confirm: it freezes, but only in ddp. For everyone facing this issue, use ddp_spawn until this is fixed.
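The workaround above amounts to swapping the backend string passed to the Trainer. A minimal sketch of that idea, assuming the helper name `pick_backend` (illustrative only, not part of the Lightning API):

```python
# Hypothetical helper: fall back to ddp_spawn while trainer.test hangs
# under ddp on the current master. The function name and logic are an
# illustration of the workaround, not Lightning code.
def pick_backend(requested: str) -> str:
    if requested == "ddp":
        # ddp_spawn launches workers via multiprocessing.spawn instead of
        # re-invoking the script, which sidesteps the test-time hang.
        return "ddp_spawn"
    return requested

print(pick_backend("ddp"))  # ddp_spawn
print(pick_backend("dp"))   # dp
```

The chosen string would then be passed as the `distributed_backend` argument when constructing the Trainer.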
I've identified that the process hangs at this line:
OK, more progress. The issue is caused by the master port being different on each rank. If we set the port manually, the problem is solved.
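Since PyTorch's distributed rendezvous reads `MASTER_ADDR` and `MASTER_PORT` from the environment, the fix described above can be sketched as pinning those variables before any processes are launched. The helper name and default port below are illustrative assumptions, not the issue author's exact code:

```python
import os

# Workaround sketch: every rank must agree on the rendezvous port, so pin
# MASTER_PORT (and MASTER_ADDR) in the environment before the Trainer
# spawns or connects worker processes. Otherwise each rank may pick a
# different port and connect() times out.
def pin_master_port(port: int = 12910) -> None:
    # Respect an address already set by the launcher (e.g. SLURM),
    # but force a single, shared port across all ranks.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ["MASTER_PORT"] = str(port)

pin_master_port()
print(os.environ["MASTER_PORT"])  # 12910
```

This must run before `trainer.test(model)` on every rank so that all processes rendezvous at the same address and port.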
🐛 Bug
Testing in ddp is not working on the latest master.
To Reproduce
I am using the gpu_template example from basic_examples in the repo: `python gpu_template.py --gpus 2 --distributed_backend ddp`, except that instead of trainer.fit(model) I am calling trainer.test(model).
I get `RuntimeError: connect() timed out`.
Environment