-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TypeError on run_training_teardown method on the master branch #1999
Comments
Probably this line here Not sure |
Investigated. This error should have shown up in the tests. |
@awaelchli so changing this to |
No, I tested it, the problem is not there. The problem is that the |
Yes. I think, that's why I explicitly passed the |
Maybe we can wrap the cleanup code into a closure that binds self and then the decorator can be applied to this closure function. Not sure though, have not looked at the details |
Do you want to take this over? Otherwise I'd try to make some time for it later/tomorrow :) |
more broadly, why is that function needed? we already have teardown that works on ctrl+c no? And we already have that signal registered... |
This would start teardown for all kills except SIGKILL (like SIGTERM etc.). There are clusters (like non-SLURM) that also need this kind of signal handling. And I think we should do cleanup whenever a job ends if possible (either exit after program end or exit on error). Otherwise you may get issues with checkpointing etc. Also this would maybe enable proper hparam logging with metrics (not sure about that though). |
I think I better keep my hands away from it. I can't test the SLURM signals anyway due to lack of this setup. |
SLURM shall be also running on CPU :] |
@awaelchli please see comments. i'm not sure we should have this handler thing. i think it'll also break ddp |
🐛 Bug
I switched to the
master
branch in order to test the bugfix for #1919 but the same code that was running on the stable version0.7.6
is not running anymore.To Reproduce
Maybe just switching to the recent master branch would reproduce the issue.
Expected behavior
I would've expected it to run exactly the same way as it does while using the
0.7.6
versionEnvironment
conda
,pip
, source): pipAdditional context
The same error is present while running the
collect_env_details.py
script.The text was updated successfully, but these errors were encountered: