Switching from USR1 Breaks Pytorch Lightning #1709
Comments
Actually, it turns out that changing the signal alone wasn't enough to fix the problem, so now I'm trying to figure out how this was ever working for me (it was working on older versions of everything).
OK, I figured it out. It seems that at some point Lightning added the following:
so if someone (i.e. submitit) has already set a signal handler, it won't overwrite it. Simple to fix with
to clear out the submitit signal handler in my
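The snippets from this comment did not survive, so here is only a rough sketch of the behaviour being described, not Lightning's or submitit's actual code. It uses the standard-library `signal` module; the helper names (`handler_already_set`, `install_if_unset`, `clear_handler`) are hypothetical and exist only for illustration.

```python
import signal


def handler_already_set(signum: int) -> bool:
    """Return True if a non-default handler is currently installed for signum."""
    return signal.getsignal(signum) not in (None, signal.SIG_DFL)


def install_if_unset(signum: int, handler) -> bool:
    """Install handler only if nobody else has claimed the signal.

    This mimics the "don't overwrite an existing handler" behaviour described
    above (an assumption about its shape, not the real implementation).
    """
    if handler_already_set(signum):
        return False
    signal.signal(signum, handler)
    return True


def clear_handler(signum: int) -> None:
    """Restore the default disposition so a later install_if_unset() succeeds."""
    signal.signal(signum, signal.SIG_DFL)
```

Under these assumptions, calling `clear_handler(signal.SIGUSR1)` before the trainer starts would remove whatever handler the launcher installed, so the "install only if unset" check passes again. Note that `signal.signal` can only be called from the main thread.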
Hi, I'm not sure I understand correctly.
I already contributed a fix to PL that lets users change the signal to whatever they want, so I can tell it to use SIGUSR2 now.
So what I would like (and will probably open PRs for) is a proper parameter that can be passed to submitit to tell it which signal to use, along with another setting that tells it not to set signal handlers. But let's discuss it.
The recent change to which signal Slurm sends on preemption breaks a workflow with PyTorch Lightning.
Lightning currently has USR1 hardcoded in its logic for automatically saving an HPC checkpoint and calling scontrol on its own. Generally I've found it much easier to use this than to use submitit to handle the requeue, so submitit is only used for launching (via hydra).
Probably Lightning should also make this configurable, but it would also be nice if this were a first-class option we could pass to submitit somehow. Currently I have it set using the environment variable, but this is ugly and not something that can be configured through hydra.
Another possibility would be an option telling submitit to stay out of the process entirely, since it shouldn't really even be handling signals in this case (though it still needs to tell Slurm to send them).
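For reference, asking Slurm to send a signal is separate from handling it: sbatch accepts `--signal=<sig>[@<seconds>]`, which delivers the signal that many seconds before the time limit. A minimal sketch of building that flag from a configurable signal name follows. The helper `slurm_signal_flag` and the 90-second default are hypothetical, and this is not how submitit builds its sbatch script.

```python
import signal


def slurm_signal_flag(sig_name: str = "USR1", delay_s: int = 90) -> str:
    """Build an sbatch --signal flag from a signal name such as "USR1" or "SIGUSR2".

    The name is validated against Python's signal module so that a typo fails
    early rather than at submission time.
    """
    name = sig_name.upper()
    if name.startswith("SIG"):
        name = name[3:]
    if not hasattr(signal, "SIG" + name):
        raise ValueError(f"unknown signal name: {sig_name!r}")
    if delay_s < 0:
        raise ValueError("delay_s must be non-negative")
    return f"--signal={name}@{delay_s}"
```

With something like this, the launcher would only need to emit the flag, and whichever component owns the requeue logic would install the handler.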
So I just wanted to flag this with you guys so we can explore solutions. NCCL issues aside, my understanding was that USR1 is the standard signal to use for this.