Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switching from USR1 Breaks Pytorch Lightning #1709

Open
Queuecumber opened this issue Sep 8, 2022 · 4 comments
Open

Switching from USR1 Breaks Pytorch Lightning #1709

Queuecumber opened this issue Sep 8, 2022 · 4 comments

Comments

@Queuecumber
Copy link
Contributor

The recent change of which signal to have slurm send for preemption breaks a flow with pytorch lightning.

Lightning currently has USR1 hardcoded in its logic for automatically saving an HPC checkpoint and calling scontrol on its own. Generally I've found it much easier to use this than to use submitit to handle the requeue, so submitit is only used for launching (via hydra).

Probably lightning should also make this configurable but it would also be nice if this was a first-class option we could pass to submitit somehow. Currently I have it set using the environment variable, but this is ugly and not something that can be configured through hydra.

Another option would be some option to tell submitit to stay out of the process entirely since it shouldn't really even be handling signals in this case (though it still needs to tell slurm to send them).

So I just wanted to flag this with you guys so we can explore solutions, NCCL issues aside my understanding was that USR1 is the standard signal to use for this.

@Queuecumber
Copy link
Contributor Author

Actually it turns out that changing the signal alone wasnt enough to fix the problem so now I'm trying to figure out how this was ever working for me (it was working on older versions of everything)

@Queuecumber
Copy link
Contributor Author

Queuecumber commented Sep 8, 2022

OK I figured it out, seems like at some point lightning added the following:

if sigusr1_handlers and not self._has_already_handler(signal.SIGUSR1):
                self._register_signal(signal.SIGUSR1, HandlersCompose(sigusr1_handlers))

def _has_already_handler(signum: _SIGNUM) -> bool:
        return signal.getsignal(signum) not in (None, signal.SIG_DFL)

so if someone (i.e. submitit) already set a signal handler they wont overwrite. Simple to fix with

# HACK reset the signal handling so the lightning is free to set it
signal.signal(signal.SIGUSR1, signal.SIG_DFL)

to clear out the submitit signal handler in my __init__.py

@gwenzek
Copy link
Contributor

gwenzek commented Sep 14, 2022

Hi, I'm not sure I understand correctly.
What does pytorch lightning want ? to handle USR1 itself or it want submitit to handle USR1 for you ?
note that you now have the env var SUBMITIT_PREEMPT_SIGNAL to control the signal used, so you could set that to USR1 to get old behavior I believe.

@Queuecumber
Copy link
Contributor Author

Queuecumber commented Sep 14, 2022

I contributed a fix to PL already that lets users change the signal to whatever they want, so I can tell it to use SIGUSR2 now.

  1. Lightning was assuming USR1 as the preemption signal because it handles checkpointing and requeueing by itself, my understanding is that this is the standard signal people are using so changing the default may not have been the best idea
  2. The env variable is certainly a possibility but it's not convenient to change and I have to tell people to do it manually I can't have it in a hydra config for example. My contribution to PL pretty much mitigates this because we can work around submitit's behavior now by telling PL to catch USR2 instead of USR1 (and that setting can go into a hydra config)
  3. PL is respectful of other libraries so it won't set its signal handler if submitit set one already. In order to use their auto-checkpointing and requeueing I need to unset the signal handler (signal.signal(signal.SIGUSR2, signal.SIG_DFL)) after submitit has started up.

so what I would like (and will probably open PRs for) is a proper parameter that can be passed to submitit that tells it what signal to use along with another setting that tells it not to set signal handlers. But let's discuss it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants