Lightning sends SIGTERM when using other SLURM manager #14893
Hi @YannDubs. To be clear, Lightning does not trigger the SIGTERM, right? It is the SLURM cluster. The "bypassing signal" messages you see come from Lightning's signal handling. In 1.6 we introduced a flag to opt out of this behavior. Also, I think it would be awesome if we had a submitit example in our SLURM docs :)
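The comment does not name the flag; assuming it refers to the `auto_requeue` option of `SLURMEnvironment` (available in recent versions), a minimal sketch of using it could look like this:

```python
# Sketch only: assumes the flag mentioned above is SLURMEnvironment(auto_requeue=...)
# (present in pytorch_lightning >= 1.6). Setting it to False disables Lightning's
# SLURM auto-requeue handling so Lightning no longer reacts to the requeue signal.
import pytorch_lightning as pl
from pytorch_lightning.plugins.environments import SLURMEnvironment

trainer = pl.Trainer(
    plugins=[SLURMEnvironment(auto_requeue=False)],
)
```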
I use PL and submitit quite heavily and haven't had any big issues since #14626 landed. I do see these messages in my logs, but they don't seem to do anything besides look ugly. I always assumed this wasn't being caused by PL, but maybe it's worth looking into?
They're actually from submitit, but they're getting printed multiple times, as though the SIGTERM is sent more than once.
True, they are from submitit. We have a very similar info message in PL, which is why I got misled.
One more thing to keep in mind (it may be unrelated): when using submitit, unless you take particular steps, Lightning doesn't even set its signal handlers. This is because Lightning is "polite" and won't register its handlers if some other library has already set them up. Submitit registers its signal handlers very early in the lifetime of the application, so by the time Lightning gets around to the SLURM setup, submitit's handlers are already present. I had to work around this; see facebookincubator/submitit#1709 and https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/trainer/connectors/signal_connector.py#L63
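The actual workaround isn't quoted in the thread; as a rough illustration of the idea only (assuming Lightning skips registration whenever a non-default handler is already installed), one could restore the default handlers before constructing the Trainer:

```python
# Hypothetical sketch, not the commenter's actual workaround: restore the
# default handlers for the signals submitit grabbed, so that when the Trainer
# is created Lightning sees no pre-existing handlers and registers its own.
import signal

for sig in (signal.SIGTERM, signal.SIGUSR1, signal.SIGUSR2):
    signal.signal(sig, signal.SIG_DFL)

# trainer = pl.Trainer(...)  # build the Trainer only after resetting handlers
```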
Thanks for the quick answer @Queuecumber @awaelchli. I forgot to say that I had already tried the flag you mentioned. I did some digging: the warning is raised by a line in Lightning that spawns a multiprocessing process. I'm not sure why this sends a SIGTERM, or why this line runs with multiprocessing in the first place. Any thoughts?
I have no thoughts other than that this is super weird and interesting.
This was a workaround for a torch issue in combination with CUDA and forking. The code was recently removed on master in favor of a different solution that does not use multiprocessing. I also can't say why it would be emitting the SIGTERM. It may be worth testing your code with the latest version on master; you can install it from source.
Just to clarify, is this actually crashing your script, or is it just that your logs have extra stuff in them?
No, my scripts aren't crashing because submitit bypasses those signals; I've actually been seeing these warnings for a year. But my logs are full of them, and I wanted to make sure it wasn't an issue with our internal SLURM configs. Now I'm confident it's not an important warning. Thanks @awaelchli, there seems to be no error with the latest version. Let's see once it's merged and I use it for larger projects. Thanks to both of you, I'm closing the issue for now, although I'm still very surprised about why this happened.
Thanks @YannDubs
Actually, I think it's good that this is resolved, because it may have been causing a real problem. You're not supposed to print inside signal handlers, and doing so can cause random crashes. Since submitit prints inside its signal handlers (and I think Lightning does too), I've been getting intermittent crashes. The more often that print statement executes, the more likely you are to see a crash, and because whatever is happening here raises many SIGTERMs, each of which triggers the signal handler and its print, it makes the crash much more likely. Will try this again on 1.8 when it's released.
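As a general illustration of why this matters (a minimal sketch, not submitit's or Lightning's actual handler): a handler that only sets a flag stays safe, whereas printing from inside the handler can re-enter non-reentrant I/O code and crash the process intermittently.

```python
import signal

got_sigterm = False

def _handle_sigterm(signum, frame):
    # Only record that the signal arrived; printing or logging here can
    # re-enter non-reentrant code and crash the process intermittently.
    global got_sigterm
    got_sigterm = True

signal.signal(signal.SIGTERM, _handle_sigterm)

# ... later, in the main loop, it is safe to react (and print) here:
if got_sigterm:
    print("received SIGTERM, cleaning up")
```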
Bug description
PyTorch Lightning does not work when using another tool for SLURM scheduling. In particular, all my jobs receive many `SIGTERM` signals when using submitit. This and similar issues seem to have been raised many times but never solved (maybe due to a lack of reproducible code), see: #5969 #5225 (maybe #10154).
How to reproduce the bug
I made a minimal reproducible repo for the bug here. Please see the README there. Needless to say that you need SLURM, and hopefully the error does not depend on SLURM config.
The code only consists of scheduling a model on SLURM and checking the logs. The main script (`main.py`) runs a logistic regression; the rest is simply the SLURM config (`config/sigterm.py`), where you should change the partition to one on your SLURM cluster.
Once you run `python main.py -m`, this will schedule the job on SLURM and print the logging directory (e.g. `multirun/2022-09-25/20-28-21/`). If you open the logging file (e.g. `less multirun/2022-09-25/20-28-21/0/main.log`) you should see all the SIGTERM messages: `Bypassing signal SIGTERM`
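For readers who don't open the repo, the reproduction is roughly of this shape (a sketch only; file names, the Hydra/submitit launcher config, and hyperparameters are assumptions, not copied from the linked repo):

```python
# Rough sketch of the reproduction: a tiny LightningModule trained through
# Hydra, with the submitit SLURM launcher selected in the config so that
# `python main.py -m` schedules the job on the cluster.
import hydra
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LogisticRegression(pl.LightningModule):
    def __init__(self, in_dim: int = 10, n_classes: int = 2):
        super().__init__()
        self.linear = nn.Linear(in_dim, n_classes)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.linear(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


@hydra.main(config_path="config", config_name="sigterm")
def main(cfg):
    # Random data stands in for a real dataset; the SIGTERM spam shows up in
    # the job's main.log regardless of what is being trained.
    data = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
    trainer = pl.Trainer(max_epochs=2, enable_progress_bar=False)
    trainer.fit(LogisticRegression(), DataLoader(data, batch_size=32))


if __name__ == "__main__":
    main()
```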
Error messages and logs
Important info
Please see the requirements.txt. The Lightning version is `1.7.7`, but I have had these SIGTERMs since at least version 1.5.
More info
More generally, there should be an easy way to completely deactivate the SLURM integration in PyTorch Lightning. This has already caused many issues (e.g. #6389, #3651) and will probably continue to do so. The thread in #6389 shows that there is a lot of interest in being able to deactivate it (as suggested by @Queuecumber and @carmocca), and it seems very cheap to do.
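A hedged sketch of what opting out of SLURM detection can look like today (assuming the `LightningEnvironment` plugin override available in recent PL versions; this is a workaround, not an officially documented off-switch):

```python
# Sketch: force a non-SLURM cluster environment so Lightning does not
# auto-detect SLURM, register SLURM signal handlers, or auto-requeue.
# Assumes pytorch_lightning ~1.7, where LightningEnvironment is the default
# environment used outside of SLURM.
import pytorch_lightning as pl
from pytorch_lightning.plugins.environments import LightningEnvironment

trainer = pl.Trainer(plugins=[LightningEnvironment()])
```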
In my case, I often need two PyTorch Lightning models in a single script (self-supervised learning + linear probing), so I want to manage SLURM myself for multiple Lightning trainers and thus don't want Lightning to do it for me (there are other reasons too, but this is the most prominent).
Tagging people who seem to have thoughts and knowledge about all of this: @awaelchli