
Lightning sends SIGTERM when using other SLURM manager #14893

Closed
YannDubs opened this issue Sep 26, 2022 · 12 comments
Labels: bug (Something isn't working), environment: slurm

YannDubs commented Sep 26, 2022

Bug description

PyTorch Lightning does not work well when another tool manages SLURM scheduling. In particular, all my jobs receive a lot of SIGTERM signals when launched with submitit.

This and similar issues seem to have been raised many times but never resolved (maybe due to a lack of reproducible code); see #5969, #5225, and maybe #10154.

How to reproduce the bug

I made a minimal reproducible repo for the bug here; please see its README. Needless to say, you need SLURM, and hopefully the error does not depend on the SLURM configuration.

The code simply schedules a small job on SLURM and checks the logs. The main script (main.py) runs a logistic regression.

The rest is just the SLURM config (config/sigterm.py), where you should change the partition to one available on your cluster.

Once you run python main.py -m, the job is scheduled on SLURM and the logging directory is printed (e.g. multirun/2022-09-25/20-28-21/). If you open the log file (e.g. less multirun/2022-09-25/20-28-21/0/main.log) you should see all the SIGTERM signals being reported: Bypassing signal SIGTERM
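For readers who want the shape of the setup without cloning the repo, here is a minimal sketch of the same kind of job using plain submitit instead of the repo's launcher. The tiny model, partition name, and timeout below are placeholders of my own, not values taken from the repo:

```python
import submitit
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyLogReg(pl.LightningModule):
    """Placeholder stand-in for the logistic regression trained in main.py."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(10, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.linear(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def train():
    data = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
    trainer = pl.Trainer(max_epochs=1, accelerator="auto", devices=1)
    trainer.fit(TinyLogReg(), DataLoader(data, batch_size=32))


if __name__ == "__main__":
    # submitit schedules `train` as a SLURM job, similar to what the repo's launcher does.
    executor = submitit.AutoExecutor(folder="submitit_logs")  # log folder is arbitrary
    executor.update_parameters(slurm_partition="YOUR_PARTITION", timeout_min=10)
    job = executor.submit(train)
    print(job.job_id)
```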

Error messages and logs

[Screenshot 2022-09-25 at 20:53:42: log output showing the repeated "Bypassing signal SIGTERM" messages]

Important info

Please see requirements.txt. The Lightning version is 1.7.7, but I have been getting these SIGTERMs since at least version 1.5.

More info

More generally, there should be an easy way to completely deactivate the SLURM integration in PyTorch Lightning. It has already caused many issues (e.g. #6389, #3651) and will probably continue to do so. The thread in #6389 shows that there is a lot of interest in being able to deactivate it (as suggested by @Queuecumber and @carmocca), and it seems very cheap to do.

In my case, I often need two PyTorch Lightning models in a single script (self-supervised learning + linear probing), so I want to manage SLURM myself for multiple Lightning trainers rather than have Lightning do it for me (there are also other reasons, but this is the most prominent).
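As an illustration of that use case, here is a rough sketch (my own, not taken from the issue) of letting submitit own the requeue logic for one SLURM job that runs two trainers back to back. train_ssl and train_linear_probe are hypothetical stand-ins for the two stages, and the partition/timeout are placeholders:

```python
import submitit


def train_ssl():
    """Hypothetical: builds and fits the self-supervised pl.Trainer, returns the backbone."""
    ...


def train_linear_probe(backbone):
    """Hypothetical: builds and fits the linear-probing pl.Trainer on top of the backbone."""
    ...


class TwoStageJob(submitit.helpers.Checkpointable):
    """One SLURM job that runs both trainers; submitit, not Lightning, handles requeuing."""

    def __call__(self):
        backbone = train_ssl()
        train_linear_probe(backbone)

    def checkpoint(self, *args, **kwargs):
        # Called by submitit when the job is about to be preempted or time out:
        # resubmit the same callable so the job is requeued.
        return submitit.helpers.DelayedSubmission(self, *args, **kwargs)


executor = submitit.AutoExecutor(folder="submitit_logs")
executor.update_parameters(slurm_partition="YOUR_PARTITION", timeout_min=60)
executor.submit(TwoStageJob())
```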

Tagging people who seem to have thoughts and knowledge about all of this: @awaelchli

YannDubs added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Sep 26, 2022
awaelchli (Contributor) commented:

Hi @YannDubs

To be clear, Lightning does not trigger the SIGTERM, right? It is the SLURM cluster. The "Bypassing signal" messages you see come from Lightning's signal handling.

In 1.6 we introduced a flag, auto_requeue=True|False (#10601), which you can set to False if you prefer that Lightning not handle any signals to requeue the job. Try setting it to False and see if it works for you :)
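For reference, a minimal sketch of how that flag is passed (this is also the form YannDubs reports trying further down in the thread):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

# Ask Lightning not to install the signal handlers it normally uses to
# requeue SLURM jobs, leaving submitit (or your own code) in charge.
trainer = Trainer(plugins=[SLURMEnvironment(auto_requeue=False)])
```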

Also, I think it would be awesome if we had a Submitit example in our SLURM docs :)

awaelchli added the environment: slurm label and removed the needs triage (Waiting to be triaged by maintainers) label on Sep 26, 2022
Queuecumber (Contributor) commented:

I'm pretty sure auto_requeue=False is what you want, but I haven't actually tried it.

I use PL and submitit quite heavily, and I haven't had any big issues since #14626 landed.

I do see these messages in my logs, but they don't seem to do anything besides look ugly. I always assumed this wasn't being caused by PL, but maybe it's worth looking into?

Queuecumber (Contributor) commented:

The messages "bypassing signal" you see are from Lightning handling

They're actually from submitit, but they're getting printed multiple times, as though the SIGTERM is sent more than once.

awaelchli (Contributor) commented:

True, they are from submitit. We have a very similar info message in PL, which is why I was misled.

Queuecumber (Contributor) commented:

Also, one more thing to keep in mind (it may be unrelated): when using submitit, unless you take particular steps, Lightning doesn't even set its signal handlers.

This is because Lightning is "polite" and won't set its signal handlers if some library has already set them up. Submitit installs its handlers very early in the lifetime of the application, so by the time Lightning gets to its SLURM setup, submitit's handlers are already present.

I had to work around this by calling signal.signal(signal.SIGUSR2, signal.SIG_DFL) right before I create my Trainer.

See facebookincubator/submitit#1709 and https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/trainer/connectors/signal_connector.py#L63
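A minimal sketch of that workaround in context (SIGUSR2 as in the comment above; adjust to whichever signal your submitit setup actually registers):

```python
import signal

from pytorch_lightning import Trainer

# Restore the default handler for the signal submitit grabbed at startup,
# so Lightning's SignalConnector no longer sees a foreign handler and
# registers its own SLURM handlers when the Trainer is created.
signal.signal(signal.SIGUSR2, signal.SIG_DFL)
trainer = Trainer(max_epochs=1)
```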

YannDubs (Author) commented:

Thanks for the quick answer @Queuecumber @awaelchli

I forgot to say that I tried plugins=[SLURMEnvironment(auto_requeue=False)] but this did not make any difference. I even tried to delete the SLURM environment variables as suggested by this comment but I still see the warnings.

I did some digging: the warning is raised by CUDAAccelerator.is_available(), which is called in several places when initializing the trainer. In particular, it seems to come from pool.apply(torch.cuda.device_count).

I'm not sure why this sends a SIGTERM, or why this line uses multiprocessing in the first place. Any thoughts?
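One plausible mechanism, offered here as an assumption rather than something confirmed in this thread: the device count was queried inside a fork-started worker process, the forked worker inherits submitit's SIGTERM-logging handler, and tearing the worker down delivers a SIGTERM that the inherited handler then logs. A standalone sketch of that effect:

```python
import multiprocessing as mp
import os
import signal
import time


def log_and_ignore(sig, frame):
    # Stand-in for submitit's handler, which logs "Bypassing signal SIGTERM"
    # instead of exiting.
    print(f"Bypassing signal {signal.Signals(sig).name} in pid {os.getpid()}", flush=True)


signal.signal(signal.SIGTERM, log_and_ignore)

# With the "fork" start method the child inherits the parent's signal handlers,
# just like the workers of a fork-context multiprocessing pool would.
ctx = mp.get_context("fork")
child = ctx.Process(target=time.sleep, args=(30,))
child.start()
child.terminate()   # Process.terminate() sends SIGTERM on Unix
time.sleep(1)       # give the child's handler time to run and print
child.kill()        # SIGKILL so the example does not hang
child.join()
```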

Queuecumber (Contributor) commented:

I have no thoughts other than that this is super weird and interesting

awaelchli commented Sep 26, 2022

I did some digging: the warning is raised by CUDAAccelerator.is_available(), which is called in several places when initializing the trainer. In particular, it seems to come from pool.apply(torch.cuda.device_count).

I'm not sure why this sends a SIGTERM, or why this line uses multiprocessing in the first place. Any thoughts?

This was a workaround for a torch issue in combination with CUDA and forking. The code was recently removed on master in favor of a different solution that does not use multiprocessing. I also can't say why it would be emitting the SIGTERM.

Maybe it's worth testing your code against the latest version on master. You can install from source via pip install https://github.com/Lightning-AI/lightning/archive/refs/heads/master.zip -U. Hope this helps, and sorry for the trouble.

Queuecumber (Contributor) commented:

Just to clarify, is this actually crashing your script or is it just that your logs have extra stuff in them?

YannDubs commented Sep 26, 2022

No, my scripts aren't crashing, because submitit bypasses those signals; I've actually been seeing these warnings for a year. But my logs are full of them, and I wanted to make sure it was not an issue with our internal SLURM configs. Now I'm confident it is not an important warning.

Thanks @awaelchli, there seems to be no error with the latest version on master. Let's see once it's released and I use it for larger projects.

Thanks to you both, I'm closing the issue for now, although I'm still very surprised about why this happened.

awaelchli (Contributor) commented:

Thanks @awaelchli, there seems to be no error with the latest version on master. Let's see once it's released and I use it for larger projects.

Thanks @YannDubs
This will be released in the next few days as part of the 1.8 release.

Queuecumber (Contributor) commented:

Actually, I think it's good that this is resolved, because it may actually have been causing a problem.

Apparently you're not supposed to print inside signal handlers, and doing so can cause random crashes.

Since submitit prints inside its signal handlers (and I think Lightning does this too), I've actually been getting intermittent crashes.

Of course, the more often that print statement executes, the more likely you are to see a crash. Whatever is happening here raises many SIGTERMs, each of which triggers the signal handler and its print, so it makes this crash much more likely.
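For what it's worth, a common pattern to sidestep this class of crash (general Python practice, not something taken from submitit or Lightning) is to only set a flag inside the handler and do the printing/logging in regular code:

```python
import signal

_got_sigterm = False


def _record_sigterm(signum, frame):
    # Only flip a flag: print()/logging are not safe to call from a handler
    # that may interrupt an in-progress write.
    global _got_sigterm
    _got_sigterm = True


signal.signal(signal.SIGTERM, _record_sigterm)

# ...later, in ordinary control flow (e.g. at the end of a training step):
if _got_sigterm:
    print("Received SIGTERM; checkpointing and requeueing")
```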

Will try this again on 1.8 when it's released
