Disables siginterrupt for SIGUSR1 #1844

Merged on Sep 12, 2016 (1 commit)


daveFNbuck
Contributor

Description

Sets siginterrupt to False for SIGUSR1 when installing the shutdown handler.
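As a rough illustration of the change described above (not the code from this PR), the sketch below installs a handler and then turns off system-call interruption for the same signal. The function name `install_shutdown_handler` and its parameters are hypothetical. The ordering matters because, per the Python `signal` documentation, `signal.signal()` resets the restart behaviour to interruptible, so `signal.siginterrupt()` has to be called afterwards.

```python
import signal


def install_shutdown_handler(handler, signum=signal.SIGUSR1):
    """Install `handler` for `signum` and ask the OS to restart interrupted system calls.

    Hypothetical helper for illustration only. Must be called from the main
    thread on a Unix platform.
    """
    # Register the Python-level handler first; this returns the old handler.
    previous = signal.signal(signum, handler)
    # signal.signal() implicitly re-enables interruption for this signal,
    # so the restart flag has to be set after the handler is installed.
    signal.siginterrupt(signum, False)
    return previous
```

A caller would pass its own shutdown callback, for example `install_shutdown_handler(lambda signum, frame: request_stop())`, where `request_stop` stands in for whatever the worker uses to stop accepting new work.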

Motivation and Context

I've recently started using SIGUSR1 to stop old workers on deploy, and each deploy has resulted in all MapReduce tasks dying with `IOError: [Errno 4] Interrupted system call`. To avoid this, we want SIGUSR1 not to interrupt system calls.

Have you tested this? If so, how?

I wrote some unit tests and tested it heavily in production, both by sending lots of SIGUSR1 signals to running MapReduce jobs and by confirming that the regular stream of errors during deploys stopped.

I've recently started using SIGUSR1 to stop old workers on deploy and each
deploy has resulted in all MapReduce tasks dying with
`IOError: [Errno 4] Interrupted system call`. Setting siginterrupt to
False for SIGUSR1 prevents this error for me.
@Tarrasch
Contributor

Tarrasch commented Sep 12, 2016

Looks good to me. The concept of siginterrupt is new to me (and I feel I still don't understand it after a lot of reading). What is the "system call" that is interrupted? Is it a system call that opens child processes (like MapReduce jobs)?

@Tarrasch Tarrasch merged commit 6d7a958 into spotify:master Sep 12, 2016
@daveFNbuck daveFNbuck deleted the no_signal_interrupt branch September 12, 2016 16:46
@daveFNbuck
Contributor Author

Siginterrupt is new to me as well, but I was able to solve the problem thanks to some Googling. I can't find the Stack Overflow thread that helped me solve it again though :(. Most of the time, the system call that got interrupted was the read of the stderr pipe of the subprocess spawned for the MapReduce job, on line 295 of luigi/contrib/hadoop.py: https://github.com/spotify/luigi/blob/master/luigi/contrib/hadoop.py#L295

If the signal arrived during scheduling, it could instead interrupt the multiprocessing queue's get during a call to conn.recv(), which I'm guessing is also a pipe read.
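To make the "interrupted pipe read" scenario concrete, here is a small self-contained sketch, not taken from luigi, in which SIGUSR1 is sent while the main thread is blocked reading a pipe. The function name `read_through_signal` and its parameters are made up for this example. With siginterrupt disabled, the read should resume and return the data once it is written. Note that on Python 3.5 and later, PEP 475 already retries most interrupted system calls when the handler does not raise, so the `IOError` reported in this PR is mainly a concern on older interpreters.

```python
import os
import signal
import threading
import time


def read_through_signal(payload=b"job output", delay=0.1):
    """Block on a pipe read while SIGUSR1 arrives, then return what was read.

    Illustrative sketch only; must run in the main thread on a Unix platform.
    """
    received = []
    signal.signal(signal.SIGUSR1, lambda signum, frame: received.append(signum))
    # Ask the OS to restart system calls interrupted by SIGUSR1.
    signal.siginterrupt(signal.SIGUSR1, False)

    read_fd, write_fd = os.pipe()

    def disturb_then_write():
        time.sleep(delay)
        # Should arrive while the main thread is blocked in os.read().
        os.kill(os.getpid(), signal.SIGUSR1)
        time.sleep(delay)
        os.write(write_fd, payload)
        os.close(write_fd)

    worker = threading.Thread(target=disturb_then_write)
    worker.start()
    try:
        # Without restart semantics (and before PEP 475), this is the kind of
        # call that could fail with "[Errno 4] Interrupted system call".
        data = os.read(read_fd, 1024)
    finally:
        worker.join()
        os.close(read_fd)
    return data, received
```

Calling `read_through_signal()` returns the bytes read and the list of signal numbers the handler saw, which makes it easy to check that the signal was handled without the read failing.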

This was referenced Jun 29, 2022