Fix standalone command hanging on kill #23271

Closed

dstandish
Contributor

@dstandish dstandish commented Apr 26, 2022

It appears signals were not being forwarded properly. After appropriating the preexec_fn from SubprocessHook, the issue is resolved.

I'm not sure why this code is necessary. I was able to find some context around the original addition of this logic to the bash operator:
The initial commit: ca96104
The Jira issue: https://issues.apache.org/jira/browse/AIRFLOW-1745
A referenced Stack Overflow post: https://stackoverflow.com/questions/22077881/yes-reporting-error-with-subprocess-communicate

@dstandish dstandish requested review from ashb and jedcunningham April 26, 2022 18:40
@dstandish dstandish requested a review from uranusjr April 26, 2022 18:52
@ashb
Member

ashb commented Apr 26, 2022

/cc @andrewgodwin

Contributor

@andrewgodwin andrewgodwin left a comment

Seems very sensible to me.

@dstandish
Contributor Author

update... i found only calling os.setsid() in preexec_fn solves it. removed the other stuff (which i think is a py2 thing)

not sure why this works though.

@potiuk
Member

potiuk commented Apr 26, 2022

Yeah. Thanks for the context @dstandish .

I think this might work, but it would be great to find out what the real cause of the hanging was.

I was looking at it and scratching my head over why this would work. I think it masks rather than solves the problem, but maybe this is a good solution for now. Likely we do not need to handle the three signals separately, though.

The reason I think it "works" is setsid(), not the default signal handling loop. What it does is effectively move the child into a different process group than the parent's.

The thing is (a not well-known fact) that when you press Ctrl+C, SIGINT is not sent only to the process running in the foreground; it is sent to the whole foreground process group. By calling setsid in preexec, you make the child the leader of a new session and of a new process group whose ID equals the child's process ID.
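
To make the process-group point concrete, here is a minimal sketch (POSIX only) of a child started with `preexec_fn=os.setsid`. The helper name `spawn_in_new_session` and the sleeping child command are made up for illustration and are not taken from the PR.

```python
import os
import subprocess
import sys


def spawn_in_new_session(args):
    """Start a child that calls os.setsid() after fork and before exec."""
    # Hypothetical helper: the PR itself only passes preexec_fn=os.setsid.
    return subprocess.Popen(args, preexec_fn=os.setsid)


# A throwaway child that just sleeps, so we can inspect its process group.
child = spawn_in_new_session([sys.executable, "-c", "import time; time.sleep(30)"])

parent_pgid = os.getpgrp()          # process group of this (parent) process
child_pgid = os.getpgid(child.pid)  # process group of the child

# After setsid() the child leads its own group, so a terminal Ctrl+C aimed
# at the parent's foreground group is not delivered to it.
in_own_group = child_pgid == child.pid and child_pgid != parent_pgid
print("child in its own process group:", in_own_group)

child.terminate()
child.wait()
```

`subprocess.Popen(..., start_new_session=True)` asks for the same `setsid()` call without needing a `preexec_fn`.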

This means that when you press Ctrl+C with the standalone process, SIGINT will not be propagated to the children; it will only be sent to the "parent" standalone process, which exits. The child processes will then get SIGPIPE the first moment they try to write any kind of output to the PIPE from the parent process, so they will get killed eventually.
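
A small sketch of the SIGPIPE mechanism described above, under one loudly stated assumption: the child explicitly resets SIGPIPE to its default action. CPython ignores SIGPIPE at startup, so an unmodified Python child would see `BrokenPipeError` on the write instead of being killed by the signal. The child script here is invented for the demonstration.

```python
import signal
import subprocess
import sys

# Illustrative child: restore the default SIGPIPE action, then keep writing
# to stdout (file descriptor 1), which is a pipe owned by the parent.
CHILD_CODE = """
import os, signal
signal.signal(signal.SIGPIPE, signal.SIG_DFL)
while True:
    os.write(1, b"x" * 1024)
"""

proc = subprocess.Popen([sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE)

# Simulate the parent going away: close the only read end of the pipe.
proc.stdout.close()

# The child's next write has no reader, so the kernel sends it SIGPIPE.
return_code = proc.wait(timeout=30)
killed_by_sigpipe = return_code == -signal.SIGPIPE
print("child killed by SIGPIPE:", killed_by_sigpipe)
```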

I think the reason for the original hanging could be a race condition when multiple signals are handled.

The hypothesis I have:

  • Ctrl+C is pressed
  • all processes get SIGINT
  • the parent process handles it and exits
  • the child process writes "Keyboard Interrupt" to PIPE and receives SIGPIPE, which causes a deadlock.

@potiuk
Member

potiuk commented Apr 26, 2022

> update... i found only calling os.setsid() in preexec_fn solves it. removed the other stuff (which i think is a py2 thing)
> not sure why this works though.

See my explanation - my hypothesis was that indeed just setsid() will work :)

Member

@potiuk potiuk left a comment

I think it's a "good enough" solution (see my explanation) :)

@github-actions github-actions bot added the "full tests needed" label (We need to run full set of tests for this PR to merge) Apr 26, 2022
@github-actions

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

It appears signals were not being forwarded properly.  Borrowing some logic from SubprocessHook (namely calling os.setsid in preexec_fn), the issue is resolved.
@dstandish dstandish force-pushed the fix-standalone-command-hanging branch from 2e526e9 to da6c46a Compare April 26, 2022 20:10
@potiuk
Member

potiuk commented Apr 26, 2022

Ok. I think I got to the bottom of it. I think this is the asyncio.sleep function: https://bugs.python.org/issue39622. It seems it was fixed in CPython 3.11 a month ago (python/cpython#83803), and we are likely going to get a fix for that in the new patch-level releases of 3.7, 3.8, 3.9, 3.10.

It looks like it was the Triggerer that caused it, or rather the asyncio used by the Triggerer (that's why we had not seen it before the triggerer was introduced).

It seems that until the fix a month ago, the KeyboardInterrupt handling in asyncio.sleep was done in an async-unsafe way. That looks very close to my hypothesis. I am not 100% sure that was the case, but it is likely.

The fix with setsid() would mitigate the problem because SIGINT would not be propagated to the triggerer. Previously, while asyncio.sleep was running, the triggerer got the SIGINT, started handling it, printed KeyboardInterrupt, and deadlocked itself. By putting the children in their own process groups, we prevent the SIGINT from propagating to them. But the children (including the Triggerer) have a PIPE to print to the parent process, and that pipe gets closed when the main process terminates, so they will instead get SIGPIPE at the moment they try to write to it, and terminate.

More reading for those interested:

https://man7.org/linux/man-pages/man7/signal-safety.7.html

Generally we are not supposed to call non-async-signal-safe functions in a signal handler. For example, printf is not async-signal-safe. Depending on the OS semantics, signals might be delivered while signals are being handled.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04_03

Any function not in the above table may be unsafe with respect to signals. Implementations may make other interfaces async-signal-safe. In the presence of signals, all functions defined by this volume of POSIX.1-2017 shall behave as defined when called from or interrupted by a signal-catching function, with the exception that when a signal interrupts an unsafe function or equivalent (such as the processing equivalent to exit() performed after a return from the initial call to main()) and the signal-catching function calls an unsafe function, the behavior is undefined. Additional exceptions are specified in the descriptions of individual functions such as longjmp().
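
As an illustration of the "do as little as possible in a handler" rule, here is a minimal sketch in which the handler only records that a signal arrived and all real work happens in the main flow. The names `state` and `request_shutdown` are invented, and SIGUSR1 is used only so the demonstration does not interfere with Ctrl+C. Note that CPython already runs Python-level handlers later in the main thread rather than inside the C-level handler, so this shows the general pattern and is not a claim about what went wrong in asyncio.

```python
import os
import signal
import time

# Mutable state shared between the handler and the main flow (illustrative).
state = {"shutdown_requested": False}


def request_shutdown(signum, frame):
    """Only record that the signal arrived; no printing or I/O here."""
    state["shutdown_requested"] = True


previous_handler = signal.signal(signal.SIGUSR1, request_shutdown)

# Send the signal to ourselves, then poll the flag from normal code.
os.kill(os.getpid(), signal.SIGUSR1)
deadline = time.monotonic() + 5
while not state["shutdown_requested"] and time.monotonic() < deadline:
    time.sleep(0.01)

# Cleanup and logging happen here, outside the handler.
signal.signal(signal.SIGUSR1, previous_handler)
print("shutdown requested:", state["shutdown_requested"])
```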

@potiuk
Member

potiuk commented Apr 26, 2022

I think there is a slight chance this one will affect some production deployments. This means that if the triggerer gets SIGINT (Ctrl+C), it might not close cleanly in the current versions of CPython. This is not likely to happen normally when Airflow runs unattended (usually in such cases SIGTERM is used to terminate processes, not SIGINT). But we cannot exclude it, and it also makes any kind of debugging and manual running of Airflow components unpleasant.

Maybe (@andrewgodwin @ashb @dstandish WDYT?) we could add some protection for that (if we all agree this is a likely reason). I think the triggerer could simply add a custom SIGINT handler that exits immediately rather than printing KeyboardInterrupt?
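
One way the suggestion above could look, as a hedged sketch only: a child process installs a SIGINT handler that calls `os._exit()` right away instead of letting `KeyboardInterrupt` propagate. The child script, the `exit_immediately` name, and the exit status 130 are assumptions made for the demonstration; this is not the triggerer's code and not necessarily what the follow-up PR does.

```python
import signal
import subprocess
import sys

# Illustrative stand-in for a long-running component such as the triggerer.
CHILD_CODE = """
import os, signal, time

def exit_immediately(signum, frame):
    # Leave at once: no traceback, no printing, no cleanup handlers.
    os._exit(130)

signal.signal(signal.SIGINT, exit_immediately)
print("ready", flush=True)
time.sleep(30)
"""

proc = subprocess.Popen(
    [sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE, text=True
)

# Wait until the child has installed its handler before sending SIGINT.
ready_line = proc.stdout.readline().strip()
proc.send_signal(signal.SIGINT)
exit_code = proc.wait(timeout=10)
proc.stdout.close()
print("child exit code:", exit_code)
```

The trade-off is that `os._exit()` skips `finally` blocks and `atexit` hooks, so a handler like this only suits processes that have nothing to flush on shutdown.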

@potiuk
Member

potiuk commented Apr 26, 2022

> Maybe (@andrewgodwin @ashb @dstandish WDYT?) we could add some protection for that (if we all agree this is a likely reason). I think the triggerer could simply add a custom SIGINT handler that exits immediately rather than printing KeyboardInterrupt?

This might actually be a better fix that could also fix standalone.

@potiuk
Member

potiuk commented Apr 26, 2022

Right. It is the triggerer. The reason is that we had a SIGTERM handler for the triggerer, and it caused a deadlock with the SIGTERM handler implemented by asyncio in the triggerer. Just disabling the SIGTERM handler in the triggerer fixes standalone handling, and I think it is a better fix. A PR is coming.

@potiuk
Member

potiuk commented Apr 26, 2022

I added a likely permanent fix in #23274.

@dstandish
Contributor Author

> I added a likely permanent fix in #23274.

Nice, thanks @potiuk

@dstandish dstandish closed this Apr 27, 2022
@ashb ashb deleted the fix-standalone-command-hanging branch June 10, 2022 13:00
Labels: area:CLI, full tests needed