-
Notifications
You must be signed in to change notification settings - Fork 14.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix standalone command hanging on kill #23271
Conversation
/cc @andrewgodwin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems very sensible to me.
update... i found only calling not sure why this works though. |
Yeah. Thanks for the context @dstandish . I think this might work but It would be great to find out what was the real cause for the hanging. I was looking at it and scratching my head why this would work and I think it masks rather than solves the problem but maybe this is a good solution for now - but likely we do not need to handle the three signals separately. The reason why I think it might "work" is setsid() not the default signal handling loop. What it does - it effectively changes the child's process group to be different than the parent. The thing is that (not well known fact) when you press Ctrl+C, you do not set SIGINT signal to the process that is running in the foreground, but the SIGINT is sent to the whole process group. By calling setsid in preexec, you set the group of the child process to be the same as process id - effectively creates a new process group. This means that when you press Ctrl+C with standalone process, the SIGINT signal will not be propagated to the children - it will only be sent to the "parent" standalone process to exit. And then the child process will all get SIGPIPE at the first I think what could be the reason for the original implementation initial hanging is a race condition when there are multiple signals handled. The hypothesis I have:
|
See my explanation - my hypotheis was that indeed just setsid() will work :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's "good enough" solution (see my explanation) :)
The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease. |
It appears signals were not being forwarded properly. Borrowing some logic from SubprocessHook (namely calling os.setsid in preexec_fn), the issue is resolved.
2e526e9
to
da6c46a
Compare
Ok. I think I got to the bottom of it. I think this is asyncio.sleep function : https://bugs.python.org/issue39622 - seems it's been fixed in 3.11 CPython a month ago python/cpython#83803 and we are likely going to get a fix for that in the new patchlevel releases of 3.7, 3.8. 3.9, 3.10. It looks like it was the Triggerer that caused it - or rather the async.io used by Triggerrer (that's why we have not seen it before triggerer was introduced): Seems that until the fix a month ago the asyncio.sleep KeyboardInterrupt handling was done in async-unsafe way. Looks like very close to my hypothesis. Not 100% sure if that was the case - but it's likely this was the case. The fix with More reading for those interested: https://man7.org/linux/man-pages/man7/signal-safety.7.html Generally we are not supposed to call non https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04_03
|
I think there is a slight chance this one will affect some production deployments. This means that of the triggerer gets SIGINT (Ctrl+C) it might not close cleanly in the current versions of CPython. This is not likely to happen normally when Airflow run in unattended way (usually in such cases SIGTERM is used to terminate processes not SIGINT). But we cannot exclude it and also it makes it not really nice for any kind of debugging and manual running of airflow components. Maybe (@andrewgodwin @ashb @dstandish WDYT?) we could add some protection for that (if we all agree this is a likely reason) - I think that triggerer could simply add a custom SIGINT handler which would exist immediately, rather than print Keyboard Interrupt printing ? |
This might actually be a better fix that could also fix standalone. |
Right. It is triggerer. The reason is that we had a SIGTERM handler for triggerer and it caused a deadlock with the sigterm handler implemented by the async.io in triggerer. Just disabling the SIGTERM handler in triggerer fixes standalone handling and I think it is a better fix. PR is coming. |
I added likely permanent fix in #23274 |
It appears signals were not being forwarded properly. After appropriating the preexec_fn from SubprocessHook, the issue is resolved.
I'm not sure why this code is necessary. I was able to find some context around the original addition of this logic to bash operator:
The initial commit: ca96104
The jira issue: https://issues.apache.org/jira/browse/AIRFLOW-1745
a referenced stackoverflow post: https://stackoverflow.com/questions/22077881/yes-reporting-error-with-subprocess-communicate