-
Notifications
You must be signed in to change notification settings - Fork 16.3k
Respect task retries for signal killed tasks #55767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Respect task retries for signal killed tasks #55767
Conversation
ashb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SIGABRT and SIGSEGV should not retry
Why? That's not what I would expect to do (and also I suspect not what Airflow 2 did?)
|
Update: Nvm, thanks to @rawwar figured out that K8s/cgroups has configs to kill only the process
How common is this scenario (excluding manually killing task process)? Since the supervisor and task processes are running in the same container, wouldn't an OOM condition typically kill the entire container rather than just the individual task process? In the more common case where the entire container gets OOM-killed:
|
|
@kaxil Also people do occasionally run Airflow outside of Kubernetes you know 😉 |
|
I'll get to the comments on this later, not needed for 3.1, can wait till 3.1.1 |
|
@ashb replied to some of your comments, could you take a look when possible? |
rawwar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
3191733 to
24ef66f
Compare
(cherry picked from commit de0c78e) Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
(cherry picked from commit de0c78e) Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
(cherry picked from commit de0c78e)
(cherry picked from commit de0c78e)


closes: #55753
Problem
When tasks are killed by system signals (SIGKILL for OOM, SIGTERM for worker restarts), they immediately go to FAILED state instead of respecting the task retries set and going to UP_FOR_RETRY state. This creates unexpected behavior where exception based failures respect retries but signal based failures don't.
Root Cause
The supervisor's
final_stateproperty only checked the exit code and didn't consider retry eligibility for signal based failures. While exception based failures properly checkedshould_retrywhich is set by the API server when a task is run for the first time a.k.a, in its run context, signal based failures ignored this logic entirely.Testing
DAG used:
Earlier:
The dag immediately moved into FAILED state without any retries;
After:
The dag moves into up for retry state as appropriate:
See retries:
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rstor{issue_number}.significant.rst, in airflow-core/newsfragments.