
Conversation

@amoghrajesh
Contributor

@amoghrajesh amoghrajesh commented Sep 17, 2025

closes: #55753

Problem

When tasks are killed by system signals (SIGKILL for OOM, SIGTERM for worker restarts), they immediately go to the FAILED state instead of respecting the configured task retries and going to UP_FOR_RETRY. This creates inconsistent behavior: exception-based failures respect retries, but signal-based failures don't.
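For background, here is a minimal, hedged illustration (plain Python, not Airflow code) of what a signal-based failure looks like from the parent process on POSIX: a child killed by signal N reports a return code of -N, which is all an exit-code check has to go on.

import signal
import subprocess

# Standalone demonstration only: a child killed by a signal surfaces as a
# negative return code in the parent process.
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGTERM)  # simulate e.g. a worker restart killing the task
proc.wait()
print(proc.returncode)  # -15, i.e. -signal.SIGTERM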

Root Cause

The supervisor's final_state property only checked the exit code and didn't consider retry eligibility for signal-based failures. While exception-based failures properly checked should_retry, which is set by the API server when a task is run for the first time (i.e., in its run context), signal-based failures ignored this logic entirely.
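To make the intent concrete, here is a simplified sketch of the decision the fix introduces, assuming the supervisor sees a negative exit code when the task process dies from a signal and a should_retry flag from the API server. The names, the set of retryable signals, and the state strings below are illustrative assumptions, not the actual Task SDK supervisor code.

import signal

# Illustrative sketch only: names, the retryable-signal set, and state strings
# are assumptions, not the real supervisor implementation.
RETRYABLE_SIGNALS = {signal.SIGKILL, signal.SIGTERM}  # e.g. OOM kill, worker restart

def final_state(exit_code: int, should_retry: bool) -> str:
    """Map the watched process's exit code to a terminal task state."""
    if exit_code == 0:
        return "success"
    # On POSIX, a negative exit code means "terminated by signal -exit_code".
    if exit_code < 0 and -exit_code in RETRYABLE_SIGNALS and should_retry:
        return "up_for_retry"  # honor configured retries, as exception-based failures already do
    # In this sketch, exception-based failures are assumed to be reported by the
    # task process itself before it exits, so they are not handled here.
    return "failed"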

Testing

DAG used:

from airflow import DAG
from datetime import datetime

from airflow.providers.standard.operators.python import PythonOperator


def func():
    # Grow a string without bound so the task process is eventually OOM-killed.
    a = "asd"
    while True:
        a += a * 100000

with DAG(
    dag_id="oom_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
    doc_md=__doc__,
    tags=["oom"],
) as dag:
    hello_task = PythonOperator(
        task_id="oom_task",
        python_callable=func,
        retries=3
    )

Earlier:

The task immediately moved into the FAILED state without any retries:

[screenshot: task immediately marked FAILED]

After:

The task moves into the UP_FOR_RETRY state as appropriate:

[screenshot: task in UP_FOR_RETRY]

See retries:

[screenshot: retry attempts]


Member

@ashb ashb left a comment


SIGABRT and SIGSEGV should not retry

Why? That's not what I would expect to do (and also I suspect not what Airflow 2 did?)

@kaxil
Member

kaxil commented Sep 17, 2025

Update: Never mind; thanks to @rawwar, I figured out that K8s/cgroups have configs to kill only the process.


When tasks are killed by system signals (SIGKILL for OOM, SIGTERM for worker restarts), they immediately go to FAILED state instead of respecting the task retries set and going to UP_FOR_RETRY state. This creates unexpected behavior where exception based failures respect retries but signal based failures don't.

How common is this scenario (excluding manually killing the task process)? Since the supervisor and task processes are running in the same container, wouldn't an OOM condition typically kill the entire container rather than just the individual task process?

In the more common case where the entire container gets OOM-killed:

  1. The supervisor process would also die
  2. Heartbeat to the scheduler would fail
  3. Scheduler would receive a FAILED executor event and handle retries through the normal process_executor_events() → handle_failure() path

@ashb
Member

ashb commented Sep 19, 2025

@kaxil Also people do occasionally run Airflow outside of Kubernetes you know 😉

@amoghrajesh
Contributor Author

I'll get to the comments on this later, not needed for 3.1, can wait till 3.1.1

@amoghrajesh amoghrajesh added this to the Airflow 3.1.1 milestone Sep 19, 2025
@amoghrajesh
Contributor Author

@ashb replied to some of your comments, could you take a look when possible?

@amoghrajesh amoghrajesh requested a review from ashb September 29, 2025 08:02
@kaxil kaxil modified the milestones: Airflow 3.1.1, Airflow 3.1.2 Oct 21, 2025
@rawwar
Contributor

rawwar commented Oct 27, 2025

Verified by running a task that gets killed due to OOM; the task went to "up_for_retry".

[screenshot: task in up_for_retry after the OOM kill]

Just wanted to let it run all tries. It did fail at the end:
[screenshot: task FAILED after exhausting all retries]

Contributor

@rawwar rawwar left a comment


Thanks!

@potiuk potiuk added the backport-to-v3-1-test label Oct 27, 2025
@amoghrajesh amoghrajesh force-pushed the handle-signal-task-better branch from 3191733 to 24ef66f Compare October 28, 2025 08:36
@amoghrajesh amoghrajesh merged commit de0c78e into apache:main Oct 28, 2025
82 checks passed
@amoghrajesh amoghrajesh deleted the handle-signal-task-better branch October 28, 2025 09:38
github-actions bot pushed a commit that referenced this pull request Oct 28, 2025
(cherry picked from commit de0c78e)

Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
@github-actions

Backport successfully created: v3-1-test

Branch: v3-1-test, Result: PR Link

github-actions bot pushed a commit to aws-mwaa/upstream-to-airflow that referenced this pull request Oct 28, 2025
(cherry picked from commit de0c78e)

Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
kaxil pushed a commit that referenced this pull request Oct 31, 2025
kaxil pushed a commit that referenced this pull request Oct 31, 2025
@ephraimbuddy ephraimbuddy added the type:bug-fix label Nov 10, 2025

Labels

area:task-sdk, backport-to-v3-1-test (Mark PR with this label to backport to v3-1-test branch), type:bug-fix (Changelog: Bug Fixes)

Development

Successfully merging this pull request may close these issues.

Task does not retry when worker is killed due to OOM

6 participants