
Conversation

@amoghrajesh
Contributor

@amoghrajesh amoghrajesh commented Sep 17, 2025

closes: #55753

Problem

When tasks are killed by system signals (SIGKILL for OOM, SIGTERM for worker restarts), they immediately go to the FAILED state instead of respecting the configured task retries and going to UP_FOR_RETRY. This creates inconsistent behavior: exception-based failures respect retries, but signal-based failures don't.
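For background, here is a minimal, hedged illustration (plain Python, not Airflow code) of what a signal-based failure looks like from the parent process on POSIX: a child killed by signal N reports a return code of -N, which is all an exit-code check has to go on.

import signal
import subprocess

# Standalone demonstration only: a child killed by a signal surfaces as a
# negative return code in the parent process.
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGTERM)  # simulate e.g. a worker restart killing the task
proc.wait()
print(proc.returncode)  # -15, i.e. -signal.SIGTERM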

Root Cause

The supervisor's final_state property only checked the exit code and didn't consider retry eligibility for signal-based failures. While exception-based failures properly checked should_retry, which is set by the API server when a task is run for the first time (i.e., in its run context), signal-based failures ignored this logic entirely.
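To make the intent concrete, here is a simplified sketch of the decision the fix introduces, assuming the supervisor sees a negative exit code when the task process dies from a signal and a should_retry flag from the API server. The names, the set of retryable signals, and the state strings below are illustrative assumptions, not the actual Task SDK supervisor code.

import signal

# Illustrative sketch only: names, the retryable-signal set, and state strings
# are assumptions, not the real supervisor implementation.
RETRYABLE_SIGNALS = {signal.SIGKILL, signal.SIGTERM}  # e.g. OOM kill, worker restart

def final_state(exit_code: int, should_retry: bool) -> str:
    """Map the watched process's exit code to a terminal task state."""
    if exit_code == 0:
        return "success"
    # On POSIX, a negative exit code means "terminated by signal -exit_code".
    if exit_code < 0 and -exit_code in RETRYABLE_SIGNALS and should_retry:
        return "up_for_retry"  # honor configured retries, as exception-based failures already do
    # In this sketch, exception-based failures are assumed to be reported by the
    # task process itself before it exits, so they are not handled here.
    return "failed"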

Testing

DAG used:

from airflow import DAG
from datetime import datetime

from airflow.providers.standard.operators.python import PythonOperator


def func():
    # Grow a string without bound so the task process is eventually OOM-killed.
    a = "asd"
    while True:
        a += a * 100000

with DAG(
    dag_id="oom_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
    doc_md=__doc__,
    tags=["oom"],
) as dag:
    hello_task = PythonOperator(
        task_id="oom_task",
        python_callable=func,
        retries=3
    )

Earlier:

The task immediately moved into the FAILED state without any retries:

[screenshot: task immediately marked FAILED]

After:

The task moves into the UP_FOR_RETRY state as appropriate:

[screenshot: task in UP_FOR_RETRY]

See retries:

[screenshot: retry attempts]


Member

@ashb ashb left a comment


SIGABRT and SIGSEGV should not retry

Why? That's not what I would expect to do (and also I suspect not what Airflow 2 did?)

@kaxil
Member

kaxil commented Sep 17, 2025

Update: Never mind; thanks to @rawwar, I figured out that K8s/cgroups have configs to kill only the process.


When tasks are killed by system signals (SIGKILL for OOM, SIGTERM for worker restarts), they immediately go to FAILED state instead of respecting the task retries set and going to UP_FOR_RETRY state. This creates unexpected behavior where exception based failures respect retries but signal based failures don't.

How common is this scenario (excluding manually killing the task process)? Since the supervisor and task processes are running in the same container, wouldn't an OOM condition typically kill the entire container rather than just the individual task process?

In the more common case where the entire container gets OOM-killed:

  1. The supervisor process would also die
  2. Heartbeat to the scheduler would fail
  3. Scheduler would receive a FAILED executor event and handle retries through the normal process_executor_events() → handle_failure() path

@ashb
Member

ashb commented Sep 19, 2025

@kaxil Also people do occasionally run Airflow outside of Kubernetes you know 😉

@amoghrajesh
Contributor Author

I'll get to the comments on this later, not needed for 3.1, can wait till 3.1.1

@amoghrajesh amoghrajesh added this to the Airflow 3.1.1 milestone Sep 19, 2025
@amoghrajesh
Contributor Author

@ashb replied to some of your comments, could you take a look when possible?

@amoghrajesh amoghrajesh requested a review from ashb September 29, 2025 08:02
@kaxil kaxil modified the milestones: Airflow 3.1.1, Airflow 3.1.2 Oct 21, 2025
@rawwar
Contributor

rawwar commented Oct 27, 2025

Verified by running a task that gets killed due to OOM; the task went to "up_for_retry".

[screenshot: task in up_for_retry after the OOM kill]

Just wanted to let it run all tries. It did fail at the end:
[screenshot: task FAILED after exhausting all retries]

Contributor

@rawwar rawwar left a comment


Thanks!

@potiuk potiuk added the backport-to-v3-1-test label Oct 27, 2025
@amoghrajesh amoghrajesh force-pushed the handle-signal-task-better branch from 3191733 to 24ef66f Compare October 28, 2025 08:36
@amoghrajesh amoghrajesh merged commit de0c78e into apache:main Oct 28, 2025
82 checks passed
@amoghrajesh amoghrajesh deleted the handle-signal-task-better branch October 28, 2025 09:38
github-actions bot pushed a commit that referenced this pull request Oct 28, 2025
(cherry picked from commit de0c78e)

Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
@github-actions

Backport successfully created: v3-1-test

Branch: v3-1-test, Result: PR Link

github-actions bot pushed a commit to aws-mwaa/upstream-to-airflow that referenced this pull request Oct 28, 2025
(cherry picked from commit de0c78e)

Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
kaxil pushed a commit that referenced this pull request Oct 31, 2025
kaxil pushed a commit that referenced this pull request Oct 31, 2025
@ephraimbuddy ephraimbuddy added the type:bug-fix label Nov 10, 2025

Labels

area:task-sdk, backport-to-v3-1-test (Mark PR with this label to backport to v3-1-test branch), type:bug-fix (Changelog: Bug Fixes)

Development

Successfully merging this pull request may close these issues.

Task does not retry when worker is killed due to OOM

6 participants