
TI history missing after Scheduler restart during K8s 429 error #49517

@whynick1

Description

Apache Airflow version

2.10.5

If "Other Airflow 2 version" selected, which one?

No response

What happened?

When a task pod launches successfully, but the Kubernetes API server starts returning 429 Too Many Requests errors:

  • KubernetesJobWatcher crashes, causing the Airflow Scheduler to restart.
  • Upon restart, the Scheduler fails to re-adopt the running pod because the K8s API remains unavailable due to continued 429s.
  • As a result, the task is marked orphaned and its state is reset to None.
  • Airflow's logic only calls TaskInstanceHistory.record_ti() during failure handling if the task was in a running state. Since the state is now reset to None, record_ti() is never called.

Consequently, there is no TaskInstanceHistory record, and the Airflow UI shows missing log links for that attempt.
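The failure sequence can be sketched with stand-in types. Every name below is hypothetical, none of these are Airflow's actual classes; the point is only to show why the history-recording guard is skipped once the state has been reset:

```python
from enum import Enum


class TIState(Enum):
    """Stand-in for Airflow's TaskInstanceState (hypothetical)."""
    RUNNING = "running"
    NONE = None  # state after the task is orphaned


def handle_scheduler_restart(ti_state, k8s_api_healthy):
    """Stand-in for the re-adoption step after a Scheduler restart:
    if the K8s API is still returning 429s, the running pod cannot
    be re-adopted and the TI state is reset to None (orphaned)."""
    if not k8s_api_healthy:
        return TIState.NONE
    return ti_state


def records_history(ti_state):
    """Mirrors the guard in Airflow's failure handling: history is
    only recorded while the TI is still in the RUNNING state."""
    return ti_state == TIState.RUNNING


# Pod launched, task running; then the API server starts throwing 429s.
state = TIState.RUNNING
state = handle_scheduler_restart(state, k8s_api_healthy=False)
print(state, records_history(state))  # state is NONE, so no history is recorded
```

With the state already reset to None by the time failure handling runs, the `record_ti()` branch is never reached, which is exactly the gap described above.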

What you think should happen instead?

Even if a task becomes orphaned and its state is reset to None, Airflow should still record a TaskInstanceHistory entry so users have a complete log history for troubleshooting. Currently, TI history is only recorded when the state is running:

```python
if ti.state == TaskInstanceState.RUNNING:
    # If the task instance is in the running state, it means it raised an exception and
    # about to retry so we record the task instance history. For other states, the task
    # instance was cleared and already recorded in the task instance history.
    from airflow.models.taskinstancehistory import TaskInstanceHistory

    TaskInstanceHistory.record_ti(ti, session=session)
```
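One possible direction is to snapshot the history at the point where the orphaned task's state is reset, before it becomes None. The sketch below uses hypothetical stand-ins (`TI`, `record_ti`, `reset_orphaned_task` are not Airflow's real code); it only illustrates the ordering:

```python
history = []


class TI:
    """Hypothetical minimal task instance."""
    def __init__(self):
        self.state = "running"
        self.try_number = 1


def record_ti(ti, session=None):
    """Stand-in for TaskInstanceHistory.record_ti."""
    history.append((ti.try_number, ti.state))


def reset_orphaned_task(ti, session=None):
    """Sketch of the proposed ordering: record the attempt while the
    running state is still visible, then reset the state as before."""
    if ti.state == "running":
        record_ti(ti, session=session)  # capture the attempt first
    ti.state = None  # then reset to None (orphaned)


ti = TI()
reset_orphaned_task(ti)
print(history)  # → [(1, 'running')]
```

This way the attempt keeps a history row (and the UI keeps its log link) regardless of the later guard on `ti.state`.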

How to reproduce

Steps to trigger this behavior:

  1. Launch a task pod successfully in an Airflow deployment running the KubernetesExecutor or CeleryKubernetesExecutor.
  2. Artificially throttle the Kubernetes API server (e.g., by applying API rate-limiting policies or load-testing the API) so that it consistently returns 429 Too Many Requests.
  3. Observe that:
  • KubernetesJobWatcher crashes.
  • Scheduler restarts.
  • Scheduler is unable to re-adopt the running task pod.
  • The task is marked as orphaned.
  • TaskInstance state is reset to None.
  • No TaskInstanceHistory entry is created for the failed attempt.
  • Airflow UI shows missing log link for the corresponding attempt.
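The crash in the first observation can be simulated without a cluster. The sketch below is stdlib-only; `ApiException` is a stand-in for `kubernetes.client.rest.ApiException`, and the retry loop is a simplified stand-in for the real KubernetesJobWatcher, not its actual logic:

```python
class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def throttled_watch_stream():
    # Simulates a K8s API server that consistently answers
    # 429 Too Many Requests.
    raise ApiException(status=429)


def run_watcher(max_retries=3):
    """Hypothetical watcher loop: tolerates a few 429s, then gives
    up, mirroring the KubernetesJobWatcher crash in step 3."""
    for _ in range(max_retries):
        try:
            throttled_watch_stream()
        except ApiException as e:
            if e.status != 429:
                raise
            # 429: retry until the budget is exhausted
    raise RuntimeError("watcher crashed: K8s API kept returning 429")
```

Once the watcher process dies like this, the Scheduler restart and failed re-adoption follow as listed above.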

Operating System

Debian GNU/Linux

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
