Labels
AIP-64 (Task Instance history), area:logging, kind:bug, provider:cncf-kubernetes
Description
Apache Airflow version
2.10.5
If "Other Airflow 2 version" selected, which one?
No response
What happened?
When a task pod launches successfully, but the Kubernetes API server starts returning 429 Too Many Requests errors:
- KubernetesJobWatcher crashes, causing the Airflow Scheduler to restart.
- Upon restart, the Scheduler fails to re-adopt the running pod because the K8s API remains unavailable due to continued 429s.
- As a result, the task is marked as orphaned and its state is reset to None.
- Airflow only calls TaskInstanceHistory.record_ti() during failure handling if the task was in the running state. Since the state has now been reset to None, record_ti() is never called.
Consequently, there is no TaskInstanceHistory record, and the Airflow UI shows missing log links for that attempt.
What you think should happen instead?
Even if a task becomes orphaned and its state is reset to None, Airflow should still record a TaskInstanceHistory entry so that a complete log history remains available for troubleshooting. Currently, TI history is recorded only when the state is running:
```python
if ti.state == TaskInstanceState.RUNNING:
    # If the task instance is in the running state, it means it raised an exception and
    # about to retry so we record the task instance history. For other states, the task
    # instance was cleared and already recorded in the task instance history.
    from airflow.models.taskinstancehistory import TaskInstanceHistory

    TaskInstanceHistory.record_ti(ti, session=session)
```
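The gap can be illustrated with a minimal, self-contained simulation (the names `handle_failure` and `history` below are hypothetical stand-ins, not Airflow's actual models):

```python
# Stand-in simulation of the guard quoted above: history is recorded only for
# tasks still in the RUNNING state, so an orphaned task whose state was reset
# to None is silently skipped.

RUNNING = "running"
history = []  # stands in for TaskInstanceHistory records


def handle_failure(ti_state):
    # Mirrors the guard in Airflow's failure handling.
    if ti_state == RUNNING:
        history.append(ti_state)


handle_failure(RUNNING)  # normal retry path: a history entry is recorded
handle_failure(None)     # orphaned task, state reset to None: nothing recorded

print(len(history))  # only the first attempt left a history entry
```

A fix could extend the guard to also record history for the orphaned-and-reset case, though the exact condition would need care so that tasks that were deliberately cleared (and already recorded) do not get duplicate entries.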
How to reproduce
Steps to trigger this behavior:
- Launch a task pod successfully in Airflow running with KubernetesExecutor or CeleryKubernetesExecutor.
- Artificially throttle the Kubernetes API server (e.g., by applying API rate limiting policies or load testing the API) so that it starts returning 429 Too Many Requests consistently.
- Observe that:
- KubernetesJobWatcher crashes.
- Scheduler restarts.
- Scheduler is unable to re-adopt the running task pod.
- The task is marked as orphaned.
- TaskInstance state is reset to None.
- No TaskInstanceHistory entry is created for the failed attempt.
- Airflow UI shows missing log link for the corresponding attempt.
Operating System
Debian GNU/Linux
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct