
Issue with task_queued_timeout Causing Silent Task Failures in Airflow 2.10.5 #51301

@KarthikeyanDevendhiran

Description


Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.10.5

What happened?

After upgrading to Airflow 2.10.5, we received the following deprecation warning:

/home/airflow/.local/lib/python3.10/site-packages/airflow/cli/commands/scheduler_command.py:41 DeprecationWarning: The '[kubernetes_executor] worker_pods_pending_timeout' config option is deprecated. Please update your config to use '[scheduler] task_queued_timeout' instead.
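For context, the change we made amounts to moving the timeout from the deprecated [kubernetes_executor] option to the [scheduler] section of airflow.cfg, roughly like this (values are ours):

    # airflow.cfg -- before (deprecated option)
    [kubernetes_executor]
    worker_pods_pending_timeout = 300

    # airflow.cfg -- after (what we set following the warning)
    [scheduler]
    task_queued_timeout = 300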

  • We updated our configuration by setting task_queued_timeout = 300 under the [scheduler] section. A few days later, we noticed that several tasks in our production DAGs were failing after remaining in the queued state for 5 minutes. However, no failure alerts were triggered, even though the DAGs are configured to retry 3 times and send alerts via Slack.

  • We assumed that since the tasks never started execution, the alerting mechanism might not have been triggered. However, our understanding is that task_queued_timeout should mark the task as failed and trigger alerts.

  • To mitigate the issue, we increased the timeout to 900 seconds (15 minutes). This worked for about three weeks, but we are now seeing the same issue again: tasks are failing after being queued for over 15 minutes, without any alerts. This is happening despite 100 nodes of capacity being available in our Databricks instance pool.

Either failures caused by task_queued_timeout are not wired into the alerting mechanism, or tasks are not being scheduled promptly despite available resources, which would point to a scheduler or executor issue.
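For reference, our alerting is wired through a per-task on_failure_callback in default_args, along the lines of the sketch below. This is simplified: slack_alert, the webhook URL, and example_databricks_job are placeholders for our actual notifier and DAGs, but it shows why we expect a callback (and therefore a Slack alert) for any task failure, including one caused by task_queued_timeout.

    from datetime import datetime, timedelta

    import requests

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder for our real webhook


    def slack_alert(context):
        """Failure callback: post the failed task's identity to Slack."""
        ti = context["task_instance"]
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"Task failed: {ti.dag_id}.{ti.task_id} (try {ti.try_number})"},
            timeout=10,
        )


    default_args = {
        "retries": 3,                        # three retries before a hard failure
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": slack_alert,  # expected to fire for any task failure
    }

    with DAG(
        dag_id="example_databricks_job",     # placeholder DAG id
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:
        PythonOperator(task_id="run_job", python_callable=lambda: None)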

What you think should happen instead?

  • Tasks that fail due to task_queued_timeout should trigger failure alerts, just like any other task failure.
  • Tasks should not remain in the queued state for extended periods if sufficient executor capacity is available.

How to reproduce

  • Upgrade to Airflow 2.10.5.
  • Set task_queued_timeout = 300 in airflow.cfg under [scheduler] (see the verification sketch after this list).
  • Deploy a DAG with tasks that may experience scheduling delays.
  • Observe tasks failing after 5 minutes in the queued state without triggering alerts.
  • Increase timeout to 900 seconds and observe recurrence under similar conditions.
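To rule out the configuration simply not being picked up, the effective value can be checked from inside the scheduler container (a minimal sketch using the standard airflow.configuration accessor):

    from airflow.configuration import conf

    # Should print the timeout actually in effect (we expected 300, later 900)
    print(conf.getfloat("scheduler", "task_queued_timeout"))

In our case the failures line up with the configured value (5 and later 15 minutes), so the option itself appears to be applied.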

Operating System

linux/amd64

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details


We maintain our Helm charts in a Git repo and deploy them to Kubernetes using ArgoCD. All of our tasks run on Databricks compute.
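Since the deployment goes through the official chart, the timeout override is applied via the chart's config passthrough in values.yaml, roughly as follows (assuming the chart's standard config-to-airflow.cfg mapping):

    # values.yaml (rendered into airflow.cfg by the chart)
    config:
      scheduler:
        task_queued_timeout: 900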

Anything else?

  • Does not appear to be a callback code error, as it works for other tasks/DAGs.
  • Logs for the failed task show:
    *** Could not read served logs: Invalid URL 'http://:8793/log/dag_id=...': No host supplied
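The empty host in that URL suggests the task instance never had a hostname recorded, i.e. it never actually started on a worker before being failed. One way to confirm the pattern is to look for failed task instances that have no start_date in the metadata DB (a sketch using Airflow's ORM; the dag_id is a placeholder):

    from airflow.models import TaskInstance
    from airflow.utils.session import create_session
    from airflow.utils.state import TaskInstanceState

    # Failed task instances with no start_date most likely went straight
    # from queued to failed (e.g. via the task_queued_timeout check).
    with create_session() as session:
        stuck = (
            session.query(TaskInstance)
            .filter(
                TaskInstance.dag_id == "example_databricks_job",  # placeholder
                TaskInstance.state == TaskInstanceState.FAILED,
                TaskInstance.start_date.is_(None),
            )
            .all()
        )
        for ti in stuck:
            print(ti.dag_id, ti.task_id, ti.run_id, ti.queued_dttm)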

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

area:Scheduler (including HA scheduler), area:core, kind:bug
