Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.10.5
What happened?
After upgrading to Airflow 2.10.5, we received the following deprecation warning:
/home/airflow/.local/lib/python3.10/site-packages/airflow/cli/commands/scheduler_command.py:41 DeprecationWarning: The '[kubernetes_executor] worker_pods_pending_timeout' config option is deprecated. Please update your config to use '[scheduler] task_queued_timeout' instead.
We updated our configuration by setting task_queued_timeout = 300 under the [scheduler] section. A few days later, we noticed that several tasks in our production DAGs were failing after remaining in the queued state for 5 minutes. However, no failure alerts were triggered, even though the DAGs are configured to retry 3 times and send alerts via Slack.
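For reference, the effective setting can be confirmed from inside the Airflow environment; the sketch below uses Airflow's standard configuration accessor, and the expected value of 300 simply reflects what we configured:

```python
# Sketch: confirm the effective queued-timeout value inside the Airflow environment.
# Assumes a standard Airflow 2.10 installation; 300 is the value we configured.
from airflow.configuration import conf

timeout = conf.getfloat("scheduler", "task_queued_timeout")
print(f"[scheduler] task_queued_timeout = {timeout} seconds")  # expected: 300.0
```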
We assumed that since the tasks never started execution, the alerting mechanism might not have been triggered. However, our understanding is that task_queued_timeout should mark the task as failed and trigger alerts.
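For context, here is a minimal sketch of the alerting pattern we rely on; the callback body and identifiers are placeholders rather than our exact Slack code, but our expectation is that a task failed by task_queued_timeout would invoke this on_failure_callback just like any runtime failure:

```python
# Minimal sketch of our alerting wiring (callback body and identifiers are placeholders).
# Expectation: a task failed by task_queued_timeout should invoke this callback too.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_slack_on_failure(context):
    """Placeholder for our Slack alert; in production this posts to a Slack webhook."""
    ti = context["task_instance"]
    print(f"ALERT: {ti.dag_id}.{ti.task_id} failed on try {ti.try_number}")


default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_slack_on_failure,
}

with DAG(
    dag_id="example_alerting_pattern",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
    default_args=default_args,
) as dag:
    BashOperator(task_id="example_task", bash_command="echo ok")
```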
To mitigate the issue, we increased the timeout to 900 seconds (15 minutes). This worked for about three weeks, but the same problem has now returned: tasks are failing after being queued for over 15 minutes, again without any alerts. This is happening even though our Databricks instance pool has 100 nodes of capacity available.
Either failures caused by task_queued_timeout are not wired into the alerting mechanism, or tasks are not being scheduled promptly despite available resources, possibly due to a scheduler or executor issue.
What you think should happen instead?
- Tasks that fail due to task_queued_timeout should trigger failure alerts, just like any other task failure.
- Tasks should not remain in the queued state for extended periods if sufficient executor capacity is available.
How to reproduce
- Upgrade to Airflow 2.10.5.
- Set task_queued_timeout = 300 in airflow.cfg under [scheduler].
- Deploy a DAG with tasks that may experience scheduling delays (a minimal sketch follows this list).
- Observe tasks failing after 5 minutes in the queued state without triggering alerts.
- Increase timeout to 900 seconds and observe recurrence under similar conditions.
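As referenced in the steps above, one hypothetical way to hold a task in the queued state under the KubernetesExecutor is to request resources the cluster cannot satisfy, so the worker pod stays Pending until task_queued_timeout fails the task; the resource figures and dag_id below are illustrative, not taken from our production setup:

```python
# Hypothetical reproduction sketch for KubernetesExecutor: request resources the
# cluster cannot satisfy, so the worker pod stays Pending and the task instance
# remains "queued" until [scheduler] task_queued_timeout marks it failed.
# The resource requests are deliberately unsatisfiable placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

pending_pod_override = k8s.V1Pod(
    spec=k8s.V1PodSpec(
        containers=[
            k8s.V1Container(
                name="base",  # must match the worker container name for the override to merge
                resources=k8s.V1ResourceRequirements(
                    requests={"cpu": "512", "memory": "4Ti"},
                ),
            )
        ]
    )
)

with DAG(
    dag_id="queued_timeout_repro",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="stuck_in_queued",
        bash_command="echo should never run",
        executor_config={"pod_override": pending_pod_override},
    )
```

With task_queued_timeout = 300, we would expect the task above to be marked failed after roughly five minutes and the failure callback to fire.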
Operating System
linux/amd64
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
We maintain our Helm charts in a Git repo and deploy them to Kubernetes using ArgoCD. All of our tasks run on Databricks compute.
Anything else?
- The issue does not appear to be an error in our callback code, since the same callbacks work for other tasks/DAGs.
- Logs for the failed task show:
*** Could not read served logs: Invalid URL 'http://:8793/log/dag_id=...': No host supplied
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct
