Description
Apache Airflow version
3.0.5
If "Other Airflow 2 version" selected, which one?
No response
What happened?
I am running Airflow with the official Helm chart on GKE. I use a template that generates multiple DAGs for a source in a single file (a rough sketch of the pattern follows the list below):
- Extract (cron)
- Raw to bronze (asset trigger)
- Bronze to silver (asset trigger)
- Silver to gold (asset trigger)
- Delta maintenance (cron)
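For context, the generator looks roughly like the sketch below. This is a simplified, hypothetical version: the DAG IDs, asset names, schedules, and task bodies are illustrative, not my exact code.

```python
# Hypothetical sketch of the DAG-factory pattern described above.
# Names, schedules, and task bodies are illustrative only.
from airflow.sdk import Asset, dag, task


def build_dags_for_source(source: str):
    raw = Asset(f"{source}_raw")
    bronze = Asset(f"{source}_bronze")

    @dag(dag_id=f"{source}_extract", schedule="0 2 * * *")
    def extract():
        @task(outlets=[raw])  # publishing this asset triggers the next DAG
        def run_extract():
            ...

        run_extract()

    @dag(dag_id=f"{source}_raw_to_bronze", schedule=[raw])  # asset-triggered
    def raw_to_bronze():
        @task(outlets=[bronze])
        def transform():
            ...

        transform()

    # bronze -> silver, silver -> gold and the delta-maintenance DAG follow
    # the same pattern, so a single file ends up defining five DAGs per source.
    extract()
    raw_to_bronze()


build_dags_for_source("history")
```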
If a task gets stuck in the running state due to a pod OOM error, the DAG disappears from the UI. I am unable to find it when I search for the DAG or asset name.
Checking the DAG processor logs shows there is an error:
Bundle               File Path            PID  Current Duration  # DAGs  # Errors  Last Duration  Last Run At
-------------------  -------------------  ---  ----------------  ------  --------  -------------  -------------------
gitlab-sources-main  dags/history_dag.py                         0       1         0.27s          2025-08-28T14:40:58
However, if I check the Task Instances tab and filter for running tasks, I can see the tasks that hit the OOM error. They are stuck in the running state, which causes the DAG processor to fail to parse the other DAGs in the same file:
After I mark those tasks as failed and wait a few minutes, the DAGs are parsed successfully:
Bundle               File Path            PID  Current Duration  # DAGs  # Errors  Last Duration  Last Run At
-------------------  -------------------  ---  ----------------  ------  --------  -------------  -------------------
gitlab-sources-main  dags/history_dag.py                         5       0         0.81s          2025-08-28T14:59:28
And the DAGs are now viewable in the UI.
I should also add that the stuck tasks are not searchable by name. But if I remove the search filter and only select Running tasks, I can see the root cause.
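For reference, marking the stuck task instances as failed is what unblocks parsing for me. Below is a hedged sketch of doing that through the REST API; the base path (/api/v1 on Airflow 2 vs /api/v2 on Airflow 3) and the auth scheme depend on the version and deployment, so treat the URL, token handling, and the example IDs as assumptions.

```python
# Hedged sketch: set a stuck task instance to "failed" via the Airflow REST API.
# Assumptions: the task-instance PATCH endpoint is available at this path for
# your Airflow version, and AIRFLOW_TOKEN holds a valid API token.
import os

import requests

BASE_URL = "https://airflow.example.com/api/v2"  # hypothetical host; /api/v1 on Airflow 2
HEADERS = {"Authorization": f"Bearer {os.environ['AIRFLOW_TOKEN']}"}


def mark_task_failed(dag_id: str, run_id: str, task_id: str) -> None:
    url = f"{BASE_URL}/dags/{dag_id}/dagRuns/{run_id}/taskInstances/{task_id}"
    resp = requests.patch(url, json={"new_state": "failed"}, headers=HEADERS)
    resp.raise_for_status()


# Example: the OOM'd task from the history DAG file (IDs are illustrative).
mark_task_failed("history_extract", "scheduled__2025-08-28T12:00:00+00:00", "run_extract")
```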
What you think should happen instead?
The DAG processor should be able to parse the DAG file whether or not a task is stuck running. It makes it very hard to know there is an issue if I cannot see the DAG in the UI. Not being able to search for the DAG or asset is also strange, because it clearly exists in the DB and you can see it on the Task Instances page.
The fact that only one out of five DAGs had a problem but all five disappeared is also not ideal.
How to reproduce
- Generate multiple related (asset-triggered) DAGs in a single DAG file
- Hit an OOM error in a Kubernetes pod that causes the task to get stuck running instead of failing outright (see the sketch below for one way to force this)
- The DAG processor will eventually fail to parse the DAG file as a whole, and all of the DAGs defined in that file will disappear from the UI
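One hedged way to force the OOM condition in step 2, assuming the CNCF Kubernetes provider is installed: run a pod whose command allocates more memory than its limit. The DAG ID, image, memory limit, and allocation size below are illustrative, and whether the task then stays stuck in the running state depends on the deployment (that is the behaviour being reported here).

```python
# Hedged sketch: a pod with a tight memory limit running a command that
# allocates more memory than the limit, so the container gets OOM-killed.
from kubernetes.client import models as k8s

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from airflow.sdk import DAG

with DAG(dag_id="oom_repro", schedule=None):
    KubernetesPodOperator(
        task_id="force_oom",
        name="force-oom",
        image="python:3.12-slim",
        cmds=["python", "-c", "x = bytearray(512 * 1024 * 1024); print(len(x))"],
        container_resources=k8s.V1ResourceRequirements(
            requests={"memory": "128Mi"},
            limits={"memory": "128Mi"},
        ),
    )
```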
Operating System
Debian Bookworm
Versions of Apache Airflow Providers
This has been an issue since upgrading to 3.0.0 (so across several provider versions). Here is the current list:
apache-airflow-providers-cncf-kubernetes==10.6.2
apache-airflow-providers-common-compat==1.7.3
apache-airflow-providers-common-io==1.6.2
apache-airflow-providers-common-sql==1.27.4
apache-airflow-providers-docker==4.4.2
apache-airflow-providers-fab==2.3.1
apache-airflow-providers-ftp==3.13.2
apache-airflow-providers-git==0.0.5
apache-airflow-providers-google==17.0.0
apache-airflow-providers-http==5.3.3
apache-airflow-providers-microsoft-azure==12.6.0
apache-airflow-providers-odbc==4.10.2
apache-airflow-providers-postgres==6.2.2
apache-airflow-providers-sftp==5.3.3
apache-airflow-providers-smtp==2.1.2
apache-airflow-providers-ssh==4.1.2
apache-airflow-providers-standard==1.5.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
ArgoCD deploys the Helm chart via Kustomize
Anything else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct