Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs #21397
Comments
It sounds like this would be fixed by adding an index on the DAG model table, right?
Yes, but we can't add an index on the `fileloc` column…
Oh, thanks MySQL. Ah yes, that's why we've already got the `fileloc_hash` column on the DagCode and SerializedDag tables.
In this case, would it make sense to add a…
It might, but if you have another solution that feels less hacky (as I don't like the "split" column all that much), let's take a look.
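For context, the `fileloc_hash` approach mentioned above follows a common pattern: the full file path is too long to index directly on MySQL, so a fixed-width hash of it is stored and indexed instead. A minimal sketch of that pattern follows; column names, types, and hashing details are illustrative, not copied from Airflow's actual models.

```python
# Sketch of the "hash the long path, index the hash" pattern discussed above.
# Names and details are illustrative; see Airflow's DagCode / SerializedDagModel
# for the real implementation.
import hashlib
import struct

from sqlalchemy import BigInteger, Column, Index, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


def hash_fileloc(fileloc: str) -> int:
    """Map an arbitrarily long file path to an indexable integer."""
    # Take the first 8 bytes of the SHA-1 digest; shift right so the value
    # always fits in a signed 64-bit (BigInteger) column.
    return struct.unpack(">Q", hashlib.sha1(fileloc.encode("utf-8")).digest()[:8])[0] >> 8


class ExampleDagModel(Base):
    __tablename__ = "example_dag"

    dag_id = Column(String(250), primary_key=True)
    # Too long to index directly on MySQL with multi-byte collations...
    fileloc = Column(Text, nullable=False)
    # ...so store a fixed-width hash of it and index that instead.
    fileloc_hash = Column(BigInteger, nullable=False)

    __table_args__ = (Index("idx_example_dag_fileloc_hash", fileloc_hash),)
```

Lookups then filter on the indexed `fileloc_hash` first and, if needed, on the full `fileloc` to guard against hash collisions.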
Apache Airflow version
2.2.2
What happened
When we upgraded from Airflow 2.1.2 to 2.2.2, we noticed that our MySQL instance was running near 100% CPU utilization, whereas it had been closer to 20% on Airflow 2.1.2.
After a long investigation, we found that the queries introduced in #17121 are extremely hard on the database, especially when there are multiple schedulers, a high value of `parsing_processes`, and a large number of DAGs.

Our setup is as follows:

- roughly 20,000 rows in the `dag` table
- `parsing_processes=32`
- multiple scheduler replicas
So each time a DAG is parsed, a query like this will be run:
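The SQL statement itself is not reproduced above; per the description, it is the per-file deactivation query added in #17121, which filters the `dag` table on `fileloc`. The following is only a rough ORM-level sketch of that shape, not the actual code from #17121; the helper name and session handling are for illustration.

```python
# Illustrative sketch of the per-file query shape described above -- not the
# exact code introduced in #17121.
from typing import List

from airflow.models.dag import DagModel
from airflow.utils.session import create_session


def deactivate_removed_dags(fileloc: str, active_dag_ids: List[str]) -> None:
    """Deactivate DAGs that point at this file but were not found in it."""
    with create_session() as session:
        (
            session.query(DagModel)
            # `fileloc` has no index, so on MySQL this filter degenerates into
            # a (near) full table scan of the `dag` table.
            .filter(DagModel.fileloc == fileloc)
            .filter(DagModel.dag_id.notin_(active_dag_ids))
            .update({DagModel.is_active: False}, synchronize_session=False)
        )
```

With ~20,000 rows in `dag` and no index on `fileloc`, each such statement scans most of the table.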
And because the `dag` table isn't indexed on `fileloc`, the query ends up doing a full table scan (or nearly a full table scan), and this is repeated for every single file that is processed.

When I added these queries, I tested the change in a local `breeze` environment with a relatively small number of DAGs and therefore did not notice the performance implications.

At our scale and configuration, we have approximately 128 of these poorly-performing queries running in parallel, each scanning approximately 20,000 rows. Understandably, this was really hard on the database and ended up drastically impacting the performance of other queries.
We were able to reduce the impact by lowering `parsing_processes`, cleaning up old entries in the `dag` table, and increasing `min_file_process_interval`, but none of these mitigations addresses the root of the problem.

We are currently working on a fix that moves this cleanup into the DAG processor manager and eliminates the un-indexed queries; we should be able to submit a preliminary pull request for review in the next few days.
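For illustration, one way to realize the direction described above is a single periodic pass in the DAG file processor manager, keyed on a timestamp column such as `last_parsed_time` rather than on the un-indexed `fileloc`. This is a sketch of that idea under that assumption, not the implementation that was merged in #21399.

```python
# Sketch of the general idea only: one periodic, batched deactivation pass in
# the DAG file processor manager, instead of an un-indexed query per parsed
# file. This is NOT the implementation that was merged in #21399.
from datetime import timedelta

from airflow.models.dag import DagModel
from airflow.utils import timezone
from airflow.utils.session import create_session

STALE_THRESHOLD = timedelta(minutes=10)  # illustrative value


def deactivate_stale_dags() -> None:
    """Deactivate DAGs that the processor has not re-parsed recently."""
    cutoff = timezone.utcnow() - STALE_THRESHOLD
    with create_session() as session:
        (
            session.query(DagModel)
            .filter(DagModel.is_active.is_(True))
            .filter(DagModel.last_parsed_time < cutoff)
            .update({DagModel.is_active: False}, synchronize_session=False)
        )
```

The point of such a design is that the database sees one batched UPDATE per cycle instead of one `fileloc`-filtered statement for every parsed file.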
What you expected to happen
Removing DAGs which no longer exist in files should not put so much strain on the database.
How to reproduce
Use a high number of `parsing_processes` and/or scheduler replicas in an Airflow 2.2+ environment with many DAGs.

Operating System
Debian GNU/Linux 10 (buster)
Deployment
Other 3rd-party Helm chart
Deployment details
Airflow 2.2.2 on Kubernetes
MySQL 8
Are you willing to submit PR?
Code of Conduct