
Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs #21397

Closed
SamWheating opened this issue Feb 7, 2022 · 8 comments
Labels: area:core, kind:bug

Comments

@SamWheating
Contributor

SamWheating commented Feb 7, 2022

Apache Airflow version

2.2.2

What happened

When we upgraded from Airflow 2.1.2 to 2.2.2 we noticed that our MySQL instance was running near 100% CPU utilization while it was closer to 20% on Airflow 2.1.2.

After a long investigation, we found that the queries introduced in #17121 are extremely hard on the database, especially when there are multiple schedulers, a high value of parsing_processes and a large number of DAGs.

Our setup is as follows:

  • 4x Schedulers
  • ~20k rows in the dag table
  • ~10k DAG files
  • parsing_processes=32

So each time a DAG is parsed, a query like this will be run:

UPDATE dag
SET is_active=0
WHERE dag.fileloc = '/path/to_my/dag.py'
AND dag.is_active = 1
AND dag.dag_id NOT IN ('my-dag-1', 'my-dag-2');

And because the dag table isn't indexed on fileloc, the query ends up doing a full table scan (or nearly one), and this is repeated for every single file which is processed.
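To make the pattern concrete, here is a small self-contained sketch of the per-file deactivation query using a stand-in SQLite database (the real deployment uses MySQL, and the table shape here is illustrative, not the actual Airflow schema):

```python
# Sketch of the per-file deactivation pattern against a stand-in SQLite
# database. Table shape is illustrative, not the actual Airflow schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dag (dag_id TEXT PRIMARY KEY, fileloc TEXT, is_active INTEGER)"
)
conn.executemany(
    "INSERT INTO dag VALUES (?, ?, ?)",
    [
        ("my-dag-1", "/path/to_my/dag.py", 1),  # still defined in the file
        ("old-dag", "/path/to_my/dag.py", 1),   # removed from the file
        ("other-dag", "/path/to/other.py", 1),  # lives in a different file
    ],
)
# With no index on fileloc, the WHERE clause below forces a scan of the
# whole dag table -- and one of these runs per parsed file.
conn.execute(
    "UPDATE dag SET is_active = 0 "
    "WHERE fileloc = ? AND is_active = 1 AND dag_id NOT IN (?, ?)",
    ("/path/to_my/dag.py", "my-dag-1", "my-dag-2"),
)
remaining_active = sorted(
    row[0] for row in conn.execute("SELECT dag_id FROM dag WHERE is_active = 1")
)
```

Only "old-dag" is deactivated: it shares the fileloc but no longer appears in the parsed file's DAG list.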

When I added these queries, I tested the change in a local breeze environment with a relatively small number of DAGs and thus did not notice the performance implications.

At our scale and configuration, we have approximately 128 of these poorly performing queries running in parallel, each scanning approximately 20,000 rows. Understandably, this was really hard on the database and drastically impacted the performance of other queries.
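The back-of-the-envelope arithmetic behind those numbers (figures taken from the setup described above):

```python
# Rough load estimate from the figures in this issue.
schedulers = 4
parsing_processes = 32
dag_table_rows = 20_000

# Each scheduler runs its own pool of parsing processes, and each
# process fires one un-indexed UPDATE per file it parses.
concurrent_queries = schedulers * parsing_processes          # 128
rows_scanned_per_wave = concurrent_queries * dag_table_rows  # ~2.56 million
```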

We were able to reduce the impact by lowering parsing_processes, cleaning up old entries in the dag table and increasing min_file_process_interval, but none of these mitigations addresses the root of the problem.
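For reference, those two knobs live in the [scheduler] section of airflow.cfg; the values below are illustrative of the mitigation, not recommendations:

```ini
[scheduler]
# Fewer parallel parsing processes -> fewer concurrent full-table scans
parsing_processes = 4
# Re-parse unchanged files less often (seconds)
min_file_process_interval = 300
```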

We are currently working on a fix which moves this cleanup into the DAG processor manager and eliminates the un-indexed queries; we should be able to submit a preliminary pull request for review in the next few days.
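One possible shape for that fix, sketched below with hypothetical names (the eventual implementation may differ): instead of one UPDATE per file, the processor manager runs a single bulk pass that deactivates every DAG whose source file is no longer on disk.

```python
# Hedged sketch of a single bulk deactivation pass run by the processor
# manager, replacing the per-file UPDATEs. Names and table shape are
# illustrative, not the actual Airflow implementation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dag (dag_id TEXT PRIMARY KEY, fileloc TEXT, is_active INTEGER)"
)
conn.executemany(
    "INSERT INTO dag VALUES (?, ?, ?)",
    [
        ("dag-a", "/dags/a.py", 1),
        ("dag-b", "/dags/b.py", 1),  # file deleted from disk
        ("dag-c", "/dags/c.py", 1),
    ],
)

def deactivate_missing_dag_files(conn, files_still_on_disk):
    """One table pass: deactivate DAGs whose source file no longer exists."""
    placeholders = ",".join("?" for _ in files_still_on_disk)
    conn.execute(
        f"UPDATE dag SET is_active = 0 "
        f"WHERE is_active = 1 AND fileloc NOT IN ({placeholders})",
        list(files_still_on_disk),
    )

deactivate_missing_dag_files(conn, ["/dags/a.py", "/dags/c.py"])
still_active = sorted(
    row[0] for row in conn.execute("SELECT dag_id FROM dag WHERE is_active = 1")
)
```

The key property is that the cost is one scan per cleanup cycle, regardless of how many files were parsed.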

What you expected to happen

Removing DAGs which no longer exist in files should not put so much strain on the database.

How to reproduce

Use a high number of parsing_processes and/or scheduler replicas in an Airflow 2.2+ Environment with many DAGs.

Operating System

Debian GNU/Linux 10 (buster)

Deployment

Other 3rd-party Helm chart

Deployment details

Airflow 2.2.2 on Kubernetes
MySQL 8

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@SamWheating added the area:core and kind:bug labels on Feb 7, 2022
@ashb
Member

ashb commented Feb 7, 2022

It sounds like this would be fixed by adding an index on dag model table, right?

@SamWheating
Contributor Author

Yes, but we can't add an index on the fileloc column because the column is too wide to index (I think it's a VARCHAR(2000) or something)
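For context: InnoDB caps the index key length (3072 bytes with the default DYNAMIC row format), and a utf8mb4 VARCHAR(2000) can need up to 8000 bytes of key, so a plain index is rejected outright. Roughly:

```sql
-- Illustrative: on MySQL 8 / InnoDB this fails, because the utf8mb4
-- VARCHAR(2000) key exceeds the 3072-byte index key limit.
CREATE INDEX idx_dag_fileloc ON dag (fileloc);
-- ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
```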

@ashb
Member

ashb commented Feb 7, 2022

Oh, thanks MySQL.

Ah yes, that's why we've already got the fileloc_hash column on DagCode and SerializedDag tables.

@SamWheating
Contributor Author

> Ah yes, that's why we've already got the fileloc_hash column on DagCode and SerializedDag tables.

In this case would it make sense to add a fileloc_hash column to the DAG table? I can also push my proposed alternate implementation shortly.
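The idea behind a fileloc_hash column is to map the arbitrarily long path onto a fixed-width integer that fits a normal indexed BIGINT column. A sketch of the concept (illustrative only, not the exact hashing used by Airflow's DagCode):

```python
# Illustrative sketch of hashing a file path into a value that fits a
# signed 64-bit (BIGINT) indexed column -- the idea behind the existing
# fileloc_hash columns, not the exact Airflow implementation.
import hashlib
import struct

def fileloc_hash(fileloc: str) -> int:
    """Map an arbitrarily long path to an indexable 64-bit integer."""
    digest = hashlib.sha1(fileloc.encode("utf-8")).digest()
    # Take 8 bytes of the digest, then drop the top bit so the value
    # always fits in a signed BIGINT column.
    return struct.unpack(">Q", digest[:8])[0] >> 1

h = fileloc_hash("/path/to_my/dag.py")
```

Queries then filter on the indexed hash first, with the full fileloc as a tiebreaker for the rare collision.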

@ashb
Member

ashb commented Feb 7, 2022

It might, but if you have another solution that feels less hacky (as I don't like the "split" column all that much), let's take a look

@potiuk
Member

potiuk commented Feb 7, 2022

> Oh thanks Mysql.

A lot.

@SamWheating
Contributor Author

Now that one fix has been proposed (#21399) and I have validated it in our Airflow deployments, what do you think about the proposed implementation? Any preferences between:

  1. Adding an indexed fileloc_hash column to the DAG table and refactoring the existing code to use this index.
  2. Bulk-deactivating stale DAGs from the processor manager (as proposed above).

I am happy to help implement either change but would really like some careful review and discussion of the proposed methods.

@SamWheating
Contributor Author

Closed in #21399
