
Stale DAG Deactivation in DAG Processor is extremely hard on the database in environments with many DAGs #21397

Closed
SamWheating opened this issue Feb 7, 2022 · 8 comments
Labels: area:core, kind:bug

Comments

@SamWheating
Contributor

SamWheating commented Feb 7, 2022

Apache Airflow version

2.2.2

What happened

When we upgraded from Airflow 2.1.2 to 2.2.2 we noticed that our MySQL instance was running near 100% CPU utilization while it was closer to 20% on Airflow 2.1.2.

After a long investigation, we found that the queries introduced in #17121 are extremely hard on the database, especially when there are multiple schedulers, a high value of parsing_processes and a large number of DAGs.

Our setup is as follows:

  • 4x Schedulers
  • ~20k rows in the dag table
  • ~10k DAG files
  • parsing_processes=32

So each time a DAG is parsed, a query like this will be run:

UPDATE dag
SET is_active=0
WHERE dag.fileloc = '/path/to_my/dag.py'
AND dag.is_active = 1
AND dag.dag_id NOT IN ('my-dag-1', 'my-dag-2');

And because the dag table isn't indexed on fileloc, the query ends up doing a full table scan (or nearly one), and this is repeated for every single file which is processed.
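To make the pattern concrete, here is a small self-contained sketch of the per-file deactivation query using a stand-in SQLite database (the real deployment uses MySQL, and the table shape here is illustrative, not the actual Airflow schema):

```python
# Sketch of the per-file deactivation pattern against a stand-in SQLite
# database. Table shape is illustrative, not the actual Airflow schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dag (dag_id TEXT PRIMARY KEY, fileloc TEXT, is_active INTEGER)"
)
conn.executemany(
    "INSERT INTO dag VALUES (?, ?, ?)",
    [
        ("my-dag-1", "/path/to_my/dag.py", 1),  # still defined in the file
        ("old-dag", "/path/to_my/dag.py", 1),   # removed from the file
        ("other-dag", "/path/to/other.py", 1),  # lives in a different file
    ],
)
# With no index on fileloc, the WHERE clause below forces a scan of the
# whole dag table -- and one of these runs per parsed file.
conn.execute(
    "UPDATE dag SET is_active = 0 "
    "WHERE fileloc = ? AND is_active = 1 AND dag_id NOT IN (?, ?)",
    ("/path/to_my/dag.py", "my-dag-1", "my-dag-2"),
)
remaining_active = sorted(
    row[0] for row in conn.execute("SELECT dag_id FROM dag WHERE is_active = 1")
)
```

Only "old-dag" is deactivated: it shares the fileloc but no longer appears in the parsed file's DAG list.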

When I added these queries, I tested the change in a local breeze environment with a relatively small number of DAGs and thus did not notice the performance implications.

At our scale and configuration, we have approximately 128 of these poorly performing queries running in parallel, each scanning approximately 20,000 rows. Understandably, this was really hard on the database and drastically impacted the performance of other queries.
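The back-of-the-envelope arithmetic behind those numbers (figures taken from the setup described above):

```python
# Rough load estimate from the figures in this issue.
schedulers = 4
parsing_processes = 32
dag_table_rows = 20_000

# Each scheduler runs its own pool of parsing processes, and each
# process fires one un-indexed UPDATE per file it parses.
concurrent_queries = schedulers * parsing_processes          # 128
rows_scanned_per_wave = concurrent_queries * dag_table_rows  # ~2.56 million
```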

We were able to reduce the impact by lowering parsing_processes, cleaning up old entries in the dag table and increasing min_file_process_interval, but none of these mitigations addresses the root of the problem.
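For reference, those two knobs live in the [scheduler] section of airflow.cfg; the values below are illustrative of the mitigation, not recommendations:

```ini
[scheduler]
# Fewer parallel parsing processes -> fewer concurrent full-table scans
parsing_processes = 4
# Re-parse unchanged files less often (seconds)
min_file_process_interval = 300
```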

We are currently working on a fix which moves this cleanup into the DAG processor manager and eliminates the un-indexed queries; we should be able to submit a preliminary pull request for review in the next few days.
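One possible shape for that fix, sketched below with hypothetical names (the eventual implementation may differ): instead of one UPDATE per file, the processor manager runs a single bulk pass that deactivates every DAG whose source file is no longer on disk.

```python
# Hedged sketch of a single bulk deactivation pass run by the processor
# manager, replacing the per-file UPDATEs. Names and table shape are
# illustrative, not the actual Airflow implementation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dag (dag_id TEXT PRIMARY KEY, fileloc TEXT, is_active INTEGER)"
)
conn.executemany(
    "INSERT INTO dag VALUES (?, ?, ?)",
    [
        ("dag-a", "/dags/a.py", 1),
        ("dag-b", "/dags/b.py", 1),  # file deleted from disk
        ("dag-c", "/dags/c.py", 1),
    ],
)

def deactivate_missing_dag_files(conn, files_still_on_disk):
    """One table pass: deactivate DAGs whose source file no longer exists."""
    placeholders = ",".join("?" for _ in files_still_on_disk)
    conn.execute(
        f"UPDATE dag SET is_active = 0 "
        f"WHERE is_active = 1 AND fileloc NOT IN ({placeholders})",
        list(files_still_on_disk),
    )

deactivate_missing_dag_files(conn, ["/dags/a.py", "/dags/c.py"])
still_active = sorted(
    row[0] for row in conn.execute("SELECT dag_id FROM dag WHERE is_active = 1")
)
```

The key property is that the cost is one scan per cleanup cycle, regardless of how many files were parsed.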

What you expected to happen

Removing DAGs which no longer exist in files should not put so much strain on the database.

How to reproduce

Use a high number of parsing_processes and/or scheduler replicas in an Airflow 2.2+ Environment with many DAGs.

Operating System

Debian GNU/Linux 10 (buster)

Deployment

Other 3rd-party Helm chart

Deployment details

Airflow 2.2.2 on Kubernetes
MySQL 8

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@SamWheating added the area:core and kind:bug labels on Feb 7, 2022
@ashb
Member

ashb commented Feb 7, 2022

It sounds like this would be fixed by adding an index on dag model table, right?

@SamWheating
Contributor Author

Yes, but we can't add an index on the fileloc column because the column is too wide to index (I think it's a VARCHAR(2000) or something)
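For context: InnoDB caps the index key length (3072 bytes with the default DYNAMIC row format), and a utf8mb4 VARCHAR(2000) can need up to 8000 bytes of key, so a plain index is rejected outright. Roughly:

```sql
-- Illustrative: on MySQL 8 / InnoDB this fails, because the utf8mb4
-- VARCHAR(2000) key exceeds the 3072-byte index key limit.
CREATE INDEX idx_dag_fileloc ON dag (fileloc);
-- ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
```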

@ashb
Member

ashb commented Feb 7, 2022

Oh, thanks MySQL.

Ah yes, that's why we've already got the fileloc_hash column on DagCode and SerializedDag tables.

@SamWheating
Contributor Author

> Ah yes, that's why we've already got the fileloc_hash column on DagCode and SerializedDag tables.

In this case would it make sense to add a fileloc_hash column to the DAG table? I can also push my proposed alternate implementation shortly.
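The idea behind a fileloc_hash column is to map the arbitrarily long path onto a fixed-width integer that fits a normal indexed BIGINT column. A sketch of the concept (illustrative only, not the exact hashing used by Airflow's DagCode):

```python
# Illustrative sketch of hashing a file path into a value that fits a
# signed 64-bit (BIGINT) indexed column -- the idea behind the existing
# fileloc_hash columns, not the exact Airflow implementation.
import hashlib
import struct

def fileloc_hash(fileloc: str) -> int:
    """Map an arbitrarily long path to an indexable 64-bit integer."""
    digest = hashlib.sha1(fileloc.encode("utf-8")).digest()
    # Take 8 bytes of the digest, then drop the top bit so the value
    # always fits in a signed BIGINT column.
    return struct.unpack(">Q", digest[:8])[0] >> 1

h = fileloc_hash("/path/to_my/dag.py")
```

Queries then filter on the indexed hash first, with the full fileloc as a tiebreaker for the rare collision.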

@ashb
Member

ashb commented Feb 7, 2022

It might, but if you have another solution that feels less hacky (as I don't like the "split" column all that much), let's take a look

@potiuk
Member

potiuk commented Feb 7, 2022

> Oh thanks Mysql.

A lot.

@SamWheating
Contributor Author

Now that one fix has been proposed (#21399) and I have validated it in our Airflow deployments, what do you think about the proposed implementation? Any preferences between:

  1. Adding an indexed fileloc_hash column to the DAG table and refactoring the existing code to use this index.
  2. Bulk-deactivating stale DAGs from the processor manager (as proposed above).

I am happy to help implement either change but would really like some careful review and discussion of the proposed methods.

@SamWheating
Contributor Author

Closed in #21399
