Apache Airflow version
2.2.5
What happened
When the xcom table becomes large, deletes like DELETE FROM xcom WHERE xcom.dag_id = %s AND xcom.task_id = %s AND xcom.execution_date = %s have caused MySQL transactions to hang indefinitely.
mysql> explain DELETE FROM xcom WHERE xcom.dag_id = '' AND xcom.task_id = '' AND xcom.execution_date = '2022-06-02 18:52:00';
+----+-------------+-------+------------+-------+---------------+---------+---------+-------------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref         | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------------+------+----------+-------------+
|  1 | DELETE      | xcom  | NULL       | range | PRIMARY       | PRIMARY | 1504    | const,const | 2590 |   100.00 | Using where |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------------+------+----------+-------------+
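For context when reproducing, it helps to confirm which indexes the optimizer has to choose from, and how many rows actually match the predicate versus the ~2590 the plan estimates. A minimal check (standard MySQL, nothing Airflow-specific; the literals mirror the EXPLAIN above):
-- List the indexes available for the DELETE above; in a stock schema
-- only the composite primary key covers these columns.
SHOW INDEX FROM xcom;
-- Rows actually matching the predicate, to compare against the
-- optimizer's rows estimate in the EXPLAIN output.
SELECT COUNT(*) FROM xcom
WHERE dag_id = '' AND task_id = '' AND execution_date = '2022-06-02 18:52:00';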
What you think should happen instead
We manually fixed this by adding a new index to the xcom table, designed to speed up this delete:
ALTER TABLE xcom ADD INDEX xcom_delete_idx (dag_id, task_id, execution_date);
After we added this index, the deletes would typically scan 1 record to delete 1 record, instead of scanning roughly 2,500 records to delete 1.
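As a sanity check (a sketch only; the plan MySQL picks can vary with table statistics), re-running the earlier EXPLAIN after the ALTER should show the optimizer switching to the new index:
-- With the composite index in place we would expect key = xcom_delete_idx
-- and a rows estimate near 1 instead of ~2590.
EXPLAIN DELETE FROM xcom
WHERE xcom.dag_id = '' AND xcom.task_id = '' AND xcom.execution_date = '2022-06-02 18:52:00';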
How to reproduce
Mileage will vary depending on how much activity your Airflow instance gets, but as you scale up DAG runs you tend to accumulate record counts big enough to reproduce this slowdown. In our case, the xcom table had 500k entries when we first started to see this, and it can reach millions of records. We typically blow away all data in our Airflow database several times a year, as Airflow's UI stops functioning properly once record counts in various tables get too large.
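If you want to reproduce this against an existing installation, one approach (a sketch, assuming direct read access to the metadata database) is to find the (dag_id, task_id) pairs with the most xcom rows and run the EXPLAIN above against one of them:
-- Overall table size.
SELECT COUNT(*) FROM xcom;
-- Pairs with the most xcom rows; deletes scoped to these pairs are the
-- ones most likely to show the wide primary-key range scan.
SELECT dag_id, task_id, COUNT(*) AS n
FROM xcom
GROUP BY dag_id, task_id
ORDER BY n DESC
LIMIT 10;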
Operating System
ubuntu
Versions of Apache Airflow Providers
apache-airflow[mysql]==2.2.5
apache-airflow-providers-http==2.1.2
Deployment
Other Docker-based deployment
Deployment details
Nothing particularly special: we run Airflow in a Docker container. MySQL runs on dedicated hardware, not on the same machine that Airflow resides on.
Anything else
This happens only as record counts in the various Airflow tables get higher.
We've had to make a number of schema adjustments for performance reasons, due to locking and MySQL's failure to detect deadlocks properly. The #1 bottleneck for us on Airflow is the database, and at our scale we've had to be aggressive about modifying Airflow's indexes to keep things going. If the Airflow devs want help with database performance, I'm happy to assist.
I am willing to submit a PR for this, and for other performance work on the backing database, if someone is willing to walk me through how to do that. I assume you are using some sort of database abstraction layer, since you support multiple database backends. I've got 10+ years as a DBA supporting MySQL, Postgres, and MongoDB, and I've even used SQLite for a few projects, so if you are interested in a database schema review and performance evaluation, I'm happy to assist; it will ultimately make my Airflow project better as well.
Are you willing to submit PR?
Code of Conduct
In Airflow 2.2 you should use run_id as the index key, not execution_date. This changed in Airflow 2.2 as part of AIP-39; possibly you missed that when adjusting your queries.
Also, in Airflow 2.3.0 the xcom and other tables were further reworked to include map_index.
I propose you test it with the right selection query (using run_id), and also make sure that you adjust your DB queries when you migrate to the latest Airflow 2.3 (2.3.2 is likely out tomorrow) and test it there. The DB in Airflow is pretty much an internal detail, so if you use any direct DB queries, they might need adjustments with every migration.
Also, you might be happy to learn that Airflow 2.3 has the new "airflow db clean" command, which you can likely use to replace your manual database retention: https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#clean. This might solve your "maintenance" issue for custom queries, because it will always use the right queries to clean old data, and you can implement your own retention schedule and criteria.
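For anyone following along, a sketch of the scoped delete keyed on run_id rather than execution_date, as the comment above suggests (the %s placeholders are illustrative; check the xcom schema of your exact Airflow version, since the columns changed across 2.2/2.3):
-- run_id-based variant of the delete from the report; on 2.3+ the xcom
-- table also carries map_index, which a covering index would need to include.
DELETE FROM xcom
WHERE dag_id = %s
  AND task_id = %s
  AND run_id = %s;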