
hung database transactions on xcom table #24124

Closed
2 tasks done
matthew-kruse opened this issue Jun 2, 2022 · 2 comments
Labels
area:core invalid kind:bug This is a clearly a bug

Comments

@matthew-kruse

Apache Airflow version

2.2.5

What happened

When the xcom table becomes large, deletes such as DELETE FROM xcom WHERE xcom.dag_id = %s AND xcom.task_id = %s AND xcom.execution_date = %s have caused MySQL transactions to hang indefinitely.

mysql> explain DELETE FROM xcom WHERE xcom.dag_id = '' AND xcom.task_id = '' AND xcom.execution_date = '2022-06-02 18:52:00';
+----+-------------+-------+------------+-------+---------------+---------+---------+-------------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref         | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------------+------+----------+-------------+
|  1 | DELETE      | xcom  | NULL       | range | PRIMARY       | PRIMARY | 1504    | const,const | 2590 | 100.00   | Using where |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------------+------+----------+-------------+

What you think should happen instead

We fixed this manually by adding a new index to the xcom table, designed to speed up this delete:

ALTER TABLE xcom ADD INDEX xcom_delete_idx (dag_id, task_id, execution_date)

After we added this index, the deletes would typically scan one record to delete one record, instead of scanning ~2,500 records to delete one.
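The effect of the composite index on the query plan can be sketched in miniature. This is an illustration only: it uses sqlite3 (the issue concerns MySQL, but the planner behavior is analogous), a toy xcom table, and made-up dag/task names; without the index the DELETE requires a scan, and with it the planner does a direct index lookup.

```python
import sqlite3

# Toy stand-in for the Airflow xcom table (sqlite3 for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE xcom (dag_id TEXT, task_id TEXT, execution_date TEXT, value BLOB)"
)
conn.executemany(
    "INSERT INTO xcom (dag_id, task_id, execution_date) VALUES (?, ?, ?)",
    [("dag_a", f"task_{i}", "2022-06-02 18:52:00") for i in range(1000)],
)

def plan(sql):
    # Column 3 of EXPLAIN QUERY PLAN output is the human-readable detail.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

delete = (
    "DELETE FROM xcom WHERE dag_id = 'dag_a' AND task_id = 'task_1' "
    "AND execution_date = '2022-06-02 18:52:00'"
)

before = plan(delete)  # no usable index: the planner must scan the table
print(before)

# The composite index proposed in the issue:
conn.execute(
    "CREATE INDEX xcom_delete_idx ON xcom (dag_id, task_id, execution_date)"
)

after = plan(delete)  # now an index lookup on all three key columns
print(after)
```

The same principle applies to MySQL: with an index covering exactly the columns in the WHERE clause, the DELETE touches only the matching rows instead of a wide primary-key range.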

How to reproduce

Mileage will vary depending on how much activity your Airflow instance gets, but as you scale up DAG runs you tend to get record counts large enough to reproduce this performance slowdown. In our case, the xcom table had about 500k entries when we first started to see this, and it can grow into the millions of records. We typically wipe all data in our Airflow database several times a year, because Airflow's UI stops functioning when record counts in various tables get too large.

Operating System

ubuntu

Versions of Apache Airflow Providers

apache-airflow[mysql]==2.2.5
apache-airflow-providers-http==2.1.2

Deployment

Other Docker-based deployment

Deployment details

Nothing particularly special; we run Airflow in a Docker container. MySQL runs on dedicated hardware, not on the same machine that Airflow resides on.

Anything else

This happens only as record counts in the various airflow tables get higher.

We've had to make a number of schema adjustments for performance reasons, due to locking and MySQL's failure to detect deadlocks properly. The #1 bottleneck for us in Airflow is the database, and at our scale we've had to be aggressive about modifying Airflow's indexes to keep things going. If the Airflow devs want help with database performance, I'm happy to assist.

I am willing to submit a PR for this and other performance issues related to the backing database if someone is willing to walk me through how to do that. I assume you use some sort of database abstraction layer, since you support multiple database backends. I have 10+ years as a DBA supporting MySQL, Postgres, and Mongo, and I've even used SQLite for a few projects, so if you are interested in a database schema review and performance evaluation I'm happy to assist; it will ultimately make my Airflow project better as well.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@matthew-kruse matthew-kruse added area:core kind:bug This is a clearly a bug labels Jun 2, 2022
@boring-cyborg

boring-cyborg bot commented Jun 2, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@potiuk
Member

potiuk commented Jun 2, 2022

In Airflow 2.2 you should use run_id as the index, not execution_date. This changed in Airflow 2.2 as part of AIP-39. Possibly you missed that when adjusting your queries.

Also, in Airflow 2.3.0 the xcom and other tables were further updated to include map_index.

I propose you test it with the right selection query (using run_id), and also make sure that you adjust your DB queries when you migrate to the latest Airflow 2.3 (2.3.2 is likely out tomorrow) and test it there. The DB in Airflow is pretty much an internal detail, so if you use any custom DB queries, they might need adjustments with every migration.
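For illustration, the post-AIP-39 delete keyed by run_id might look like the following. This is a sketch only: the column names assume the Airflow 2.2+ xcom schema, so verify them against your own migrated database before relying on this.

```sql
-- Sketch: in Airflow 2.2+ the xcom table is keyed by run_id
-- rather than execution_date (per AIP-39).
DELETE FROM xcom
WHERE xcom.dag_id = %s
  AND xcom.task_id = %s
  AND xcom.run_id = %s;
```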

Also, you might be happy to learn that Airflow 2.3 has a new "airflow db clean" command that you can likely use to replace your manual database retention: https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#clean. This might solve your "maintenance" issue with custom queries, because it will always use the right queries to clean old data, and you can implement your own retention schedule and criteria.
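A possible invocation of that command for the xcom table, as a sketch (flag names are taken from the Airflow 2.3 CLI reference linked above; the timestamp is an example, and --dry-run previews what would be deleted without touching data):

```shell
# Preview which xcom rows older than the cutoff would be cleaned,
# then drop --dry-run to actually perform the deletion.
airflow db clean \
    --tables xcom \
    --clean-before-timestamp '2022-05-01 00:00:00' \
    --dry-run
```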

@potiuk potiuk closed this as completed Jun 2, 2022
@potiuk potiuk added the invalid label Jun 2, 2022