Add missing indexes on dag_version_id columns for db cleanup performance #60307
Adds indexes on task_instance.dag_version_id and dag_run.created_dag_version_id to speed up the airflow db clean command when cleaning dag_version records. Without these indexes, the cleanup operation performs full table scans on both tables for each batch, causing ~6 minutes per batch on tables with 300K+ rows. With the indexes, the same operation completes in under 2 minutes total. Fixes apache#60145
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
- Renamed migration from 0098 to 0099 to chain after new upstream migration
- Updated down_revision to e79fc784f145 (timetable migration)
- Updated migrations-ref.rst with correct chain
- Our migration 62fb1d0a1252 is now the new HEAD for 3.2.0
Pull request overview
This pull request addresses a performance issue with the airflow db clean -t dag_version command by adding database indexes on foreign key columns that reference dag_version.id. Without these indexes, the cleanup process was performing full table scans resulting in ~6 minutes per batch on tables with 300K+ rows.
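The scan-versus-index difference described above can be reproduced on a toy schema. The following is a minimal sketch, assuming SQLite (via Python's standard library) as a stand-in for PostgreSQL and a two-column simplification of the real task_instance table; only the index name and column names come from this PR, everything else is illustrative.

```python
import sqlite3

# Toy schema: a parent table and one child table with an FK column.
# This is a simplification, not the real Airflow schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dag_version (id INTEGER PRIMARY KEY);
    CREATE TABLE task_instance (
        id INTEGER PRIMARY KEY,
        dag_version_id INTEGER REFERENCES dag_version(id)
    );
    """
)


def plan_for_lookup(connection):
    """Return SQLite's query plan for a lookup by dag_version_id."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT 1 FROM task_instance WHERE dag_version_id = ?",
        (1,),
    ).fetchall()
    # The human-readable plan text is the fourth column of each row.
    return " ".join(row[3] for row in rows)


plan_before = plan_for_lookup(conn)  # no index yet: whole-table scan

conn.execute(
    "CREATE INDEX idx_task_instance_dag_version_id "
    "ON task_instance (dag_version_id)"
)
plan_after = plan_for_lookup(conn)  # the planner can now use the index

print(plan_before)
print(plan_after)
```

The exact plan wording differs between SQLite versions and between database engines, so on PostgreSQL the equivalent check would be an EXPLAIN on the same lookup against the real tables.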
Changes:
- Added migration 0099 to create indexes on task_instance.dag_version_id and dag_run.created_dag_version_id
- Updated ORM models (TaskInstance and DagRun) to include the new indexes in their __table_args__
- Updated the database revision head mapping to point to the new migration
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| airflow-core/src/airflow/utils/db.py | Updated the 3.2.0 revision head to point to the new migration |
| airflow-core/src/airflow/models/taskinstance.py | Added index on dag_version_id column to `__table_args__` |
| airflow-core/src/airflow/models/dagrun.py | Added index on created_dag_version_id column to `__table_args__` |
| airflow-core/src/airflow/migrations/versions/0099_3_2_0_add_dag_version_id_indexes_for_db_cleanup.py | New migration that creates the indexes with proper upgrade/downgrade logic |
| airflow-core/docs/migrations-ref.rst | Updated migration reference documentation to include the new migration as head |
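The upgrade/downgrade pairing that the table mentions can be sketched as follows. This is not the migration from the PR, which is an Alembic script; it is a hedged illustration using raw SQL through Python's sqlite3 module, with a simplified schema. The index names are the ones given in the PR description, while the `INDEXES` mapping, the `index_names` helper, and the `IF NOT EXISTS` / `IF EXISTS` guards are assumptions made for the example.

```python
import sqlite3

# Index name -> (table, column). Names are from the PR description;
# the mapping itself is a hypothetical helper for this sketch.
INDEXES = {
    "idx_task_instance_dag_version_id": ("task_instance", "dag_version_id"),
    "idx_dag_run_created_dag_version_id": ("dag_run", "created_dag_version_id"),
}


def upgrade(conn):
    """Create both indexes (the guard makes re-running harmless)."""
    for name, (table, column) in INDEXES.items():
        conn.execute(f"CREATE INDEX IF NOT EXISTS {name} ON {table} ({column})")


def downgrade(conn):
    """Drop both indexes, reversing upgrade()."""
    for name in INDEXES:
        conn.execute(f"DROP INDEX IF EXISTS {name}")


def index_names(conn):
    """List which of the indexes in INDEXES currently exist."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"
    ).fetchall()
    return sorted(row[0] for row in rows if row[0] in INDEXES)


# Simplified stand-in tables, not the real Airflow schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dag_version (id INTEGER PRIMARY KEY);
    CREATE TABLE task_instance (id INTEGER PRIMARY KEY, dag_version_id INTEGER);
    CREATE TABLE dag_run (id INTEGER PRIMARY KEY, created_dag_version_id INTEGER);
    """
)

upgrade(conn)
after_upgrade = index_names(conn)
downgrade(conn)
after_downgrade = index_names(conn)
```

Keeping the index definitions in one mapping means upgrade and downgrade cannot drift apart, which is the property a reviewer usually checks for in a migration like this.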
…g_version_id_indexes_for_db_cleanup.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
What does this PR do?
Adds database indexes on task_instance.dag_version_id and dag_run.created_dag_version_id to fix the slow performance of the airflow db clean -t dag_version command.

Why is this needed?
When running airflow db clean -t dag_version --batch-size 1000, the cleanup process was taking ~6 minutes per batch on tables with 300K+ rows. The root cause was missing indexes on the foreign key columns that reference dag_version.id.

The cleanup code in db_cleanup.py defines dag_version with dependent_tables=["task_instance", "dag_run"], meaning every delete operation needs to check both tables for FK violations. Without indexes, PostgreSQL performs full table scans.

What changed?
- New migration (0098_3_2_0_add_dag_version_id_indexes_for_db_cleanup.py, renamed to 0099 in a later commit):
  - idx_task_instance_dag_version_id on task_instance(dag_version_id)
  - idx_dag_run_created_dag_version_id on dag_run(created_dag_version_id)
- Updated ORM models: Added indexes to __table_args__ for schema consistency

Performance Impact
db clean -t dag_version (300K rows, batch=1000)

Fixes #60145
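To make the foreign-key check from "Why is this needed?" concrete, here is a small sketch of what the database has to do when a dag_version row is deleted. It assumes SQLite with foreign-key enforcement switched on and default FK actions; the real Airflow constraints run on PostgreSQL and may use different ON DELETE rules, and the `try_delete` helper is hypothetical. Either way, the engine must look up matching rows in each child table, which is the lookup the new indexes serve.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE dag_version (id INTEGER PRIMARY KEY);
    CREATE TABLE task_instance (
        id INTEGER PRIMARY KEY,
        dag_version_id INTEGER REFERENCES dag_version(id)
    );
    CREATE TABLE dag_run (
        id INTEGER PRIMARY KEY,
        created_dag_version_id INTEGER REFERENCES dag_version(id)
    );
    INSERT INTO dag_version (id) VALUES (1), (2);
    -- Version 1 is still referenced by a task instance; version 2 is not.
    INSERT INTO task_instance (id, dag_version_id) VALUES (10, 1);
    """
)


def try_delete(connection, version_id):
    """Attempt to delete one dag_version row; report whether it succeeded."""
    try:
        connection.execute("DELETE FROM dag_version WHERE id = ?", (version_id,))
        return True
    except sqlite3.IntegrityError:
        # The FK check found a child row that still points at this version.
        return False


referenced_deleted = try_delete(conn, 1)
unreferenced_deleted = try_delete(conn, 2)
remaining = [row[0] for row in conn.execute("SELECT id FROM dag_version ORDER BY id")]
```

Each delete triggers a search of both child tables by their version column, so the cost of that search is paid once per deleted row in every batch.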