Closed
Labels
affected_version:3.1, area:Scheduler, area:backfill, area:core, kind:bug, priority:high
Description
Apache Airflow version
3.1.3
If "Other Airflow 2/3 version" selected, which one?
No response
What happened?
While testing a backfill job in an HA Airflow deployment using the default example_astronaut DAG (schedule_interval = 0 0 * * *), I observed two issues:
- Incorrect backfill duration when the number of runs is large (>500).
The backfill job shows an incorrect duration. This appears to be caused by a race condition where the system checks for queued or running DAG runs before those runs are committed to the database. As a result, it prematurely assumes that no runs exist and closes the backfill job.
- Concurrent runs in HA mode even with max_active_runs=1.
During a large backfill (1,059 runs), I observed multiple DAG runs executing concurrently, even though the DAG’s max_active_runs was explicitly set to 1. This behaviour was not seen in live DAG runs, only in large-scale backfills.
After all runs were completed, I verified the overlap by querying the metadata database with the following SQL:
SELECT
    r1.run_id AS run1,
    r2.run_id AS run2,
    r1.start_date AS run1_start,
    r1.end_date AS run1_end,
    r2.start_date AS run2_start,
    r2.end_date AS run2_end,
    EXTRACT(EPOCH FROM (r2.start_date - r1.start_date)) AS time_diff_seconds,
    EXTRACT(EPOCH FROM (
        LEAST(r1.end_date, r2.end_date) -
        GREATEST(r1.start_date, r2.start_date)
    )) AS overlap_seconds
FROM dag_run r1
JOIN dag_run r2 ON r1.backfill_id = r2.backfill_id AND r1.id < r2.id
WHERE r1.backfill_id = :backfill_id
    AND r1.start_date IS NOT NULL
    AND r1.end_date IS NOT NULL
    AND r2.start_date IS NOT NULL
    AND r2.end_date IS NOT NULL
    AND r1.start_date < r2.end_date
    AND r2.start_date < r1.end_date
ORDER BY r1.start_date;
run1 |run2 |run1_start |run1_end |run2_start |run2_end |time_diff_seconds|overlap_seconds|
-----------------------------------+-----------------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------+---------------+
backfill__2024-02-08T00:00:00+00:00|backfill__2024-02-28T00:00:00+00:00|2025-11-25 10:22:20.077 +0000|2025-11-25 10:22:24.808 +0000|2025-11-25 10:22:20.077 +0000|2025-11-25 10:22:26.313 +0000| 0.000088| 4.730966|
backfill__2024-03-14T00:00:00+00:00|backfill__2024-04-03T00:00:00+00:00|2025-11-25 10:26:07.912 +0000|2025-11-25 10:26:12.472 +0000|2025-11-25 10:26:07.914 +0000|2025-11-25 10:26:12.482 +0000| 0.002002| 4.557969|
backfill__2024-12-06T00:00:00+00:00|backfill__2024-12-26T00:00:00+00:00|2025-11-25 11:00:41.147 +0000|2025-11-25 11:00:45.795 +0000|2025-11-25 11:00:41.155 +0000|2025-11-25 11:00:45.084 +0000| 0.008247| 3.929261|
backfill__2024-12-20T00:00:00+00:00|backfill__2025-01-10T00:00:00+00:00|2025-11-25 11:02:46.481 +0000|2025-11-25 11:02:50.366 +0000|2025-11-25 11:02:46.482 +0000|2025-11-25 11:03:04.365 +0000| 0.000780| 3.884634|
backfill__2024-12-28T00:00:00+00:00|backfill__2025-01-18T00:00:00+00:00|2025-11-25 11:03:37.608 +0000|2025-11-25 11:04:11.477 +0000|2025-11-25 11:03:37.616 +0000|2025-11-25 11:03:39.634 +0000| 0.008725| 2.017774|
backfill__2025-03-08T00:00:00+00:00|backfill__2025-03-28T00:00:00+00:00|2025-11-25 11:13:15.325 +0000|2025-11-25 11:13:20.388 +0000|2025-11-25 11:13:15.330 +0000|2025-11-25 11:13:19.314 +0000| 0.004477| 3.984624|
backfill__2025-04-01T00:00:00+00:00|backfill__2025-04-21T00:00:00+00:00|2025-11-25 11:16:16.014 +0000|2025-11-25 11:16:21.355 +0000|2025-11-25 11:16:16.024 +0000|2025-11-25 11:16:20.349 +0000| 0.010557| 4.324642|
backfill__2025-04-11T00:00:00+00:00|backfill__2025-05-02T00:00:00+00:00|2025-11-25 11:17:41.027 +0000|2025-11-25 11:18:00.220 +0000|2025-11-25 11:17:41.027 +0000|2025-11-25 11:18:00.230 +0000| 0.000231| 19.192360|
backfill__2025-05-14T00:00:00+00:00|backfill__2025-06-03T00:00:00+00:00|2025-11-25 11:21:03.877 +0000|2025-11-25 11:21:12.330 +0000|2025-11-25 11:21:03.877 +0000|2025-11-25 11:21:12.341 +0000| -0.000384| 8.452912|
backfill__2025-09-21T00:00:00+00:00|backfill__2025-10-11T00:00:00+00:00|2025-11-25 11:37:37.558 +0000|2025-11-25 11:37:41.712 +0000|2025-11-25 11:37:37.559 +0000|2025-11-25 11:37:41.723 +0000| 0.001120| 4.152781|
backfill__2025-10-12T00:00:00+00:00|backfill__2025-11-02T00:00:00+00:00|2025-11-25 11:41:10.639 +0000|2025-11-25 11:41:14.470 +0000|2025-11-25 11:41:10.643 +0000|2025-11-25 11:41:45.070 +0000| 0.004594| 3.827172|
This revealed 11 overlapping pairs of runs (one row per pair); the count varies from one backfill job to another.
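For readers who want to check a pair by hand, the overlap test used in the query can be sketched in Python. This is only an illustration of the same predicate (`start1 < end2 AND start2 < end1`) and the same `LEAST`/`GREATEST` arithmetic; the helper name `overlap_seconds` is made up here, and the timestamps are the millisecond-rounded values displayed in one result row, so the result differs slightly from the microsecond-precision figure in the table.

```python
from datetime import datetime

def overlap_seconds(start1, end1, start2, end2):
    """Seconds during which both intervals are active, or 0.0 if none.

    Mirrors LEAST(end1, end2) - GREATEST(start1, start2) from the SQL above.
    """
    latest_start = max(start1, start2)
    earliest_end = min(end1, end2)
    return max(0.0, (earliest_end - latest_start).total_seconds())

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Values as displayed (rounded to milliseconds) for the pair
# backfill__2025-04-11 / backfill__2025-05-02.
run1_start = datetime.strptime("2025-11-25 11:17:41.027", FMT)
run1_end = datetime.strptime("2025-11-25 11:18:00.220", FMT)
run2_start = datetime.strptime("2025-11-25 11:17:41.027", FMT)
run2_end = datetime.strptime("2025-11-25 11:18:00.230", FMT)

overlap = overlap_seconds(run1_start, run1_end, run2_start, run2_end)
print(f"{overlap:.3f}")  # 19.193
```

The two runs share almost their entire lifetime, which is what a violation of max_active_runs=1 would look like.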
I suspect the root cause is a race condition while querying or scheduling queued backfill runs in HA environments.
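To make the suspected failure mode concrete, here is a toy model of a check-then-act race between two schedulers. It is not Airflow code and makes no claim about how the scheduler actually queries or locks rows; `FakeDb`, `unsafe_interleaving` and `serialized` are hypothetical names used only to show how two readers that both see "0 running" can each start a run.

```python
class FakeDb:
    """Hypothetical stand-in for the metadata database (not Airflow's schema)."""

    def __init__(self):
        self.running = 0

    def count_running(self):
        return self.running

    def start_run(self):
        self.running += 1


def unsafe_interleaving(db, max_active_runs=1):
    # Both schedulers read the running count before either one writes.
    seen_by_a = db.count_running()
    seen_by_b = db.count_running()
    if seen_by_a < max_active_runs:
        db.start_run()
    if seen_by_b < max_active_runs:
        db.start_run()
    return db.running


def serialized(db, max_active_runs=1):
    # Each scheduler's check and write happen as one unit,
    # as they would if the check were protected by a lock.
    for _scheduler in ("A", "B"):
        if db.count_running() < max_active_runs:
            db.start_run()
    return db.running


print(unsafe_interleaving(FakeDb()))  # 2
print(serialized(FakeDb()))           # 1
```

If something like the first interleaving is possible for backfill runs, it would explain start times that differ by only a few milliseconds, as in the table above.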
What you think should happen instead?
- The backfill duration should accurately reflect the total runtime, regardless of run count.
- In HA mode, max_active_runs=1 should be strictly enforced for backfills, ensuring that only one DAG run executes at a time.
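The second expectation can be stated as an invariant: at no instant should more than max_active_runs runs be active. A minimal sketch of checking that over exported (start, end) pairs follows; `peak_concurrency` is a hypothetical helper, and the intervals are plain numbers (seconds) chosen for illustration, not data from this backfill.

```python
def peak_concurrency(intervals):
    """Return the largest number of intervals active at the same instant."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a run starts
        events.append((end, -1))    # a run ends
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back runs are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


back_to_back = [(0.0, 5.0), (5.0, 9.0)]
overlapping = [(0.0, 5.0), (0.0, 6.0), (7.0, 8.0)]

print(peak_concurrency(back_to_back))  # 1
print(peak_concurrency(overlapping))   # 2
```

With max_active_runs=1, any result above 1 for a backfill's runs would indicate the behaviour described in this issue.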
How to reproduce
- Deploy Airflow in High Availability (HA) mode.
- Use the default example_astronaut DAG (or any DAG) with max_active_runs=1 and a schedule of 0 0 * * *.
- Create a backfill job with a large time range, e.g., start_date = 2023-01-01 14:48:00, end_date = 2025-11-24 14:48:00. (This produces ~1,058 runs.)
- Observe:
  - The backfill duration reported in the UI/logs is incorrect when run count > 500.
  - Multiple DAG runs may execute concurrently even though max_active_runs=1.
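As a sanity check on the run count, the number of daily (0 0 * * *) ticks inside the range above can be counted directly. This is a rough sketch that simply counts midnights between the two timestamps; it assumes nothing about how Airflow maps ticks to data intervals at the range boundaries, so the real count may differ by one. `count_daily_ticks` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def count_daily_ticks(start: datetime, end: datetime) -> int:
    """Count midnights t such that start <= t <= end."""
    first = start.replace(hour=0, minute=0, second=0, microsecond=0)
    if first < start:
        first += timedelta(days=1)  # first midnight at or after start
    if first > end:
        return 0
    return (end - first).days + 1


start = datetime(2023, 1, 1, 14, 48)
end = datetime(2025, 11, 24, 14, 48)
print(count_daily_ticks(start, end))  # 1058
```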
Operating System
debian
Versions of Apache Airflow Providers
No response
Deployment
Astronomer
Deployment details
No response
Anything else?
- This issue was observed on an HA setup (multiple schedulers).
- The behaviour appears to be race-condition dependent and inconsistent on smaller backfills.
- Increasing the backfill completion check frequency may mitigate the first issue.
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct