Backfill Jobs in HA Mode Show Incorrect Duration and Allow Concurrent Runs Despite max_active_runs=1 #58752

@manipatnam

Description

Apache Airflow version

3.1.3

If "Other Airflow 2/3 version" selected, which one?

No response

What happened?

While testing a backfill job in an HA Airflow deployment using the default example_astronaut DAG (schedule = 0 0 * * *), I observed two issues:

  1. Incorrect backfill duration when the number of runs is large (>500).
    The duration reported for the backfill job is wrong. This appears to be caused by a race condition: the system checks for queued or running DAG runs before those runs are committed to the database, so it prematurely concludes that no runs remain and closes the backfill job.
  2. Concurrent runs in HA mode even with max_active_runs=1.
    During a large backfill (1,059 runs), I observed multiple DAG runs executing concurrently, even though the DAG’s max_active_runs was explicitly set to 1. This behaviour was not seen in live DAG runs, only in large-scale backfills.
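
The suspected mechanism behind the first issue (a completion check that reads before the new runs are committed) can be illustrated with a toy example. This is only a sketch under stated assumptions: it uses a throwaway SQLite file and a hypothetical table named toy_dag_run, not Airflow's schema or scheduler code, and it does not establish that Airflow behaves this way.

```python
import os
import sqlite3
import tempfile

# Hypothetical, simplified stand-in for the metadata database.
# This is NOT Airflow's schema or scheduler code.
db_path = os.path.join(tempfile.mkdtemp(), "toy_metadata.db")

writer = sqlite3.connect(db_path)   # plays the process creating backfill runs
checker = sqlite3.connect(db_path)  # plays the process checking for completion

writer.execute("CREATE TABLE toy_dag_run (run_id TEXT, state TEXT)")
writer.commit()


def active_runs(conn):
    """Count runs that are queued or running, as seen by this connection."""
    rows = conn.execute(
        "SELECT COUNT(*) FROM toy_dag_run WHERE state IN ('queued', 'running')"
    ).fetchall()
    return rows[0][0]


# The writer inserts a queued run but has not committed yet.
writer.execute("INSERT INTO toy_dag_run VALUES ('backfill__2024-02-08', 'queued')")

# The checker cannot see the uncommitted row, so a completion check made at
# this moment would wrongly conclude that no work remains.
seen_before_commit = active_runs(checker)

writer.commit()

# Once the transaction is committed, the queued run becomes visible.
seen_after_commit = active_runs(checker)

print(seen_before_commit, seen_after_commit)

writer.close()
checker.close()
```

If the real completion check runs in the window between the insert and the commit, it would see zero active runs, which is consistent with the backfill being closed early.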

After all runs were completed, I verified the overlap by querying the metadata database with the following SQL:

SELECT
    r1.run_id AS run1,
    r2.run_id AS run2,
    r1.start_date AS run1_start,
    r1.end_date AS run1_end,
    r2.start_date AS run2_start,
    r2.end_date AS run2_end,
    EXTRACT(EPOCH FROM (r2.start_date - r1.start_date)) AS time_diff_seconds,
    EXTRACT(EPOCH FROM (
        LEAST(r1.end_date, r2.end_date) -
        GREATEST(r1.start_date, r2.start_date)
    )) AS overlap_seconds
FROM dag_run r1
JOIN dag_run r2 ON r1.backfill_id = r2.backfill_id AND r1.id < r2.id
WHERE r1.backfill_id = :backfill_id
    AND r1.start_date IS NOT NULL
    AND r1.end_date IS NOT NULL
    AND r2.start_date IS NOT NULL
    AND r2.end_date IS NOT NULL
    AND r1.start_date < r2.end_date
    AND r2.start_date < r1.end_date
ORDER BY r1.start_date;

Result:
run1                               |run2                               |run1_start                   |run1_end                     |run2_start                   |run2_end                     |time_diff_seconds|overlap_seconds|
-----------------------------------+-----------------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------+---------------+
backfill__2024-02-08T00:00:00+00:00|backfill__2024-02-28T00:00:00+00:00|2025-11-25 10:22:20.077 +0000|2025-11-25 10:22:24.808 +0000|2025-11-25 10:22:20.077 +0000|2025-11-25 10:22:26.313 +0000|         0.000088|       4.730966|
backfill__2024-03-14T00:00:00+00:00|backfill__2024-04-03T00:00:00+00:00|2025-11-25 10:26:07.912 +0000|2025-11-25 10:26:12.472 +0000|2025-11-25 10:26:07.914 +0000|2025-11-25 10:26:12.482 +0000|         0.002002|       4.557969|
backfill__2024-12-06T00:00:00+00:00|backfill__2024-12-26T00:00:00+00:00|2025-11-25 11:00:41.147 +0000|2025-11-25 11:00:45.795 +0000|2025-11-25 11:00:41.155 +0000|2025-11-25 11:00:45.084 +0000|         0.008247|       3.929261|
backfill__2024-12-20T00:00:00+00:00|backfill__2025-01-10T00:00:00+00:00|2025-11-25 11:02:46.481 +0000|2025-11-25 11:02:50.366 +0000|2025-11-25 11:02:46.482 +0000|2025-11-25 11:03:04.365 +0000|         0.000780|       3.884634|
backfill__2024-12-28T00:00:00+00:00|backfill__2025-01-18T00:00:00+00:00|2025-11-25 11:03:37.608 +0000|2025-11-25 11:04:11.477 +0000|2025-11-25 11:03:37.616 +0000|2025-11-25 11:03:39.634 +0000|         0.008725|       2.017774|
backfill__2025-03-08T00:00:00+00:00|backfill__2025-03-28T00:00:00+00:00|2025-11-25 11:13:15.325 +0000|2025-11-25 11:13:20.388 +0000|2025-11-25 11:13:15.330 +0000|2025-11-25 11:13:19.314 +0000|         0.004477|       3.984624|
backfill__2025-04-01T00:00:00+00:00|backfill__2025-04-21T00:00:00+00:00|2025-11-25 11:16:16.014 +0000|2025-11-25 11:16:21.355 +0000|2025-11-25 11:16:16.024 +0000|2025-11-25 11:16:20.349 +0000|         0.010557|       4.324642|
backfill__2025-04-11T00:00:00+00:00|backfill__2025-05-02T00:00:00+00:00|2025-11-25 11:17:41.027 +0000|2025-11-25 11:18:00.220 +0000|2025-11-25 11:17:41.027 +0000|2025-11-25 11:18:00.230 +0000|         0.000231|      19.192360|
backfill__2025-05-14T00:00:00+00:00|backfill__2025-06-03T00:00:00+00:00|2025-11-25 11:21:03.877 +0000|2025-11-25 11:21:12.330 +0000|2025-11-25 11:21:03.877 +0000|2025-11-25 11:21:12.341 +0000|        -0.000384|       8.452912|
backfill__2025-09-21T00:00:00+00:00|backfill__2025-10-11T00:00:00+00:00|2025-11-25 11:37:37.558 +0000|2025-11-25 11:37:41.712 +0000|2025-11-25 11:37:37.559 +0000|2025-11-25 11:37:41.723 +0000|         0.001120|       4.152781|
backfill__2025-10-12T00:00:00+00:00|backfill__2025-11-02T00:00:00+00:00|2025-11-25 11:41:10.639 +0000|2025-11-25 11:41:14.470 +0000|2025-11-25 11:41:10.643 +0000|2025-11-25 11:41:45.070 +0000|         0.004594|       3.827172|

This revealed 11 pairs of overlapping runs, though the count varies per backfill job.
I suspect the root cause is a race condition while querying or scheduling queued backfill runs in HA environments.
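
For readers who want to check the overlap arithmetic outside the database, the LEAST/GREATEST expression in the query can be mirrored in a few lines of Python. This is a minimal sketch: the helper names overlap_seconds and ts are made up for illustration, and the timestamps are copied from the first result row at the millisecond precision shown in the table, so the value differs slightly from the microsecond-precision overlap_seconds column.

```python
from datetime import datetime


def overlap_seconds(start1, end1, start2, end2):
    """Return the overlap of two intervals in seconds, or 0.0 if disjoint.

    Mirrors LEAST(end1, end2) - GREATEST(start1, start2) from the query,
    with an extra clamp to zero for intervals that do not overlap.
    """
    overlap = (min(end1, end2) - max(start1, start2)).total_seconds()
    return max(overlap, 0.0)


def ts(text):
    # Parse a timestamp as printed in the result table (offset omitted, all UTC).
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


# First row of the table: backfill__2024-02-08 vs backfill__2024-02-28.
first_row = overlap_seconds(
    ts("2025-11-25 10:22:20.077"), ts("2025-11-25 10:22:24.808"),
    ts("2025-11-25 10:22:20.077"), ts("2025-11-25 10:22:26.313"),
)
print(round(first_row, 3))
```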

What you think should happen instead?

  1. The backfill duration should accurately reflect the total runtime, regardless of run count.
  2. In HA mode, max_active_runs=1 should be strictly enforced for backfills, ensuring that only one DAG run executes at a time.
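
One way overlaps like those in the table can arise is a check-then-act race between two schedulers. The sketch below is purely illustrative: ToyRunTable and its methods are hypothetical, the interleaving is forced by hand rather than produced by real concurrency, and nothing here is taken from Airflow's scheduler. It only shows why a limit of 1 holds when the check and the update happen atomically and can fail when they do not.

```python
import threading

MAX_ACTIVE_RUNS = 1


class ToyRunTable:
    """Hypothetical in-memory stand-in for the set of running DAG runs."""

    def __init__(self):
        self.running = []
        self._lock = threading.Lock()

    def count_running(self):
        return len(self.running)

    def start_unguarded(self, run_id, observed_count):
        # Acts on a count that was read earlier and may already be stale.
        if observed_count < MAX_ACTIVE_RUNS:
            self.running.append(run_id)
            return True
        return False

    def start_guarded(self, run_id):
        # The check and the update happen together while holding the lock.
        with self._lock:
            if len(self.running) < MAX_ACTIVE_RUNS:
                self.running.append(run_id)
                return True
            return False


# Unguarded: both schedulers read the count before either one writes.
unguarded = ToyRunTable()
seen_by_a = unguarded.count_running()
seen_by_b = unguarded.count_running()
unguarded.start_unguarded("run_a", seen_by_a)
unguarded.start_unguarded("run_b", seen_by_b)

# Guarded: the second attempt sees the first run and is rejected.
guarded = ToyRunTable()
guarded.start_guarded("run_a")
guarded.start_guarded("run_b")

print(len(unguarded.running), len(guarded.running))
```

In a database-backed system the analogous guard would be a row lock or an atomic conditional update rather than an in-process lock.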

How to reproduce

  1. Deploy Airflow in High Availability (HA) mode.

  2. Use the default example_astronaut DAG (or any DAG) with max_active_runs=1 and a schedule of 0 0 * * *.

  3. Create a backfill job with a large time range — e.g.,

    start_date = 2023-01-01 14:48:00  
    end_date   = 2025-11-24 14:48:00
    

    (This produces ~1,058 runs.)

  4. Observe:

    • The backfill duration reported in the UI/logs is incorrect when run count > 500.
    • Multiple DAG runs may execute concurrently even though max_active_runs=1.
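
The run count quoted in step 3 can be sanity-checked with a short calculation. This sketch assumes one run per midnight tick of the 0 0 * * * schedule that falls inside the window; the helper name count_daily_runs is made up, and how Airflow itself maps a backfill range to logical dates is not modelled here.

```python
from datetime import datetime, timedelta


def count_daily_runs(start, end):
    """Count midnight ticks (cron 0 0 * * *) with start <= tick <= end."""
    first = start.replace(hour=0, minute=0, second=0, microsecond=0)
    if first < start:
        # The start time is after midnight, so the first tick is the next day.
        first += timedelta(days=1)
    if first > end:
        return 0
    return (end - first).days + 1


runs = count_daily_runs(
    datetime(2023, 1, 1, 14, 48),
    datetime(2025, 11, 24, 14, 48),
)
print(runs)  # 1058
```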

Operating System

debian

Versions of Apache Airflow Providers

No response

Deployment

Astronomer

Deployment details

No response

Anything else?

  1. This issue was observed on an HA setup (multiple schedulers).
  2. The behaviour appears to be race-condition dependent and inconsistent on smaller backfills.
  3. Increasing the backfill completion check frequency may mitigate the first issue.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Metadata

Labels

  • affected_version:3.1
  • area:Scheduler
  • area:backfill
  • area:core
  • kind:bug
  • priority:high
