@ephraimbuddy

When two schedulers run concurrently, together they could start more
backfill dag runs than max_active_runs allows. This happened because each
scheduler read the count of running dag runs before the other committed,
so both saw stale counts and started runs simultaneously.
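
The stale-read problem described above can be illustrated with a small, deterministic simulation. This is only a sketch and not Airflow code: `runs_to_start`, the module-level counters, and the numbers are hypothetical stand-ins for the scheduler logic and the database state.

```python
MAX_ACTIVE_RUNS = 2  # hypothetical limit for one backfill


def runs_to_start(running_count, queued, max_active_runs=MAX_ACTIVE_RUNS):
    """Return how many queued runs a scheduler believes it may start."""
    return min(queued, max(0, max_active_runs - running_count))


# Shared state standing in for the database (assumed values).
running = 0
queued = 4

# Both schedulers read the running count before either one commits.
seen_by_a = running
seen_by_b = running

# Each scheduler decides independently from its own stale snapshot.
start_a = runs_to_start(seen_by_a, queued)
start_b = runs_to_start(seen_by_b, queued - start_a)

# Both commit, and the combined total exceeds the limit.
running += start_a
running += start_b
print(f"running={running}, limit={MAX_ACTIVE_RUNS}")
```

Each scheduler stays within the limit on its own, yet the combined result does not, because neither decision accounted for the other's uncommitted work.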

The fix adds row-level locking on the Backfill table. When a scheduler
processes backfill dag runs, it first locks the relevant Backfill rows.
If another scheduler already holds the lock, the current scheduler skips
those backfills rather than potentially violating the max_active_runs
constraint.

This ensures that only one scheduler can process a given backfill's
dag runs at a time, preventing the race condition while remaining
non-blocking (schedulers don't wait on each other).
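
The lock-or-skip behaviour can be sketched in-process with a non-blocking lock. This is an analogy under stated assumptions, not the code from this PR: `BackfillRow` and `process_backfill` are hypothetical names, and a `threading.Lock` stands in for the database row lock on the Backfill table.

```python
import threading


class BackfillRow:
    """Hypothetical stand-in for one row of the Backfill table."""

    def __init__(self, backfill_id, max_active_runs, queued):
        self.backfill_id = backfill_id
        self.max_active_runs = max_active_runs
        self.queued = queued
        self.running = 0
        self.lock = threading.Lock()  # models the row-level lock


def process_backfill(row):
    """Start as many queued runs as the limit allows.

    Returns the number of runs started, or None if another scheduler
    holds the row lock and this backfill was skipped.
    """
    # Non-blocking acquire: skip instead of waiting for the other scheduler.
    if not row.lock.acquire(blocking=False):
        return None
    try:
        free_slots = max(0, row.max_active_runs - row.running)
        started = min(free_slots, row.queued)
        row.running += started
        row.queued -= started
        return started
    finally:
        row.lock.release()


row = BackfillRow("bf-1", max_active_runs=2, queued=4)

# Scheduler A holds the row lock; scheduler B arrives and skips.
row.lock.acquire()
skipped = process_backfill(row)
row.lock.release()

# Once the lock is free, processing is serialized and respects the limit.
first = process_backfill(row)
second = process_backfill(row)
```

Because the count is read and updated only while the lock is held, the second pass sees the runs started by the first and starts none. In a database-backed setting the same idea is usually expressed with a `SELECT ... FOR UPDATE SKIP LOCKED` style query, where the database supports it.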

Related: #58752

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Nov 28, 2025
Co-authored-by: Wei Lee <weilee.rx@gmail.com>
@ephraimbuddy ephraimbuddy merged commit 22af27e into apache:main Dec 2, 2025
65 checks passed
@ephraimbuddy ephraimbuddy deleted the fix-backfill-max-active-runs branch December 2, 2025 11:11
@github-actions
github-actions bot commented Dec 2, 2025

Backport failed to create: v3-1-test. View the failure log Run details

Status Branch Result
v3-1-test Commit Link

You can attempt to backport this manually by running:

cherry_picker 22af27e v3-1-test

This should apply the commit to the v3-1-test branch and leave it in a conflict state, marking
the files that need manual conflict resolution.

After you have resolved the conflicts, you can continue the backport process by running:

cherry_picker --continue

ephraimbuddy added a commit that referenced this pull request Dec 2, 2025
#58807)

* Fix backfill max_active_runs race condition with concurrent schedulers

(cherry picked from commit 22af27e)
RoyLee1224 pushed a commit to RoyLee1224/airflow that referenced this pull request Dec 3, 2025
apache#58807)

* Fix backfill max_active_runs race condition with concurrent schedulers

itayweb pushed a commit to itayweb/airflow that referenced this pull request Dec 6, 2025
apache#58807)

* Fix backfill max_active_runs race condition with concurrent schedulers

Labels

area:Scheduler including HA (high availability) scheduler backport-to-v3-1-test Mark PR with this label to backport to v3-1-test branch
