
@ephraimbuddy
Contributor

When two schedulers run concurrently, together they could start more backfill dag runs than max_active_runs allows. This happened because each scheduler read the count of running dag runs before either committed, so both saw stale counts and started runs simultaneously.
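To make the stale-read problem concrete, here is a minimal, deterministic sketch of the interleaving described above. It is an illustration only, not Airflow code: `runs_to_start`, the limit of 2, and the queue size of 5 are all hypothetical values chosen for the example.

```python
MAX_ACTIVE_RUNS = 2  # hypothetical limit for one backfill


def runs_to_start(running_count, queued, max_active_runs):
    """Number of queued runs a scheduler would start, given the count it read."""
    free_slots = max(0, max_active_runs - running_count)
    return min(queued, free_slots)


# Nothing is running yet, and neither scheduler has committed anything.
committed_running = 0

# Both schedulers read the running count before either commits.
seen_by_a = committed_running
seen_by_b = committed_running

# Each one independently concludes that two slots are free.
started_by_a = runs_to_start(seen_by_a, queued=5, max_active_runs=MAX_ACTIVE_RUNS)
started_by_b = runs_to_start(seen_by_b, queued=5, max_active_runs=MAX_ACTIVE_RUNS)

# After both commit, the limit has been exceeded.
total_running = committed_running + started_by_a + started_by_b
print(total_running)  # 4, although the limit is 2
```

Each scheduler's decision is correct for the count it saw; the violation only appears once both commits land.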

The fix adds row-level locking on the Backfill table. When a scheduler processes backfill dag runs, it first locks the relevant Backfill rows. If another scheduler already holds the lock, the current scheduler skips those backfills rather than potentially violating the max_active_runs constraint.

This ensures that only one scheduler can process a given backfill's dag runs at a time, preventing the race condition while remaining non-blocking (schedulers don't wait on each other).
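The skip-if-locked behaviour can be sketched in-process with a non-blocking lock. This is an analogy under stated assumptions, not the patch itself: `BackfillRow` and `schedule_backfill` are hypothetical names, and a `threading.Lock` stands in for the database row lock. In SQL, this pattern is commonly written as `SELECT ... FOR UPDATE SKIP LOCKED`; the description here does not say which mechanism the actual change uses.

```python
import threading


class BackfillRow:
    """Hypothetical stand-in for one row of the Backfill table."""

    def __init__(self, backfill_id, max_active_runs):
        self.id = backfill_id
        self.max_active_runs = max_active_runs
        self.running = 0
        self.lock = threading.Lock()  # plays the role of the row-level lock


def schedule_backfill(row, queued):
    """Start as many runs as the limit allows.

    Returns the number of runs started, or None if the row was already
    locked by another scheduler and this one skipped it without waiting.
    """
    if not row.lock.acquire(blocking=False):
        return None  # someone else is processing this backfill; skip it
    try:
        free_slots = max(0, row.max_active_runs - row.running)
        started = min(queued, free_slots)
        row.running += started
        return started
    finally:
        row.lock.release()


row = BackfillRow(backfill_id=1, max_active_runs=2)

# Scheduler A is mid-transaction and holds the row lock.
row.lock.acquire()
skipped = schedule_backfill(row, queued=5)  # scheduler B skips: returns None

# Scheduler A starts its runs and "commits" by releasing the lock.
row.running += 2
row.lock.release()

# On its next loop, scheduler B sees the committed count and starts nothing.
later = schedule_backfill(row, queued=5)
print(skipped, later, row.running)  # None 0 2
```

Skipping instead of waiting means a scheduler may pick up a given backfill one loop later, but it never acts on a count that another scheduler is in the middle of changing.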

(cherry picked from commit 22af27e)

#58807)

* Fix backfill max_active_runs race condition with concurrent schedulers

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Dec 2, 2025
@ephraimbuddy ephraimbuddy requested review from Lee-W, ashb, potiuk and vatsrahul1001 and removed request for XD-DENG and ashb December 2, 2025 11:19
@Lee-W Lee-W merged commit ca9ee6f into v3-1-test Dec 2, 2025
62 of 63 checks passed
@Lee-W Lee-W deleted the backport-22af27e-v3-1-test branch December 2, 2025 12:17
@ephraimbuddy ephraimbuddy added this to the Airflow 3.1.4 milestone Dec 2, 2025
@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Dec 3, 2025

Labels

area:Scheduler including HA (high availability) scheduler
type:bug-fix Changelog: Bug Fixes
