Skip to content

Requeues doesn't work properly with reschedule sensors #49971

@collinmcnulty

Description

@collinmcnulty

Apache Airflow version

2.10.5

If "Other Airflow 2 version" selected, which one?

No response

What happened?

A reschedule sensor was running for many hours. Over the course of that period, three of the "attempts" failed to go from queued to running in time. The task was then marked as failed by the requeue system (which bypasses retries).

What you think should happen instead?

When a reschedule sensor makes it to running, the requeue count should "reset". I only care if it failed to queue->running three times in a row. The current system assumes that a single try will only go from queued to running once, but reschedule sensors break that assumption.

How to reproduce

  1. set a reschedule sensor to run for a long time, that will never have its condition met
  2. use celery executor
  3. remove all workers in the task's queue for a brief period, so it fails to get to running
  4. return the workers, allowing it to get to running
  5. repeat 3 and 4 two more times

Operating System

debian bookworm

Versions of Apache Airflow Providers

Was using external task sensor in reschedule mode.

Deployment

Astronomer

Deployment details

No response

Anything else?

I believe the right fix for this is in _get_num_times_stuck_in_queued in scheduler_job_runner.py

First check if there are any running events, then if there are, filter the existing query to datetimes after the most recent running event.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:Schedulerincluding HA (high availability) schedulerarea:corekind:bugThis is a clearly a bug

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions