Description
Apache Airflow version
2.10.5
If "Other Airflow 2 version" selected, which one?
No response
What happened?
A reschedule sensor was running for many hours. Over the course of that period, three of its queued attempts failed to transition from queued to running in time. The task was then marked as failed by the stuck-in-queued requeue logic (which bypasses retries).
What you think should happen instead?
When a reschedule sensor makes it to running, the requeue count should "reset". I only care if it fails the queued->running transition three times in a row. The current system assumes that a single try will go from queued to running only once, but reschedule sensors break that assumption.
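The difference between the two counting behaviors can be illustrated with a small sketch (a hypothetical in-memory event log, not Airflow's actual data model):

```python
from datetime import datetime, timedelta

def stuck_count_total(events):
    """Current behavior (simplified): count every stuck-in-queued
    event for the try, regardless of intervening runs."""
    return sum(1 for state, _ in events if state == "stuck_in_queued")

def stuck_count_consecutive(events):
    """Proposed behavior: only count stuck-in-queued events that
    occurred after the most recent transition to running."""
    count = 0
    for state, _ in events:  # events in chronological order
        if state == "stuck_in_queued":
            count += 1
        elif state == "running":
            count = 0  # reaching running resets the streak
    return count

t0 = datetime(2025, 1, 1)
# A reschedule sensor over many hours: three separate stuck events,
# each followed by a successful run when workers came back.
events = [
    ("stuck_in_queued", t0),
    ("running", t0 + timedelta(hours=1)),
    ("stuck_in_queued", t0 + timedelta(hours=2)),
    ("running", t0 + timedelta(hours=3)),
    ("stuck_in_queued", t0 + timedelta(hours=4)),
]
print(stuck_count_total(events))        # 3 -> task is failed today
print(stuck_count_consecutive(events))  # 1 -> task would survive
```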
How to reproduce
1. Set a reschedule sensor to run for a long time, with a condition that will never be met.
2. Use the Celery executor.
3. Remove all workers from the task's queue for a brief period, so the task fails to reach running.
4. Return the workers, allowing the task to reach running.
5. Repeat steps 3 and 4 two more times.
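A minimal DAG along these lines (the DAG id, queue name, and intervals are hypothetical; the callable always returns False so the sensor reschedules indefinitely):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.python import PythonSensor

with DAG(
    dag_id="stuck_in_queued_repro",   # hypothetical name
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    never_ready = PythonSensor(
        task_id="never_ready",
        python_callable=lambda: False,  # condition is never met
        mode="reschedule",
        poke_interval=60,               # reschedule every minute
        timeout=timedelta(hours=12).total_seconds(),
        queue="isolated_queue",         # drain this queue's workers to trigger the bug
    )
```

With the sensor pinned to its own queue, the workers for that queue can be stopped and restarted repeatedly without affecting other tasks.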
Operating System
debian bookworm
Versions of Apache Airflow Providers
Was using external task sensor in reschedule mode.
Deployment
Astronomer
Deployment details
No response
Anything else?
I believe the right fix for this is in _get_num_times_stuck_in_queued in scheduler_job_runner.py.
First check whether there are any running events for the current try; if there are, filter the existing query to events with timestamps after the most recent running event.
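A sketch of that change in SQLAlchemy terms. This is not the actual Airflow source: the Log columns and event strings here are assumptions, and the real method's query shape may differ.

```python
from sqlalchemy import select, func

def _get_num_times_stuck_in_queued(self, ti, session):
    # Sketch only: column names (Log.dttm, Log.event) and event string
    # values below are approximations of Airflow's Log model.
    base_filter = (
        Log.dag_id == ti.dag_id,
        Log.task_id == ti.task_id,
        Log.run_id == ti.run_id,
        Log.try_number == ti.try_number,
    )

    # Most recent time this try actually reached running.
    last_running = session.scalar(
        select(func.max(Log.dttm)).where(
            *base_filter,
            Log.event == "running",  # assumed event name
        )
    )

    stuck = select(func.count()).where(
        *base_filter,
        Log.event == "stuck in queued reschedule",  # assumed event name
    )
    if last_running is not None:
        # Only count stuck events after the task last reached running,
        # so each successful queued->running transition resets the count.
        stuck = stuck.where(Log.dttm > last_running)
    return session.scalar(stuck)
```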
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct