Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for retry when tasks are stuck in queued #43520

Merged
merged 66 commits into from
Nov 16, 2024
Merged

Conversation

dimberman
Copy link
Contributor

@dimberman dimberman commented Oct 30, 2024

Tasks can get stuck in queued for a wide variety of reasons (e.g. celery loses
track of a task, a cluster can't further scale up its workers, etc.), but tasks
should not be stuck in queued for a long time.

Originally, we simply marked a task as failed when it was stuck in queued for
too long. We found that this led to suboptimal outcomes as ideally we would like "failed"
to mean that a task was unable to run, instead of it meaning that we were unable to run the task.

As a compromise between always failing a stuck task and always rescheduling a stuck task (which could
lead to tasks being stuck in queued forever without informing the user), we have creating the config
[core] num_stuck_reschedules. With this new configuration, an airflow admin can decide how
sensitive they would like their airflow to be WRT failing stuck tasks.

Here is an example of what it looks like after trying this out with celery executor

image

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Oct 30, 2024
…ed_timeout`.

Tasks can get stuck in queued for a wide variety of reasons (e.g. celery loses
track of a task, a cluster can't further scale up its workers, etc.), but tasks
should not be stuck in queued for a long time.

Originally, we simply marked a task as failed when it was stuck in queued for
too long. We found that this led to suboptimal outcomes as ideally we would like "failed"
to mean that a task was unable to run, instead of it meaning that we were unable to run the task.

As a compromise between always failing a stuck task and always rescheduling a stuck task (which could
lead to tasks being stuck in queued forever without informing the user), we have creating the config
`AIRFLOW__CORE__NUM_STUCK_RETRIES`. With this new configuration, an airflow admin can decide how
sensitive they would like their airflow to be WRT failing stuck tasks.
@dimberman dimberman force-pushed the handle-stuck-in-queue branch from 1f8b642 to 8eb60b1 Compare November 1, 2024 16:22
airflow/jobs/scheduler_job_runner.py Outdated Show resolved Hide resolved
airflow/jobs/scheduler_job_runner.py Outdated Show resolved Hide resolved
uv.lock Outdated Show resolved Hide resolved
dimberman and others added 4 commits November 1, 2024 10:42
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
…s since they will not have the scheduler change
…flow into handle-stuck-in-queue
airflow/jobs/scheduler_job_runner.py Outdated Show resolved Hide resolved
airflow/jobs/scheduler_job_runner.py Outdated Show resolved Hide resolved
@dimberman dimberman marked this pull request as ready for review November 1, 2024 20:47
dimberman and others added 6 commits November 1, 2024 13:47
…flow into handle-stuck-in-queue
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
dimberman and others added 3 commits November 4, 2024 09:13
…ed_timeout`.

Tasks can get stuck in queued for a wide variety of reasons (e.g. celery loses
track of a task, a cluster can't further scale up its workers, etc.), but tasks
should not be stuck in queued for a long time.

Originally, we simply marked a task as failed when it was stuck in queued for
too long. We found that this led to suboptimal outcomes as ideally we would like "failed"
to mean that a task was unable to run, instead of it meaning that we were unable to run the task.

As a compromise between always failing a stuck task and always rescheduling a stuck task (which could
lead to tasks being stuck in queued forever without informing the user), we have creating the config
`AIRFLOW__CORE__NUM_STUCK_RETRIES`. With this new configuration, an airflow admin can decide how
sensitive they would like their airflow to be WRT failing stuck tasks.
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
@dstandish dstandish merged commit a41feeb into main Nov 16, 2024
73 checks passed
jscheffl pushed a commit to jscheffl/airflow that referenced this pull request Nov 18, 2024
The old "stuck in queued" logic just failed the tasks.  Now we requeue them.  We accomplish this by revoking the task from executor and setting state to scheduled.  We'll re-queue it up to 2 times.  Number of times is configurable by hidden config.

We added a method to base executor revoke_task because, it's a discrete operation that is required for this feature, and it might be useful in other cases e.g. when detecting as zombies etc.  We set state to failed or scheduled directly from scheduler (rather than sending through the event buffer) because event buffer makes more sense for handling external events -- why round trip through the executor and back to scheduler when scheduler is initiating the action?  Anyway this avoids having to deal with "state mismatch" issues when processing events.

---------

(cherry picked from commit a41feeb)

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
@jedcunningham jedcunningham deleted the handle-stuck-in-queue branch November 18, 2024 21:24
kandharvishnu pushed a commit to kandharvishnu/airflow that referenced this pull request Nov 19, 2024
The old "stuck in queued" logic just failed the tasks.  Now we requeue them.  We accomplish this by revoking the task from executor and setting state to scheduled.  We'll re-queue it up to 2 times.  Number of times is configurable by hidden config.

We added a method to base executor revoke_task because, it's a discrete operation that is required for this feature, and it might be useful in other cases e.g. when detecting as zombies etc.  We set state to failed or scheduled directly from scheduler (rather than sending through the event buffer) because event buffer makes more sense for handling external events -- why round trip through the executor and back to scheduler when scheduler is initiating the action?  Anyway this avoids having to deal with "state mismatch" issues when processing events.

---------

Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
@dstandish dstandish added this to the Airflow 2.10.4 milestone Nov 19, 2024
jscheffl added a commit that referenced this pull request Nov 19, 2024
…44158)

* [v2-10-test] Re-queue tassk when they are stuck in queued (#43520)

The old "stuck in queued" logic just failed the tasks.  Now we requeue them.  We accomplish this by revoking the task from executor and setting state to scheduled.  We'll re-queue it up to 2 times.  Number of times is configurable by hidden config.

We added a method to base executor revoke_task because, it's a discrete operation that is required for this feature, and it might be useful in other cases e.g. when detecting as zombies etc.  We set state to failed or scheduled directly from scheduler (rather than sending through the event buffer) because event buffer makes more sense for handling external events -- why round trip through the executor and back to scheduler when scheduler is initiating the action?  Anyway this avoids having to deal with "state mismatch" issues when processing events.

---------

(cherry picked from commit a41feeb)

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

* fix test_handle_stuck_queued_tasks_multiple_attempts (#44093)

---------

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: GPK <gopidesupavan@gmail.com>
dstandish added a commit that referenced this pull request Nov 19, 2024
This is a fix up / followup to #43520

It does not really make a material difference, just, I'm avoiding use of the session decorator, and the create / dispose session logic, when it is not needed. i also commit as i go along since there's no reason to handle multiple distinct tis in the same transaction.
utkarsharma2 pushed a commit that referenced this pull request Dec 4, 2024
…44158)

* [v2-10-test] Re-queue tassk when they are stuck in queued (#43520)

The old "stuck in queued" logic just failed the tasks.  Now we requeue them.  We accomplish this by revoking the task from executor and setting state to scheduled.  We'll re-queue it up to 2 times.  Number of times is configurable by hidden config.

We added a method to base executor revoke_task because, it's a discrete operation that is required for this feature, and it might be useful in other cases e.g. when detecting as zombies etc.  We set state to failed or scheduled directly from scheduler (rather than sending through the event buffer) because event buffer makes more sense for handling external events -- why round trip through the executor and back to scheduler when scheduler is initiating the action?  Anyway this avoids having to deal with "state mismatch" issues when processing events.

---------

(cherry picked from commit a41feeb)

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

* fix test_handle_stuck_queued_tasks_multiple_attempts (#44093)

---------

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: GPK <gopidesupavan@gmail.com>
utkarsharma2 pushed a commit that referenced this pull request Dec 9, 2024
…44158)

* [v2-10-test] Re-queue tassk when they are stuck in queued (#43520)

The old "stuck in queued" logic just failed the tasks.  Now we requeue them.  We accomplish this by revoking the task from executor and setting state to scheduled.  We'll re-queue it up to 2 times.  Number of times is configurable by hidden config.

We added a method to base executor revoke_task because, it's a discrete operation that is required for this feature, and it might be useful in other cases e.g. when detecting as zombies etc.  We set state to failed or scheduled directly from scheduler (rather than sending through the event buffer) because event buffer makes more sense for handling external events -- why round trip through the executor and back to scheduler when scheduler is initiating the action?  Anyway this avoids having to deal with "state mismatch" issues when processing events.

---------

(cherry picked from commit a41feeb)

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

* fix test_handle_stuck_queued_tasks_multiple_attempts (#44093)

---------

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: GPK <gopidesupavan@gmail.com>
LefterisXefteris pushed a commit to LefterisXefteris/airflow that referenced this pull request Jan 5, 2025
The old "stuck in queued" logic just failed the tasks.  Now we requeue them.  We accomplish this by revoking the task from executor and setting state to scheduled.  We'll re-queue it up to 2 times.  Number of times is configurable by hidden config.

We added a method to base executor revoke_task because, it's a discrete operation that is required for this feature, and it might be useful in other cases e.g. when detecting as zombies etc.  We set state to failed or scheduled directly from scheduler (rather than sending through the event buffer) because event buffer makes more sense for handling external events -- why round trip through the executor and back to scheduler when scheduler is initiating the action?  Anyway this avoids having to deal with "state mismatch" issues when processing events.

---------

Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
LefterisXefteris pushed a commit to LefterisXefteris/airflow that referenced this pull request Jan 5, 2025
)

This is a fix up / followup to apache#43520

It does not really make a material difference, just, I'm avoiding use of the session decorator, and the create / dispose session logic, when it is not needed. i also commit as i go along since there's no reason to handle multiple distinct tis in the same transaction.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants