Allow for retry when tasks are stuck in queued #43520

dimberman · 2024-10-30T17:24:22Z

Tasks can get stuck in queued for a wide variety of reasons (e.g. celery loses
track of a task, a cluster can't further scale up its workers, etc.), but tasks
should not be stuck in queued for a long time.

Originally, we simply marked a task as failed when it was stuck in queued for
too long. This led to suboptimal outcomes: ideally, "failed" should mean that the task itself
was unable to run successfully, not that Airflow was unable to run the task.

As a compromise between always failing a stuck task and always rescheduling a stuck task (which could
lead to tasks being stuck in queued forever without informing the user), we have created the config
[core] num_stuck_reschedules. With this new configuration, an Airflow admin can decide how
sensitive they would like their Airflow deployment to be with respect to failing stuck tasks.
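The trade-off described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the code in this PR: the function name `decide_stuck_task_action`, the `reschedule_attempts` counter, and the `max_reschedules` parameter (standing in for the value of `[core] num_stuck_reschedules`) are all hypothetical.

```python
def decide_stuck_task_action(reschedule_attempts: int, max_reschedules: int) -> str:
    """Return "reschedule" while attempts remain, otherwise "fail".

    reschedule_attempts: how many times this task was already rescheduled
        after being found stuck in queued (hypothetical counter).
    max_reschedules: stand-in for the configured limit; 0 reproduces the
        old behaviour of failing a stuck task immediately.
    """
    if max_reschedules < 0:
        raise ValueError("max_reschedules must be non-negative")
    if reschedule_attempts < max_reschedules:
        return "reschedule"
    return "fail"


if __name__ == "__main__":
    # With a limit of 2, a task is rescheduled twice and then failed.
    for attempts in range(4):
        print(attempts, decide_stuck_task_action(attempts, max_reschedules=2))
```

A higher limit makes the deployment more tolerant of transient queueing problems, at the cost of taking longer to surface a task that can never be run.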

Here is an example of what it looks like after trying this out with the Celery executor (screenshot not captured in this text).

…ed_timeout`.

Tasks can get stuck in queued for a wide variety of reasons (e.g. Celery loses track of a task, a cluster can't further scale up its workers, etc.), but tasks should not be stuck in queued for a long time.

Originally, we simply marked a task as failed when it was stuck in queued for too long. This led to suboptimal outcomes: ideally, "failed" should mean that the task itself was unable to run successfully, not that Airflow was unable to run the task.

As a compromise between always failing a stuck task and always rescheduling a stuck task (which could lead to tasks being stuck in queued forever without informing the user), we have created the config `AIRFLOW__CORE__NUM_STUCK_RETRIES`. With this new configuration, an Airflow admin can decide how sensitive they would like their Airflow deployment to be with respect to failing stuck tasks.

The old "stuck in queued" logic just failed the tasks. Now we requeue them. We accomplish this by revoking the task from the executor and setting its state to scheduled. We'll re-queue it up to 2 times. The number of times is configurable via a hidden config.

We added a method, revoke_task, to the base executor because it's a discrete operation that is required for this feature, and it might be useful in other cases, e.g. when detecting tasks as zombies.

We set the state to failed or scheduled directly from the scheduler (rather than sending it through the event buffer) because the event buffer makes more sense for handling external events: why round trip through the executor and back to the scheduler when the scheduler is initiating the action? This also avoids having to deal with "state mismatch" issues when processing events.

---------

(cherry picked from commit a41feeb)

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
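The revoke-then-requeue flow in this commit message can be sketched with toy classes. This is a minimal illustration, not Airflow's implementation: `ToyTaskInstance`, `ToyExecutor`, `handle_stuck_in_queued`, the `requeue_attempts` counter, and the `max_requeues` parameter are hypothetical names. Only the idea of a `revoke_task` method on the executor, and of the scheduler setting the state to scheduled or failed itself, comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTaskInstance:
    key: str
    state: str = "queued"
    requeue_attempts: int = 0  # hypothetical counter; the PR may track this differently


@dataclass
class ToyExecutor:
    queued: dict = field(default_factory=dict)
    running: set = field(default_factory=set)

    def revoke_task(self, ti: ToyTaskInstance) -> None:
        """Forget the task so the executor no longer tries to run or track it."""
        self.queued.pop(ti.key, None)
        self.running.discard(ti.key)


def handle_stuck_in_queued(ti: ToyTaskInstance, executor: ToyExecutor, max_requeues: int = 2) -> str:
    """Revoke a stuck task, then requeue or fail it directly (no event buffer)."""
    executor.revoke_task(ti)
    if ti.requeue_attempts < max_requeues:
        ti.requeue_attempts += 1
        ti.state = "scheduled"  # the scheduler will pick it up and queue it again
    else:
        ti.state = "failed"  # attempts exhausted: surface the problem to the user
    return ti.state


if __name__ == "__main__":
    executor = ToyExecutor()
    ti = ToyTaskInstance(key="example_task")
    for _ in range(3):
        # Simulate the task being queued and then detected as stuck.
        ti.state = "queued"
        executor.queued[ti.key] = "command"
        print(handle_stuck_in_queued(ti, executor))
```

In this sketch the state change happens in the same function that calls `revoke_task`, which mirrors the argument above for not round-tripping through the executor.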

…44158)

* [v2-10-test] Re-queue tassk when they are stuck in queued (#43520)

  The old "stuck in queued" logic just failed the tasks. Now we requeue them. We accomplish this by revoking the task from the executor and setting its state to scheduled. We'll re-queue it up to 2 times. The number of times is configurable via a hidden config.

  We added a method, revoke_task, to the base executor because it's a discrete operation that is required for this feature, and it might be useful in other cases, e.g. when detecting tasks as zombies.

  We set the state to failed or scheduled directly from the scheduler (rather than sending it through the event buffer) because the event buffer makes more sense for handling external events: why round trip through the executor and back to the scheduler when the scheduler is initiating the action? This also avoids having to deal with "state mismatch" issues when processing events.

  ---------

  (cherry picked from commit a41feeb)

  Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
  Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
  Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

* fix test_handle_stuck_queued_tasks_multiple_attempts (#44093)

---------

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: GPK <gopidesupavan@gmail.com>

This is a fix-up / follow-up to #43520. It does not make a material difference; I'm just avoiding the session decorator, and the create / dispose session logic, when it is not needed. I also commit as I go along, since there's no reason to handle multiple distinct TIs in the same transaction.
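The "commit as I go along" idea can be sketched with the standard-library `sqlite3` module standing in for the real database session. This is an assumption-laden illustration: the `task_instance` table layout and the `reschedule_stuck` function are hypothetical, and Airflow itself uses SQLAlchemy sessions rather than raw `sqlite3` connections.

```python
import sqlite3


def reschedule_stuck(conn: sqlite3.Connection, stuck_ids: list[int]) -> None:
    """Set each stuck row back to 'scheduled', committing one row at a time.

    The caller passes in an existing connection, so nothing is created or
    disposed of here, and each task instance gets its own transaction.
    """
    for ti_id in stuck_ids:
        conn.execute(
            "UPDATE task_instance SET state = 'scheduled' WHERE id = ?",
            (ti_id,),
        )
        conn.commit()  # a later failure cannot roll back rows already handled


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task_instance (id INTEGER PRIMARY KEY, state TEXT)")
    conn.executemany(
        "INSERT INTO task_instance VALUES (?, ?)",
        [(1, "queued"), (2, "queued"), (3, "running")],
    )
    conn.commit()
    reschedule_stuck(conn, [1, 2])
    print(conn.execute("SELECT id, state FROM task_instance ORDER BY id").fetchall())
```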

boring-cyborg bot added the area:Scheduler label Oct 30, 2024

dstandish reviewed Oct 30, 2024

airflow/jobs/scheduler_job_runner.py (outdated)

o-nikolas reviewed Oct 30, 2024

airflow/jobs/scheduler_job_runner.py (outdated)

dstandish reviewed Oct 30, 2024

airflow/jobs/scheduler_job_runner.py (outdated)

dstandish reviewed Oct 30, 2024

airflow/jobs/scheduler_job_runner.py (outdated)

dimberman force-pushed the handle-stuck-in-queue branch from 1f8b642 to 8eb60b1 on November 1, 2024 16:22

address feedback

066f672

jedcunningham reviewed Nov 1, 2024

airflow/jobs/scheduler_job_runner.py (outdated)

airflow/jobs/scheduler_job_runner.py (outdated)

providers/src/airflow/providers/celery/executors/celery_executor.py

uv.lock (outdated)

dimberman and others added 4 commits November 1, 2024 10:42

remove uv.lock

93cb9ce

Update airflow/jobs/scheduler_job_runner.py

4a5c9e5

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

We need to ensure that older versions of airflow don't run into issue…

076c257

…s since they will not have the scheduler change

Merge branch 'handle-stuck-in-queue' of https://github.com/apache/air…

f46d4b0

…flow into handle-stuck-in-queue

jedcunningham reviewed Nov 1, 2024

airflow/jobs/scheduler_job_runner.py (outdated)

airflow/jobs/scheduler_job_runner.py (outdated)

dimberman marked this pull request as ready for review November 1, 2024 20:47

dimberman requested review from hussein-awala, ashb and XD-DENG as code owners November 1, 2024 20:47

dimberman and others added 6 commits November 1, 2024 13:47

Merge branch 'main' into handle-stuck-in-queue

4fe701e

address feedback

9987963

Merge branch 'handle-stuck-in-queue' of https://github.com/apache/air…

c39e52f

…flow into handle-stuck-in-queue

pre-commit

05fc02d

Update airflow/jobs/scheduler_job_runner.py

a1564cc

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

Merge branch 'main' into handle-stuck-in-queue

cbc3453

jedcunningham reviewed Nov 4, 2024

airflow/config_templates/config.yml (outdated)

dimberman and others added 3 commits November 4, 2024 09:13

Merge branch 'handle-stuck-in-queue' of https://github.com/apache/air…

1c34eea

…flow into handle-stuck-in-queue

Update airflow/config_templates/config.yml

2113384

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

dstandish mentioned this pull request Nov 4, 2024

Simplify the handle stuck in queued interface #43647

Merged

k8s support

c18f8f4

dstandish added 10 commits November 14, 2024 13:24

Merge branch 'main' into handle-stuck-in-queue

5dce100

Change this from "clean up stuck queued" to "revoke_task"

c4f4204

update tests

4bd5dbe

fix docstring

37b765a

docstring

4a503eb

make the param undocumented

fe8afe0

fix fallback

ea1faf9

ensure task removed from running / queued

f5eba8c

update test

14b233e

fix test mistake

3aa1f37

jedcunningham reviewed Nov 15, 2024

airflow/jobs/scheduler_job_runner.py (outdated)

airflow/jobs/scheduler_job_runner.py (outdated)

providers/src/airflow/providers/celery/executors/celery_executor.py

jedcunningham approved these changes Nov 15, 2024

small nits

1d85dbf

dstandish merged commit a41feeb into main Nov 16, 2024
73 checks passed

jscheffl reviewed Nov 16, 2024

airflow/executors/base_executor.py

jscheffl mentioned this pull request Nov 18, 2024

[v2-10-test] Re-queue tassk when they are stuck in queued (#43520) #44158

Merged

jedcunningham deleted the handle-stuck-in-queue branch November 18, 2024 21:24

dstandish mentioned this pull request Nov 19, 2024

Don't create new session in stuck queue reschedule handler #44192

Merged

dstandish added this to the Airflow 2.10.4 milestone Nov 19, 2024

eladkal mentioned this pull request Nov 24, 2024

Status of testing Providers that were prepared on November 24, 2024 #44324

Closed

22 tasks

utkarsharma2 mentioned this pull request Dec 10, 2024

Status of testing of Apache Airflow 2.10.4rc1 #44811

Closed

33 tasks

karenbraganz mentioned this pull request Dec 23, 2024

Allow internal retries when pending k8s pod is deleted #45184

Merged
