
Conversation

@karenbraganz
Collaborator

In issue #51301, it was reported that failure callbacks do not run for task instances that get stuck in queued and then fail in Airflow 2.10.5. This is caused by the changes introduced in PR #43520, which added logic to requeue tasks that get stuck in queued (up to two times by default) before failing them.

Previously, the executor's fail method was called when a task had to be failed after the maximum requeue attempts. That PR replaced the call with the task instance's set_state method: ti.set_state(TaskInstanceState.FAILED, session=session). Without the executor's fail method being called, failure callbacks are not executed for such task instances. This PR therefore changes the code to call the executor's fail method instead in Airflow 3.
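
A rough sketch of the intended change (simplified; the real logic lives in the scheduler's stuck-in-queued handling, and the helper name here is illustrative):

```python
def _fail_stuck_ti(ti, executor, session):
    """Fail a TI that has exhausted its requeue attempts while stuck in queued.

    Illustrative helper; the real code lives in scheduler_job_runner.py.
    """
    # Old behaviour: set the state directly, which bypasses the executor and
    # therefore never dispatches the failure callback:
    #   ti.set_state(TaskInstanceState.FAILED, session=session)
    # New behaviour: route the failure through the executor so its fail()
    # method runs and a failure-callback request is created.
    executor.fail(ti.key)
```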

I have created PR #53038 to address this issue separately in Airflow 2.11.



@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Jul 16, 2025
@karenbraganz karenbraganz self-assigned this Jul 16, 2025
@karenbraganz karenbraganz marked this pull request as ready for review July 16, 2025 22:04
Contributor

@Nataneljpwd Nataneljpwd left a comment

Looks good!
A quick and simple solution that addresses the issue well.

@karenbraganz karenbraganz added this to the Airflow 3.0.4 milestone Jul 18, 2025
Member

@potiuk potiuk left a comment

LGTM - but others who are more executor-oriented can verify it @ashb @amoghrajesh @o-nikolas ?

@karenbraganz
Collaborator Author

I was testing this PR out by repeatedly getting tasks stuck in queued. Initially, the task instances would get stuck in queued, fail, and the callbacks would run. After a few rounds of this, the below exception was raised:
(screenshot of the exception traceback, a DetachedInstanceError)

According to the code, the session is not committed until after _maybe_requeue_stuck_ti completes running. Any idea what would cause the TaskInstance object to be detached from a session? As mentioned before, the code works as expected most of the time. This happens intermittently.

@Nataneljpwd
Contributor

@karenbraganz it could be due to the expunge happening here:

https://github.com/apache/airflow/blob/main/airflow-core%2Fsrc%2Fairflow%2Fjobs%2Fscheduler_job_runner.py#L1286

This detaches all the instance objects from the session. I couldn't follow the full flow because I did not have time, but I think it might be related, or at least it is a good place to start investigating.

@karenbraganz
Collaborator Author

@Nataneljpwd The session being expunged in the link you provided is not the same session being used in _maybe_requeue_stuck_ti. The function _maybe_requeue_stuck_ti uses the session that is passed to it when it is called in _handle_tasks_stuck_in_queued. This session is created in _handle_tasks_stuck_in_queued and is not committed until after _maybe_requeue_stuck_ti is completed, so I did not expect the TaskInstance to be detached from the session in _maybe_requeue_stuck_ti. Please let me know if I am mistaken about something.

@Nataneljpwd
Contributor

@karenbraganz, I think the issue might be that, if there is more than one stuck task, the session is committed before all tasks have been changed or requeued, so the remaining tasks are lost; we can see the commit here. I would try moving the commit out of the loop and see if it changes anything. I would make that change either way, since it batches the requests to the DB, which is better, but I think it might also resolve the issue. That said, I do not see expire_on_commit configured anywhere.
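
A minimal sketch of that suggestion (the helper and variable names below are placeholders, not the scheduler's actual code):

```python
from sqlalchemy.orm import Session


def requeue_or_fail_stuck_tis(stuck_tis: list, session: Session, handle_stuck_ti) -> None:
    """Process all stuck TIs against the same open session, then commit once."""
    for ti in stuck_tis:
        # Requeue or fail each stuck TI; no commit inside the loop.
        handle_stuck_ti(ti, session=session)
    # A single commit after the loop batches the DB writes instead of
    # committing per task instance.
    session.commit()
```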

@karenbraganz
Collaborator Author

@Nataneljpwd I will try moving it out of the loop and run the tests again.

@ashb
Member

ashb commented Jul 29, 2025

@karenbraganz A similar "detached instance error" was just fixed by Kaxil, but maybe we need another fix like #53838 (in a separate PR please)

@karenbraganz
Collaborator Author

I moved session.commit() outside the loop and re-ran the repeated tests overnight. The first 45 task instances got stuck and failed as expected, but the 46th task instance that got stuck experienced the DetachedInstanceError again.

I will take a look at Kaxil's fix next.

@karenbraganz
Collaborator Author

@ashb It looks like eager loading the attributes is the solution. I still need to implement and test this to confirm that it works.

Is there a reason why you want this in a separate PR? I would be eager loading attributes that are only used in this PR, so I thought it would be more appropriate to add that code in this PR itself.

For example, ti.dag_model.relative_fileloc is one of the attributes I would be eager loading, and this attribute is directly used in the code I am adding in this PR.
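
For illustration, a hedged SQLAlchemy sketch of what the eager loading could look like when the stuck TIs are selected (the relationship names dag_model and dag_run are assumptions about the model here, and the query shape is simplified):

```python
from sqlalchemy import select
from sqlalchemy.orm import joinedload

from airflow.models.taskinstance import TaskInstance
from airflow.utils.state import TaskInstanceState

# Eagerly load the relationships the callback path touches so they are
# populated before the TaskInstance can become detached from the session.
stuck_ti_stmt = (
    select(TaskInstance)
    .options(
        joinedload(TaskInstance.dag_model),
        joinedload(TaskInstance.dag_run),
    )
    .where(TaskInstance.state == TaskInstanceState.QUEUED)
)
```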

@ashb
Member

ashb commented Jul 29, 2025

@karenbraganz Ah, if this PR is where you first access them, then yes making them eager load here is probably the right thing to do.

However, one thing we need to consider is whether eager loading them will have a performance impact on a rarely used case (I don't have the context on this path or PR to say either way), and whether it might be better to load it from the DB in some other way only when needed.

@karenbraganz karenbraganz changed the title Call executor fail method for stuck in queued tasks Allow failure callbacks for stuck in queued TIs that fail Jul 29, 2025
@karenbraganz
Collaborator Author

Confirmed that eager loading resolves the issue. Working through the correct way to implement this since the attributes will only be used if the task is failed and has a failure callback.

@karenbraganz
Collaborator Author

I was able to find a workaround for the DetachedInstanceError that does not involve eager loading. Right before the TaskCallbackRequest is created (where the error occurs), I added a condition to check whether the ti is detached. If it is, it will be merged into the session. I found that this prevents the issue from occurring without having to eager load any attributes.

I have implemented this in my latest commit. Please let me know if there are any objections to this workaround.
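
A minimal sketch of that check-and-merge workaround (ti and session stand in for the scheduler's real objects; in the PR the check sits right before the TaskCallbackRequest is created):

```python
from sqlalchemy import inspect
from sqlalchemy.orm import Session


def ensure_attached(ti, session: Session):
    """Re-attach a TaskInstance to the session if it has become detached."""
    if inspect(ti).detached:
        # merge() returns an instance that is attached to this session.
        return session.merge(ti)
    return ti
```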

Member

@ashb ashb left a comment

LGTM. We'll merge this and backport it for 3.0.5 (it was too late for 3.0.4, and even though we are having an RC2, I don't want to introduce more code changes than we have to in order to get that release out).

@ashb ashb added the backport-to-v3-1-test Mark PR with this label to backport to v3-1-test branch label Aug 6, 2025
Contributor

@o-nikolas o-nikolas left a comment

Nice, this looks great now! Thanks for sticking with it :)

@ashb ashb force-pushed the tqt-callbacks-af3 branch from ee70e81 to 4e88a5b Compare August 12, 2025 09:22
@ashb ashb merged commit 6da77b1 into apache:main Aug 12, 2025
57 checks passed
@github-actions

Backport failed to create: v3-0-test.

You can attempt to backport this manually by running:

cherry_picker 6da77b1 v3-0-test

This should apply the commit to the v3-0-test branch and leave the commit in a conflict state, marking the files that need manual conflict resolution.

After you have resolved the conflicts, you can continue the backport process by running:

cherry_picker --continue

ashb pushed a commit that referenced this pull request Aug 12, 2025
…#53435)
(cherry picked from commit 6da77b1)

Co-authored-by: Karen Braganza <karenbraganza15@gmail.com>

ashb pushed a commit that referenced this pull request Aug 12, 2025
…#53435)
(cherry picked from commit 6da77b1)

Co-authored-by: Karen Braganza <karenbraganza15@gmail.com>

ashb pushed a commit that referenced this pull request Aug 13, 2025
…#53435)
(cherry picked from commit 6da77b1)

Co-authored-by: Karen Braganza <karenbraganza15@gmail.com>

kaxil pushed a commit that referenced this pull request Aug 13, 2025
…#53435) (#54401)
(cherry picked from commit 6da77b1)

Co-authored-by: Karen Braganza <karenbraganza15@gmail.com>