-
Notifications
You must be signed in to change notification settings - Fork 16.3k
Allow failure callbacks for stuck in queued TIs that fail #53435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Nataneljpwd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Quick and simple solution which addresses the issue well
potiuk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - but others who are more executor-oriented can verify it @ashb @amoghrajesh @o-nikolas ?
|
@karenbraganz it could be due to the expundge happening here This detaches all the instance objects from the session, I couldn't follow the flow as I did not have time but I think it might be related, or at least it is a good place to start investigating from. |
|
@Nataneljpwd The session being expunged in the link you provided is not the same session being used in |
|
@karenbraganz, I think that the issue might be that if there is more than 1 stuck task, it commits before it changed or requed all tasks, and so the other tasks are lost, and we can see the commit here, I would try moving the commit out of the loop and see if it changes anything, nevertheless, I would do it either way as we batch requests to the db that way which is better, but I think it might resolve the issue, thoug, I do not see that expire on commit is configured anywhere |
|
@Nataneljpwd I will try moving it out of the loop and run the tests again. |
|
@karenbraganz A similar "detached instance error" was just fixed by Kaxil, but maybe we need another fix like #53838 (in a separate PR please) |
|
I moved I will take a look at Kaxil's fix next. |
|
@ashb looks like eager loading the attributes is the solution. Still need to implement and test this out to confirm that it works. Is there a reason why you want this in a separate PR? I would be eager loading attributes that are only used in this PR, so I thought it would be more appropriate to add that code in this PR itself. For example, |
|
@karenbraganz Ah, if this PR is where you first access them, then yes making them eager load here is probably the right thing to do. However one thing we need to consider is if eagerloading them is going to have performance impact on a rarely used case (I don't have the context on this path or PR to say either way), and if it might better to load it from the DB in some other way only when needed. |
|
Confirmed that eager loading resolves the issue. Working through the correct way to implement this since the attributes will only be used if the task is failed and has a failure callback. |
|
I was able to find a workaround for the I have implemented this in my latest commit. Please let me know if there are any objections to this workaround. |
ashb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. We'll merge this and backport it for 3.0.5 (Was too late for 3.0.4, and even though we are having an RC2 I don't want to introduce more code changes than we have to in order to get that release out)
o-nikolas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this looks great now! Thanks for sticking with it :)
ee70e81 to
4e88a5b
Compare
Backport failed to create: v3-0-test. View the failure log Run details
You can attempt to backport this manually by running: cherry_picker 6da77b1 v3-0-testThis should apply the commit to the v3-0-test branch and leave the commit in conflict state marking After you have resolved the conflicts, you can continue the backport process by running: cherry_picker --continue |
…#53435) In issue #51301, it was reported that failure callbacks do not run for task instances that get stuck in queued and fail in Airflow 2.10.5. This is happening due to the changes introduced in PR #43520 . In this PR, logic was introduced to requeue tasks that get stuck in queued (up to two times by default) before failing them. Previously, the executor's fail method was called when the task needed to be failed after max requeue attempts. This was replaced by the task instance's set_state method in the PR ti.set_state(TaskInstanceState.FAILED, session=session). Without the executor's fail method being called, failure callbacks will not be executed for such task instances. Therefore, I changed the code to call the executor's fail method instead in Airflow 3. (cherry picked from commit 6da77b1) Co-authored-by: Karen Braganza <karenbraganza15@gmail.com>
…#53435) In issue #51301, it was reported that failure callbacks do not run for task instances that get stuck in queued and fail in Airflow 2.10.5. This is happening due to the changes introduced in PR #43520 . In this PR, logic was introduced to requeue tasks that get stuck in queued (up to two times by default) before failing them. Previously, the executor's fail method was called when the task needed to be failed after max requeue attempts. This was replaced by the task instance's set_state method in the PR ti.set_state(TaskInstanceState.FAILED, session=session). Without the executor's fail method being called, failure callbacks will not be executed for such task instances. Therefore, I changed the code to call the executor's fail method instead in Airflow 3. (cherry picked from commit 6da77b1) Co-authored-by: Karen Braganza <karenbraganza15@gmail.com>
…#53435) In issue #51301, it was reported that failure callbacks do not run for task instances that get stuck in queued and fail in Airflow 2.10.5. This is happening due to the changes introduced in PR #43520 . In this PR, logic was introduced to requeue tasks that get stuck in queued (up to two times by default) before failing them. Previously, the executor's fail method was called when the task needed to be failed after max requeue attempts. This was replaced by the task instance's set_state method in the PR ti.set_state(TaskInstanceState.FAILED, session=session). Without the executor's fail method being called, failure callbacks will not be executed for such task instances. Therefore, I changed the code to call the executor's fail method instead in Airflow 3. (cherry picked from commit 6da77b1) Co-authored-by: Karen Braganza <karenbraganza15@gmail.com>
…#53435) (#54401) In issue #51301, it was reported that failure callbacks do not run for task instances that get stuck in queued and fail in Airflow 2.10.5. This is happening due to the changes introduced in PR #43520 . In this PR, logic was introduced to requeue tasks that get stuck in queued (up to two times by default) before failing them. Previously, the executor's fail method was called when the task needed to be failed after max requeue attempts. This was replaced by the task instance's set_state method in the PR ti.set_state(TaskInstanceState.FAILED, session=session). Without the executor's fail method being called, failure callbacks will not be executed for such task instances. Therefore, I changed the code to call the executor's fail method instead in Airflow 3. (cherry picked from commit 6da77b1) Co-authored-by: Karen Braganza <karenbraganza15@gmail.com>

In issue #51301, it was reported that failure callbacks do not run for task instances that get stuck in queued and fail in Airflow 2.10.5. This is happening due to the changes introduced in PR #43520 . In this PR, logic was introduced to requeue tasks that get stuck in queued (up to two times by default) before failing them.
Previously, the executor's fail method was called when the task needed to be failed after max requeue attempts. This was replaced by the task instance's set_state method in the PR
ti.set_state(TaskInstanceState.FAILED, session=session). Without the executor's fail method being called, failure callbacks will not be executed for such task instances. Therefore, I changed the code to call the executor's fail method instead in Airflow 3.I have created PR #53038 to address this issue separately in Airflow 2.11.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rstor{issue_number}.significant.rst, in airflow-core/newsfragments.