Skip to content

Conversation

@karenbraganz
Copy link
Collaborator

@karenbraganz karenbraganz commented Jul 8, 2025

closes: #51301
In issues #51301, it was reported that failure callbacks do not run for task instances that get stuck in queued and fail in Airflow 2.10.5. This is happening due to the changes introduced in PR #43520. In this PR, logic was introduced to requeue tasks that get stuck in queued (up to two times by default) before failing them.

Previously, the executor's fail method was called when the task needed to be failed after max requeue attempts. This was replaced by the task instance's set_state method in the PR ti.set_state(TaskInstanceState.FAILED, session=session). Without the executor's fail method being called, failure callbacks will not be executed for such task instances. Therefore, I changed the code to call the executor's fail method instead.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@karenbraganz karenbraganz self-assigned this Jul 8, 2025
@kaxil
Copy link
Member

kaxil commented Jul 10, 2025

tests are failing:

FAILED tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_handle_stuck_queued_tasks_multiple_attempts - AssertionError: assert equals failed
  [            [           
    'queued',    'failed', 
    'queued',    'failed', 
  ]            ]
FAILED tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_handle_stuck_queued_tasks_reschedule_sensors - AssertionError: Expected 'fail' to not have been called. Called 2 times.
Calls: [call(TaskInstanceKey(dag_id='test_fail_stuck_queued_tasks_multiple_executors', task_id='op1', run_id='45e7135c-8e2c-4e07-8d35-4bea8c0232d6', try_number=0, map_index=-1)),
 call(TaskInstanceKey(dag_id='test_fail_stuck_queued_tasks_multiple_executors', task_id='op2', run_id='45e7135c-8e2c-4e07-8d35-4bea8c0232d6', try_number=0, map_index=-1))].

pytest introspection follows:

Args:
assert equals failed
  (                                ()                              
    TaskInstanceKey(dag_id='test_                                  
  fail_stuck_queued_tasks_multipl                                  
  e_executors', task_id='op2', ru                                  
  n_id='45e7135c-8e2c-4e07-8d35-4                                  
  bea8c0232d6', try_number=0, map                                  
  _index=-1),                                                      
  )
=========== 2 failed, 2035 passed, 959 skipped in 655.53s (0:10:55) ============

@eladkal eladkal added this to the Airflow 2.11.1 milestone Jul 11, 2025
@karenbraganz
Copy link
Collaborator Author

@kaxil all tests are passing now.

ti=ti,
session=session,
)
self._maybe_requeue_stuck_ti(ti=ti, session=session, executor=executor)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't do this in main, do you think that should be updated too? or is it 2.* specific? If latter, why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is implemented in separate PR #53435 @kaxil ..

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - providing that #53435 will also be merged :)

@kaxil
Copy link
Member

kaxil commented Jul 22, 2025

LGTM - providing that #53435 will also be merged :)

Yeah, let's get that one merged first

@karenbraganz karenbraganz changed the title [v2-11-test] Call executor fail method for stuck in queued tasks [v2-11-test] Allow failure callbacks for stuck in queued TIs that fail Aug 8, 2025
@eladkal
Copy link
Contributor

eladkal commented Sep 18, 2025

I see #53435 is merged

@karenbraganz
Copy link
Collaborator Author

@eladkal I had to make some changes there. Let me see if the changes are needed here as well.

@potiuk
Copy link
Member

potiuk commented Oct 13, 2025

@eladkal I had to make some changes there. Let me see if the changes are needed here as well.

@karenbraganz -> did you check it ?

@karenbraganz
Copy link
Collaborator Author

@potiuk There are a few changes that I need to make. Working on the changes now.

@karenbraganz
Copy link
Collaborator Author

I still need to run some tests to ensure my changes work as expected.

@karenbraganz
Copy link
Collaborator Author

Just need to figure out why one of the checks is failing. I think it might have something to do with the rich-click package version being used.

Everything else is completed.

@karenbraganz karenbraganz added the legacy ui Whether legacy UI change should be allowed in PR label Oct 27, 2025
@karenbraganz
Copy link
Collaborator Author

karenbraganz commented Oct 27, 2025

I had to make changes to airflow/www/static/js/types/api-generated.ts because one of the CI hooks was failing. My PR is not related to the UI.

@karenbraganz
Copy link
Collaborator Author

Going to close and re-open the PR to re-trigger tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

legacy ui Whether legacy UI change should be allowed in PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants