Send explicit task logs when marking tasks stuck in queued as failed #35857

Merged
pankajkoti merged 2 commits into apache:main from tcl-task-stuck-in-queued on Nov 26, 2023

pankajkoti
Member

Using the feature built in #32646, when the scheduler marks
tasks stuck in queued as failed, also send an explicit message
describing the action to the task logs, so that users can
identify exactly why the task was marked failed in such a case.
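
As a rough illustration of the behaviour described above, here is a minimal, self-contained sketch. It does not use Airflow's actual classes: `TaskInstanceSketch`, `fail_tasks_stuck_in_queued`, the `log_lines` list, and the timeout handling are all hypothetical stand-ins. Only the message text is taken from this PR.

```python
import logging
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

# Message text as it appears in this PR's diff.
STUCK_MESSAGE = (
    "Marking task instance %s stuck in queued as failed. "
    "If the task instance has available retries, it will be retried."
)


@dataclass
class TaskInstanceSketch:
    """Hypothetical stand-in for a task instance; not Airflow's model."""

    task_id: str
    queued_at: datetime
    state: str = "queued"
    # Stand-in for the task's own log, as a user would see it in the UI.
    log_lines: List[str] = field(default_factory=list)


def fail_tasks_stuck_in_queued(tis, now, timeout, scheduler_log):
    """Fail tasks queued for longer than `timeout` and explain why in both logs."""
    failed = []
    for ti in tis:
        if ti.state == "queued" and now - ti.queued_at > timeout:
            # Scheduler-side log line (what existed before this change).
            scheduler_log.warning(STUCK_MESSAGE, ti.task_id)
            # Explicit line in the task's own log (the idea of this change).
            ti.log_lines.append(STUCK_MESSAGE % ti.task_id)
            ti.state = "failed"
            failed.append(ti)
    return failed
```

In this sketch, a user inspecting the failed task's log would find the reason there, without having to search the scheduler logs.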


@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Nov 26, 2023
@pankajkoti
Member Author

[Screenshot: 2023-11-26 at 12:23:20 PM]

@pankajkoti
Member Author

cc: @RNHTTR @vatsrahul1001

@pankajkoti pankajkoti added this to the Airflow 2.8.0 milestone Nov 26, 2023
"Marking task instance %s stuck in queued as failed. "
"If the task instance has available retries, it will be retried.",
ti,
ti=ti,
Contributor

Looks good but we would lose this information if the user disables the feature. I think both scheduler and task should have the information for the time being.
Another thing that worries me is the performance implication but in theory, I don't think there would be many tasks stuck in queued state.

Member Author

We won't lose this information. Even if the user disables the feature, the TCL (TaskContextLogger) would still log this information in the scheduler itself, using the call site's logger: https://github.com/apache/airflow/pull/32646/files#diff-fb48bd1344270ccbaadb60b2b7fbc5d74bb5440f908eedd384bd25ada648c05dR91

Yes, we can test the performance. If it hurts performance badly, we can disable the feature and still have the logs as mentioned above.
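
To make the fallback concrete, here is a simplified sketch of the behaviour described in this comment. `TaskContextLoggerSketch`, its `enabled` flag, and the in-memory `task_logs` mapping are hypothetical stand-ins, not the real TCL implementation from #32646. The assumption it illustrates is that the call site logger is always written to, while the copy sent to the task log depends on the feature being enabled.

```python
import logging
from collections import defaultdict


class TaskContextLoggerSketch:
    """Hypothetical, simplified stand-in for the TCL discussed in this thread."""

    def __init__(self, call_site_logger, enabled=True):
        # Logger of the component that owns this instance (e.g. the scheduler).
        self.call_site_logger = call_site_logger
        # Assumed on/off switch for forwarding messages to task logs.
        self.enabled = enabled
        # In-memory stand-in for per-task log storage, keyed by task identifier.
        self.task_logs = defaultdict(list)

    def warning(self, msg, *args, ti):
        # Always log at the call site, so the scheduler log keeps the message.
        self.call_site_logger.warning(msg, *args)
        if not self.enabled:
            return
        # Only when the feature is enabled does the task log get a copy.
        self.task_logs[ti].append(msg % args)
```

With `enabled=False`, the message still reaches the scheduler's logger and nothing is written to `task_logs`, which mirrors the claim that disabling the feature does not lose the information.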

Contributor

Nice

Member Author

And the call site logger is set from the scheduler component when initialising the TCL instance here: https://github.com/apache/airflow/pull/32646/files#diff-b0491913f69327937706aea8fc77a71efeb979897898e405ade2b162ad862476R239

Contributor

It might be a good idea to log both separately. One reason is clarity. The other is that the information you need is slightly different in the two contexts. E.g. in the task instance log, the message doesn't need to reference the task instance details, because they're implied by the context. But these are things we can tweak later since everything is private.
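
One way to picture this suggestion is a helper that builds two differently worded messages. This is only a sketch: `stuck_in_queued_messages` is a hypothetical name, and the task-log wording is invented for illustration. Only the scheduler-side wording comes from this PR.

```python
from typing import Tuple


def stuck_in_queued_messages(ti_repr: str) -> Tuple[str, str]:
    """Return (scheduler_message, task_log_message) for one task instance."""
    # The scheduler log covers many tasks, so it names the task instance.
    scheduler_msg = (
        f"Marking task instance {ti_repr} stuck in queued as failed. "
        "If the task instance has available retries, it will be retried."
    )
    # The task log already belongs to one task instance, so the identity is
    # implied by context. This wording is hypothetical, not from the PR.
    task_msg = (
        "Task was stuck in queued and has been marked as failed. "
        "If retries are available, it will be retried."
    )
    return scheduler_msg, task_msg
```

Keeping the two strings separate would let each be tuned for its audience without affecting the other.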

Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
@pankajkoti pankajkoti added the use public runners Makes sure that Public runners are used even if commiters creates the PR (useful for testing) label Nov 26, 2023
@pankajkoti pankajkoti closed this Nov 26, 2023
@pankajkoti pankajkoti reopened this Nov 26, 2023
@pankajkoti pankajkoti merged commit c7e1306 into apache:main Nov 26, 2023
53 of 92 checks passed
@pankajkoti pankajkoti deleted the tcl-task-stuck-in-queued branch November 26, 2023 09:19
@ephraimbuddy ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Dec 5, 2023
3 participants