-
Notifications
You must be signed in to change notification settings - Fork 14.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a log message when a trigger is canceled for timeout #31720
Comments
If no one's working on it, I'd like to work on it. Thanks 🙂 |
I'd like to clarify what we're expecting. Following #30853 (comment), we want to make the code works like the following? if attempt >= int(self.max_attempts):
self.log.error("Pause Cluster Failed - max attempts reached.")
yield TriggerEvent(
{"status": "failure", "message": "Pause Cluster Failed - max attempts reached."}
) and also, we'd want to handle all other canceled cases, not just timeout please let me know if I misunderstood anything, thanks! |
OK so here's the message shared in that ticket:
Notice that after the last So the author of that PR pointed out that, it sorta makes it look like the trigger completed successfully and the task resumed. The goal is of this PR is to improve that somewhat. Initially I was thinking we would be able to add a log message right there such as
or
Looking again, that might actually be more complicated and tougher to accomplish. At a minimum though, I think it would be good to add a log message here. Something such as
This alone would be an improvement. I think in order to get more information than that, we obviously have somehow get the cancellation reason reason forwarded to the location above. One way we can do that is through a
So that would take us to here, where we actually initiate cancellation. But at that location, we don't currently know why we are cancelling. So we'd have to figure out some way to get access there to the needed information. So, it's up to you, maybe you want to just get the quick win with the first option. Maybe you want to go for the more detailed solution. |
And to clarify your specific question re max attempts -- we don't want to do anything with max attempts here. The author of that PR was citing the logging deficiency as a reason why they wanted to monitor max attempts in the trigger -- because otherwise, they would get no messaging that the trigger was cancelled until the very end of the logs. We can remedy that here by adding a message about the cancellation at the time when the task is actually cancelled. Then there would be no need to worry about such messaging concerns in the trigger itself. |
This looks like a really good idea, and I think even the basic option creates a much better user experience. My question is about tracking the number of attempts. Is that inherently something we want to avoid? The current implementation uses both a timer as well as the number of attempts as a metric for determining how long a Trigger should run for. Isn't that better than having just the timer? |
To clarify, it's not like it's my "secret master plan" to get rid of the tracking of attempts that you added. Let sleeping dogs lie. And it's definitely not something I would veto anyway. Mainly it's just that your PR called my attention to this deficiency in logging. And we can improve it by adding a logging message. But then perhaps in future work there won't be a need to add such logic. Now, do we want to avoid it adding Maybe if we had this better logging when you were working on that PR, you would not have added the "max attempts" logic but instead simply counted the attempts and printed them to log... e.g.
As discussed on the other PR, the way triggers are designed, enforcing max_attempts is somewhat redundant and, in special cases, misleading. Come to think of it, it's also somewhat problematic when you think about what it means if you do some kind of exponential backoff. So, compared with max_attempts, an absolute timeout (which is already part of the interface) is simpler and more reliable / well-defined. So I wouldn't recommend using max_attempts on any new sensors. But of course, that's just me. |
xD
This is true - if the Trigger was more obvious about timing out, then I would have ignored the number of attempts. One of the main benefits I see of using the number of attempts is that it allows us to yield a |
Thanks for providing all the details 🙏 I think I'll go with the first option first (created a PR #31757) and then take a deeper look afterward to see how I can add it |
We tried to address apache#31720 in PR apache#31757. The issue talked about adding a trigger cancelled log when trigger timesout, but we also added a generic Trigger canceled log. This log appears even in successful runs of the triggers when they finish. This is confusing some users as the log level is Error and there are sometimes quite a few log lines saying "Trigger cancelled; err=" with giving no clue as to what is happening. So, I am removing this generic error log line and we can add specific cancel scenarios with detailed reasons when we implement those.
We tried to address #31720 in PR #31757. The issue talked about adding a trigger cancelled log when trigger timesout, but we also added a generic Trigger canceled log. This log appears even in successful runs of the triggers when they finish. This is confusing some users as the log level is Error and there are sometimes quite a few log lines saying "Trigger cancelled; err=" with giving no clue as to what is happening. So, I am removing this generic error log line and we can add specific cancel scenarios with detailed reasons when we implement those.
Body
The trigger log doesn't show that a trigger timed out when it is canceled due to timeout.
We should try to see if we can add a log message that would show up in the right place. If we emit it from the trigger process, it might show up out of order.
But then again, if we ultimately don't need to go back to the task, that would not be a problem.
Additionally if we ultimately can "log from anywhere" then again, this would provide a clean solution.
This came up in PR discussion here: #30853 (comment)
The relevant trigger code is here: https://github.com/apache/airflow/blob/main/airflow/jobs/triggerer_job_runner.py#L598-L619
I think we could add logic so that when we receive a cancelled error (which could be for a few different reasons) then we can log the reason for the cancellation. I think we could just add an
except CancelledError
and then log the reason. We might need also to update the code in the location where we actually initiate the cancellation to add sufficient information for the log message.cc @syedahsn @phanikumv @jedcunningham @pankajastro
Committer
The text was updated successfully, but these errors were encountered: