
Add a log message when a trigger is canceled for timeout #31720

Closed
1 task done
dstandish opened this issue Jun 5, 2023 · 8 comments · Fixed by #31757
Assignees
Labels
affected_version:main_branch Issues Reported for main branch area:async-operators AIP-40: Deferrable ("Async") Operators kind:bug This is a clearly a bug

Comments

@dstandish
Contributor

Body

The trigger log doesn't show that a trigger timed out when it is canceled due to timeout.

We should try to see if we can add a log message that would show up in the right place. If we emit it from the trigger process, it might show up out of order.

But then again, if we ultimately don't need to go back to the task, that would not be a problem.

Additionally, if we ultimately can "log from anywhere", then again this would provide a clean solution.

This came up in PR discussion here: #30853 (comment)

The relevant trigger code is here: https://github.com/apache/airflow/blob/main/airflow/jobs/triggerer_job_runner.py#L598-L619

I think we could add logic so that when we receive a cancelled error (which could happen for a few different reasons), we log the reason for the cancellation. I think we could just add an except CancelledError and then log the reason. We might also need to update the code at the location where we actually initiate the cancellation so that it provides sufficient information for the log message.
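
For illustration, here's a minimal sketch of what that could look like; the wrapper coroutine, the bookkeeping, and the log wording below are hypothetical, not the actual code in triggerer_job_runner.py:

import asyncio
import logging

log = logging.getLogger("triggerer")
events = []

# Hypothetical wrapper around a single trigger's run() call; the real
# triggerer code is structured differently, this is just the shape of the idea.
async def run_trigger(trigger):
    try:
        async for event in trigger.run():
            events.append((trigger, event))  # hypothetical bookkeeping
    except asyncio.CancelledError:
        # Log from the trigger side so the message lands in the trigger's
        # part of the log instead of only at the very end of the task log.
        log.info("Trigger %s was cancelled", trigger)
        raise  # re-raise so the asyncio task is still marked as cancelled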

cc @syedahsn @phanikumv @jedcunningham @pankajastro

Committer

  • I acknowledge that I am a maintainer/committer of the Apache Airflow project.
@dstandish dstandish added the kind:meta High-level information important to the community label Jun 5, 2023
@Lee-W
Member

Lee-W commented Jun 6, 2023

If no one's working on it, I'd like to work on it. Thanks 🙂

@Lee-W
Member

Lee-W commented Jun 6, 2023

I'd like to clarify what we're expecting. Following #30853 (comment), we want to make the code

https://github.com/apache/airflow/pull/30853/files#diff-773b6f4e9d2870f384d4e4e60763e8bedecfdb9020bf737b6557f0b9d0ca74a9R207-R210

work like the following?

        if attempt >= int(self.max_attempts):
            self.log.error("Pause Cluster Failed - max attempts reached.")
            yield TriggerEvent(
                {"status": "failure", "message": "Pause Cluster Failed - max attempts reached."}
            )

And also, we'd want to handle all other canceled cases, not just timeout?

please let me know if I misunderstood anything, thanks!

@dstandish
Contributor Author

OK so here's the message shared in that ticket:

[2023-05-08, 02:19:30 UTC] {credentials.py:617} INFO - Found credentials in shared credentials file: ~/.aws/credentials
[2023-05-08, 02:19:38 UTC] {redshift_cluster.py:203} INFO - Status of cluster is pausing
[2023-05-08, 02:19:57 UTC] {redshift_cluster.py:203} INFO - Status of cluster is pausing
[2023-05-08, 02:20:15 UTC] {redshift_cluster.py:203} INFO - Status of cluster is pausing
[2023-05-08, 02:20:28 UTC] {taskinstance.py:1144} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: example_redshift.pause_cluster manual__2023-05-08T02:11:38.809301+00:00 [queued]>
[2023-05-08, 02:20:28 UTC] {taskinstance.py:1144} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: example_redshift.pause_cluster manual__2023-05-08T02:11:38.809301+00:00 [queued]>
[2023-05-08, 02:20:28 UTC] {taskinstance.py:1343} INFO - Resuming after deferral
[2023-05-08, 02:20:28 UTC] {taskinstance.py:1364} INFO - Executing <Task(RedshiftPauseClusterOperator): pause_cluster> on 2023-05-08 02:11:38.809301+00:00
[2023-05-08, 02:20:28 UTC] {standard_task_runner.py:57} INFO - Started process 328 to run task
[2023-05-08, 02:20:28 UTC] {standard_task_runner.py:84} INFO - Running: ['***', 'tasks', 'run', 'example_redshift', 'pause_cluster', 'manual__2023-05-08T02:11:38.809301+00:00', '--job-id', '17', '--raw', '--subdir', 'DAGS_FOLDER/example_redshift.py', '--cfg-path', '/tmp/tmp582aoom9']
[2023-05-08, 02:20:28 UTC] {standard_task_runner.py:85} INFO - Job 17: Subtask pause_cluster
[2023-05-08, 02:20:28 UTC] {task_command.py:410} INFO - Running <TaskInstance: example_redshift.pause_cluster manual__2023-05-08T02:11:38.809301+00:00 [running]> on host 9ed7c7091244
[2023-05-08, 02:20:28 UTC] {taskinstance.py:1863} ERROR - Task failed with exception
airflow.exceptions.TaskDeferralError: Trigger/execution timeout
[2023-05-08, 02:20:28 UTC] {taskinstance.py:1382} INFO - Marking task as FAILED. dag_id=example_redshift, task_id=pause_cluster, execution_date=20230508T021138, start_date=20230508T021920, end_date=20230508T022028

Notice that after the last "Status of cluster is pausing" message, there's nothing indicating that anything went wrong with the trigger (until the very end of the log).

So the author of that PR pointed out that it sorta makes it look like the trigger completed successfully and the task resumed.

The goal of this PR is to improve that somewhat.

Initially I was thinking we would be able to add a log message right there such as

[2023-05-08, 02:20:15 UTC] {redshift_cluster.py:203} INFO - Trigger cancelled due to timeout

or

[2023-05-08, 02:20:15 UTC] {redshift_cluster.py:203} INFO - Trigger cancelled: running elsewhere

Looking again, that might actually be more complicated and tougher to accomplish. At a minimum though, I think it would be good to add a log message here.

Something such as

except CancelledError:
    self.log.info("Trigger cancelled")

This alone would be an improvement.


I think in order to get more information than that, we obviously have to somehow get the cancellation reason forwarded to the location above.

One way we can do that is through a msg passed to task.cancel (the msg argument is supported since Python 3.9). Here's some sample code demonstrating that:

import asyncio

async def cancel_me():
    print('cancel_me(): before sleep')

    try:
        # Wait for 1 hour
        await asyncio.sleep(3600)
    except asyncio.CancelledError as e:
        print(f"cancel_me was cancelled: {e.args[0]!r}")
        raise
    finally:
        print('cancel_me(): after sleep')

async def main():
    # Create a "cancel_me" Task
    task = asyncio.create_task(cancel_me())

    # Wait for 1 second
    await asyncio.sleep(1)

    task.cancel("this is why")
    try:
        await task
    except asyncio.CancelledError:
        print("main(): cancel_me is cancelled now")

asyncio.run(main())

So that would take us to the place where we actually initiate cancellation. But at that location, we don't currently know why we are cancelling, so we'd have to figure out some way to get access to the needed information there.
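
To make that concrete, here's a small self-contained example shaped like the trigger/timeout case; everything in it (the fake trigger, the logger name, the timeout value, the log wording) is made up for illustration:

import asyncio
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("triggerer")

# Stand-in for a trigger's polling loop (not Airflow code).
async def fake_trigger_run():
    try:
        while True:
            log.info("still waiting on the external service...")
            await asyncio.sleep(1)
    except asyncio.CancelledError as err:
        # The reason passed to task.cancel() arrives in err.args (Python 3.9+).
        reason = err.args[0] if err.args else "unknown"
        log.info("Trigger cancelled; reason: %s", reason)
        raise

async def main():
    task = asyncio.create_task(fake_trigger_run())
    await asyncio.sleep(2.5)
    # The code that enforces the timeout would pass the reason along here.
    task.cancel("Trigger timeout exceeded")
    try:
        await task
    except asyncio.CancelledError:
        pass

asyncio.run(main())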


So, it's up to you, maybe you want to just get the quick win with the first option. Maybe you want to go for the more detailed solution.

@dstandish
Contributor Author

And to clarify your specific question re max attempts -- we don't want to do anything with max attempts here. The author of that PR was citing the logging deficiency as a reason why they wanted to monitor max attempts in the trigger -- because otherwise, they would get no messaging that the trigger was cancelled until the very end of the logs. We can remedy that here by adding a message about the cancellation at the time when the task is actually cancelled. Then there would be no need to worry about such messaging concerns in the trigger itself.

@syedahsn
Contributor

syedahsn commented Jun 6, 2023

This looks like a really good idea, and I think even the basic option creates a much better user experience. My question is about tracking the number of attempts. Is that inherently something we want to avoid? The current implementation uses both a timer as well as the number of attempts as a metric for determining how long a Trigger should run for. Isn't that better than having just the timer?
Also, would it be possible to yield a TriggerEvent (with an appropriate message) in the case of a CancelledError? That would be the most "natural" way to exit out of the Trigger.

@dstandish
Contributor Author

dstandish commented Jun 6, 2023

My question is about tracking the number of attempts. Is that inherently something we want to avoid? The current implementation uses both a timer as well as the number of attempts as a metric for determining how long a Trigger should run for. Isn't that better than having just the timer?

To clarify, it's not like it's my "secret master plan" to get rid of the tracking of attempts that you added. Let sleeping dogs lie. And it's definitely not something I would veto anyway.

Mainly it's just that your PR called my attention to this deficiency in logging. And we can improve it by adding a logging message.

But then perhaps in future work there won't be a need to add such logic.

Now, do we want to avoid adding max_attempts to the trigger? I'm sure opinions differ on this. Personally, I don't like it because it's redundant, potentially misleading, and just doesn't seem to me to have a reason to be there.

Maybe if we had this better logging when you were working on that PR, you would not have added the "max attempts" logic but instead simply counted the attempts and printed them to log... e.g.

counter = 0
while True:
    counter += 1
    self.log.info("checking service; attempt %s", counter)
    ...  # poll the service; yield a TriggerEvent and return when it's done

As discussed on the other PR, the way triggers are designed, enforcing max_attempts is somewhat redundant and, in special cases, misleading.

Come to think of it, it's also somewhat problematic when you think about what it means if you do some kind of exponential backoff. So, compared with max_attempts, an absolute timeout (which is already part of the interface) is simpler and more reliable / well-defined. So I wouldn't recommend using max_attempts on any new sensors. But of course, that's just me.
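
To put rough numbers on the backoff point, here's a quick back-of-the-envelope calculation; the 30-second base delay and the doubling are purely illustrative:

# With exponential backoff, the same max_attempts value maps to wildly
# different amounts of wall-clock time, which is why an absolute timeout
# is easier to reason about.
base_delay = 30  # seconds before the second attempt (illustrative)

for max_attempts in (5, 10, 15):
    # total sleep before giving up: 30 + 60 + 120 + ... = 30 * (2**n - 1)
    total_seconds = base_delay * (2 ** max_attempts - 1)
    print(f"{max_attempts} attempts -> ~{total_seconds / 3600:.1f} hours")
# prints roughly 0.3, 8.5, and 273 hours respectively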

@syedahsn
Contributor

syedahsn commented Jun 7, 2023

To clarify, it's not like it's my "secret master plan" to get rid of the tracking of attempts that you added

xD

Maybe if we had this better logging when you were working on that PR, you would not have added the "max attempts" logic but instead simply counted the attempts and printed them to log... e.g.

This is true - if the Trigger was more obvious about timing out, then I would have ignored the number of attempts. One of the main benefits I see of using the number of attempts is that it allows us to yield a TriggerEvent in case the Trigger times out (measured indirectly as exceeding the number of attempts). If it is possible to keep that behavior, that would be ideal.

@Lee-W
Member

Lee-W commented Jun 7, 2023

So, it's up to you, maybe you want to just get the quick win with the first option. Maybe you want to go for the more detailed solution.

Thanks for providing all the details 🙏 I think I'll go with the first option first (created PR #31757) and then take a deeper look afterward to see how I can add the more detailed solution.

@eladkal eladkal added kind:bug This is a clearly a bug area:async-operators AIP-40: Deferrable ("Async") Operators affected_version:main_branch Issues Reported for main branch and removed kind:meta High-level information important to the community labels Jun 10, 2023
Lee-W added a commit to astronomer/airflow that referenced this issue Jun 16, 2023
pankajkoti added a commit to astronomer/airflow that referenced this issue Aug 29, 2023
We tried to address apache#31720 in PR apache#31757. The issue talked about
adding a trigger cancelled log when the trigger times out, but we also
added a generic "Trigger cancelled" log. This log appears even in
successful runs of the triggers when they finish. This is confusing
some users, as the log level is Error and there are sometimes quite
a few log lines saying "Trigger cancelled; err=" giving no clue as to
what is happening.
So, I am removing this generic error log line, and we can add specific
cancel scenarios with detailed reasons when we implement those.
potiuk pushed a commit that referenced this issue Aug 29, 2023
We tried to address #31720 in PR #31757. The issue talked about
adding a trigger cancelled log when the trigger times out, but we also
added a generic "Trigger cancelled" log. This log appears even in
successful runs of the triggers when they finish. This is confusing
some users, as the log level is Error and there are sometimes quite
a few log lines saying "Trigger cancelled; err=" giving no clue as to
what is happening.
So, I am removing this generic error log line, and we can add specific
cancel scenarios with detailed reasons when we implement those.