
Add a log message when a trigger is canceled for timeout #31720

Closed
1 task done
dstandish opened this issue Jun 5, 2023 · 8 comments · Fixed by #31757
Assignees
Labels
affected_version:main_branch Issues Reported for main branch area:async-operators AIP-40: Deferrable ("Async") Operators kind:bug This is a clearly a bug

Comments

@dstandish
Contributor

Body

The trigger log doesn't show that a trigger timed out when it is canceled due to timeout.

We should try to see if we can add a log message that would show up in the right place. If we emit it from the trigger process, it might show up out of order.

But then again, if we ultimately don't need to go back to the task, that would not be a problem.

Additionally, if we ultimately can "log from anywhere", then again this would provide a clean solution.

This came up in PR discussion here: #30853 (comment)

The relevant trigger code is here: https://github.com/apache/airflow/blob/main/airflow/jobs/triggerer_job_runner.py#L598-L619

I think we could add logic so that when we receive a cancelled error (which could happen for a few different reasons), we log the reason for the cancellation. I think we could just add an except CancelledError and then log the reason. We might also need to update the code at the location where we actually initiate the cancellation so that it provides sufficient information for the log message.
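
For illustration, here's a minimal sketch of what that could look like; the wrapper coroutine, the bookkeeping, and the log wording below are hypothetical, not the actual code in triggerer_job_runner.py:

import asyncio
import logging

log = logging.getLogger("triggerer")
events = []

# Hypothetical wrapper around a single trigger's run() call; the real
# triggerer code is structured differently, this is just the shape of the idea.
async def run_trigger(trigger):
    try:
        async for event in trigger.run():
            events.append((trigger, event))  # hypothetical bookkeeping
    except asyncio.CancelledError:
        # Log from the trigger side so the message lands in the trigger's
        # part of the log instead of only at the very end of the task log.
        log.info("Trigger %s was cancelled", trigger)
        raise  # re-raise so the asyncio task is still marked as cancelled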

cc @syedahsn @phanikumv @jedcunningham @pankajastro

Committer

  • I acknowledge that I am a maintainer/committer of the Apache Airflow project.
@dstandish dstandish added the kind:meta High-level information important to the community label Jun 5, 2023
@Lee-W
Member

Lee-W commented Jun 6, 2023

If no one's working on it, I'd like to work on it. Thanks 🙂

@Lee-W
Member

Lee-W commented Jun 6, 2023

I'd like to clarify what we're expecting. Following #30853 (comment), we want to make the code

https://github.com/apache/airflow/pull/30853/files#diff-773b6f4e9d2870f384d4e4e60763e8bedecfdb9020bf737b6557f0b9d0ca74a9R207-R210

work like the following?

        if attempt >= int(self.max_attempts):
            self.log.error("Pause Cluster Failed - max attempts reached.")
            yield TriggerEvent(
                {"status": "failure", "message": "Pause Cluster Failed - max attempts reached."}
            )

And also, we'd want to handle all other canceled cases, not just timeout?

please let me know if I misunderstood anything, thanks!

@dstandish
Contributor Author

OK so here's the message shared in that ticket:

[2023-05-08, 02:19:30 UTC] {credentials.py:617} INFO - Found credentials in shared credentials file: ~/.aws/credentials
[2023-05-08, 02:19:38 UTC] {redshift_cluster.py:203} INFO - Status of cluster is pausing
[2023-05-08, 02:19:57 UTC] {redshift_cluster.py:203} INFO - Status of cluster is pausing
[2023-05-08, 02:20:15 UTC] {redshift_cluster.py:203} INFO - Status of cluster is pausing
[2023-05-08, 02:20:28 UTC] {taskinstance.py:1144} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: example_redshift.pause_cluster manual__2023-05-08T02:11:38.809301+00:00 [queued]>
[2023-05-08, 02:20:28 UTC] {taskinstance.py:1144} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: example_redshift.pause_cluster manual__2023-05-08T02:11:38.809301+00:00 [queued]>
[2023-05-08, 02:20:28 UTC] {taskinstance.py:1343} INFO - Resuming after deferral
[2023-05-08, 02:20:28 UTC] {taskinstance.py:1364} INFO - Executing <Task(RedshiftPauseClusterOperator): pause_cluster> on 2023-05-08 02:11:38.809301+00:00
[2023-05-08, 02:20:28 UTC] {standard_task_runner.py:57} INFO - Started process 328 to run task
[2023-05-08, 02:20:28 UTC] {standard_task_runner.py:84} INFO - Running: ['***', 'tasks', 'run', 'example_redshift', 'pause_cluster', 'manual__2023-05-08T02:11:38.809301+00:00', '--job-id', '17', '--raw', '--subdir', 'DAGS_FOLDER/example_redshift.py', '--cfg-path', '/tmp/tmp582aoom9']
[2023-05-08, 02:20:28 UTC] {standard_task_runner.py:85} INFO - Job 17: Subtask pause_cluster
[2023-05-08, 02:20:28 UTC] {task_command.py:410} INFO - Running <TaskInstance: example_redshift.pause_cluster manual__2023-05-08T02:11:38.809301+00:00 [running]> on host 9ed7c7091244
[2023-05-08, 02:20:28 UTC] {taskinstance.py:1863} ERROR - Task failed with exception
airflow.exceptions.TaskDeferralError: Trigger/execution timeout
[2023-05-08, 02:20:28 UTC] {taskinstance.py:1382} INFO - Marking task as FAILED. dag_id=example_redshift, task_id=pause_cluster, execution_date=20230508T021138, start_date=20230508T021920, end_date=20230508T022028

Notice that after the last "Status of cluster is pausing" message, there's nothing indicating that anything went wrong with the trigger (until the very end of the log).

So the author of that PR pointed out that it sorta makes it look like the trigger completed successfully and the task resumed.

The goal of this PR is to improve that somewhat.

Initially I was thinking we would be able to add a log message right there such as

[2023-05-08, 02:20:15 UTC] {redshift_cluster.py:203} INFO - Trigger cancelled due to timeout

or

[2023-05-08, 02:20:15 UTC] {redshift_cluster.py:203} INFO - Trigger cancelled: running elsewhere

Looking again, that might actually be more complicated and tougher to accomplish. At a minimum though, I think it would be good to add a log message here.

Something such as

except CancelledError:
    self.log.info("Trigger cancelled")

This alone would be an improvement.


I think in order to get more information than that, we obviously have to somehow get the cancellation reason forwarded to the location above.

One way we can do that is through a msg passed to task.cancel (the msg argument is supported since Python 3.9). Here's some sample code demonstrating that:

import asyncio

async def cancel_me():
    print('cancel_me(): before sleep')

    try:
        # Wait for 1 hour
        await asyncio.sleep(3600)
    except asyncio.CancelledError as e:
        print(f"cancel_me was cancelled: {e.args[0]!r}")
        raise
    finally:
        print('cancel_me(): after sleep')

async def main():
    # Create a "cancel_me" Task
    task = asyncio.create_task(cancel_me())

    # Wait for 1 second
    await asyncio.sleep(1)

    task.cancel("this is why")
    try:
        await task
    except asyncio.CancelledError:
        print("main(): cancel_me is cancelled now")

asyncio.run(main())

So that would take us to the place where we actually initiate cancellation. But at that location, we don't currently know why we are cancelling, so we'd have to figure out some way to get access to the needed information there.
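
To make that concrete, here's a small self-contained example shaped like the trigger/timeout case; everything in it (the fake trigger, the logger name, the timeout value, the log wording) is made up for illustration:

import asyncio
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("triggerer")

# Stand-in for a trigger's polling loop (not Airflow code).
async def fake_trigger_run():
    try:
        while True:
            log.info("still waiting on the external service...")
            await asyncio.sleep(1)
    except asyncio.CancelledError as err:
        # The reason passed to task.cancel() arrives in err.args (Python 3.9+).
        reason = err.args[0] if err.args else "unknown"
        log.info("Trigger cancelled; reason: %s", reason)
        raise

async def main():
    task = asyncio.create_task(fake_trigger_run())
    await asyncio.sleep(2.5)
    # The code that enforces the timeout would pass the reason along here.
    task.cancel("Trigger timeout exceeded")
    try:
        await task
    except asyncio.CancelledError:
        pass

asyncio.run(main())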


So, it's up to you, maybe you want to just get the quick win with the first option. Maybe you want to go for the more detailed solution.

@dstandish
Contributor Author

And to clarify your specific question re max attempts -- we don't want to do anything with max attempts here. The author of that PR was citing the logging deficiency as a reason why they wanted to monitor max attempts in the trigger -- because otherwise, they would get no messaging that the trigger was cancelled until the very end of the logs. We can remedy that here by adding a message about the cancellation at the time when the task is actually cancelled. Then there would be no need to worry about such messaging concerns in the trigger itself.

@syedahsn
Contributor

syedahsn commented Jun 6, 2023

This looks like a really good idea, and I think even the basic option creates a much better user experience. My question is about tracking the number of attempts. Is that inherently something we want to avoid? The current implementation uses both a timer as well as the number of attempts as a metric for determining how long a Trigger should run for. Isn't that better than having just the timer?
Also, would it be possible to yield a TriggerEvent (with an appropriate message) in the case of a CancelledError? That would be the most "natural" way to exit out of the Trigger.

@dstandish
Contributor Author

dstandish commented Jun 6, 2023

My question is about tracking the number of attempts. Is that inherently something we want to avoid? The current implementation uses both a timer as well as the number of attempts as a metric for determining how long a Trigger should run for. Isn't that better than having just the timer?

To clarify, it's not like it's my "secret master plan" to get rid of the tracking of attempts that you added. Let sleeping dogs lie. And it's definitely not something I would veto anyway.

Mainly it's just that your PR called my attention to this deficiency in logging. And we can improve it by adding a logging message.

But then perhaps in future work there won't be a need to add such logic.

Now, do we want to avoid adding max_attempts to the trigger? I'm sure opinions differ on this. Personally, I don't like it because it's redundant, potentially misleading, and just doesn't seem to me to have a reason to be there.

Maybe if we had this better logging when you were working on that PR, you would not have added the "max attempts" logic but instead simply counted the attempts and printed them to log... e.g.

counter = 0
while True:
    counter += 1
    self.log.info("checking service; attempt %s", counter)
    ...  # poll the service; yield a TriggerEvent and return when it's done

As discussed on the other PR, the way triggers are designed, enforcing max_attempts is somewhat redundant and, in special cases, misleading.

Come to think of it, it's also somewhat problematic when you think about what it means if you do some kind of exponential backoff. So, compared with max_attempts, an absolute timeout (which is already part of the interface) is simpler and more reliable / well-defined. So I wouldn't recommend using max_attempts on any new sensors. But of course, that's just me.
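
To put rough numbers on the backoff point, here's a quick back-of-the-envelope calculation; the 30-second base delay and the doubling are purely illustrative:

# With exponential backoff, the same max_attempts value maps to wildly
# different amounts of wall-clock time, which is why an absolute timeout
# is easier to reason about.
base_delay = 30  # seconds before the second attempt (illustrative)

for max_attempts in (5, 10, 15):
    # total sleep before giving up: 30 + 60 + 120 + ... = 30 * (2**n - 1)
    total_seconds = base_delay * (2 ** max_attempts - 1)
    print(f"{max_attempts} attempts -> ~{total_seconds / 3600:.1f} hours")
# prints roughly 0.3, 8.5, and 273 hours respectively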

@syedahsn
Contributor

syedahsn commented Jun 7, 2023

To clarify, it's not like it's my "secret master plan" to get rid of the tracking of attempts that you added

xD

Maybe if we had this better logging when you were working on that PR, you would not have added the "max attempts" logic but instead simply counted the attempts and printed them to log... e.g.

This is true - if the Trigger was more obvious about timing out, then I would have ignored the number of attempts. One of the main benefits I see of using the number of attempts is that it allows us to yield a TriggerEvent in case the Trigger times out (measured indirectly as exceeding the number of attempts). If it is possible to keep that behavior, that would be ideal.

@Lee-W
Member

Lee-W commented Jun 7, 2023

So, it's up to you, maybe you want to just get the quick win with the first option. Maybe you want to go for the more detailed solution.

Thanks for providing all the details 🙏 I think I'll go with the first option first (created PR #31757) and then take a deeper look afterward to see how I can add the more detailed solution.

@eladkal eladkal added kind:bug This is a clearly a bug area:async-operators AIP-40: Deferrable ("Async") Operators affected_version:main_branch Issues Reported for main branch and removed kind:meta High-level information important to the community labels Jun 10, 2023
Lee-W added a commit to astronomer/airflow that referenced this issue Jun 16, 2023
pankajkoti added a commit to astronomer/airflow that referenced this issue Aug 29, 2023
We tried to address apache#31720 in PR apache#31757. The issue talked about
adding a trigger cancelled log when the trigger times out, but we also
added a generic "Trigger cancelled" log. This log appears even in
successful runs of the triggers when they finish. This is confusing
some users, as the log level is Error and there are sometimes quite
a few log lines saying "Trigger cancelled; err=" giving no clue as to
what is happening.
So, I am removing this generic error log line, and we can add specific
cancel scenarios with detailed reasons when we implement those.
potiuk pushed a commit that referenced this issue Aug 29, 2023
We tried to address #31720 in PR #31757. The issue talked about
adding a trigger cancelled log when the trigger times out, but we also
added a generic "Trigger cancelled" log. This log appears even in
successful runs of the triggers when they finish. This is confusing
some users, as the log level is Error and there are sometimes quite
a few log lines saying "Trigger cancelled; err=" giving no clue as to
what is happening.
So, I am removing this generic error log line, and we can add specific
cancel scenarios with detailed reasons when we implement those.