[Actions] Treat failures as successes for Task Manager #109655
To preempt a question which will likely come up in review:
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
@elasticmachine merge upstream
LGTM! Verified that failed actions no longer leave task manager documents or `action_task_params` documents. Just a few comments about documentation and log messages. Nice work!
@@ -123,6 +124,14 @@ The root `status` indicates the `status` of the system overall.
The Runtime `status` indicates whether task executions have exceeded any of the <<task-manager-configuring-health-monitoring,configured health thresholds>>. An `OK` status means none of the threshold have been exceeded. A `Warning` status means that at least one warning threshold has been exceeded. An `Error` status means that at least one error threshold has been exceeded.
[IMPORTANT]
==============================================
Some tasks (such as <<action-types,connectors>>) will incorrectly report their status as successful even if the task failed.
I wonder if this part should be more generic, since Task Manager can run any type of task, and whether we should add a section to the Actions and Connectors docs that specifically references the event log.
Sure!
What are you thinking it should say? Actions are the only known culprit of this, and it does say "Some tasks" so that it doesn't read as an actions-only thing. WDYT?
I'm just wondering if it might lead people reading these docs to try to use the event log to look up failures for other tasks.
@gchaps Any suggestions for wording?
I need more context to provide wording.
Oh sorry, I missed this. I saw you added feedback. Did you still need more context?
Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
LGTM
lgtm
@elasticmachine merge upstream
💚 Build Succeeded
* Support retry with email as an example
* Fix tests
* Add logic to treat as failure if there is a retry
* Handle retry better
* Make this optional
* Tweaks
* Remove unnecessary code
* Fix existing tests
* Add some unit tests
* Add test
* Add doc note
* More docs
* PR feedback
* Update docs/management/action-types.asciidoc Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
* Update docs/management/action-types.asciidoc Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
* Update docs/management/action-types.asciidoc Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
* Update docs/management/action-types.asciidoc Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
* Update docs/management/action-types.asciidoc Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Relates to #55340
This PR changes how action execution failures are handled for Task Manager. Prior to this PR, if action execution threw an exception or returned an error, Task Manager would attempt to retry the task until the `maxAttempts` for the task definition was reached. If unable to retry, the task would be marked as `Failed` and stored forever as such.
Persisting these forever is problematic, especially during upgrades, when migrations are run across every saved object. The number of failed action task documents can grow rapidly depending on the rule and the number of actions assigned (especially considering that some rules allow "group by", where a single rule execution can spawn many action tasks). In addition, `action_task_params` saved objects are created alongside action task saved objects and are not removed if actions fail.
In #96971, we added a task that cleans up these failed action tasks, but we should really stop persisting them if we plan to just clean them up later anyway.
This PR aims to do that by telling Task Manager that the action was successful so it will remove it.
This does solve the problem described above, but it also introduces new challenges, mainly that the Task Manager health API will now report all failed actions as successful, which we will need to note in our documentation.
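As a rough illustration of the approach described above, here is a minimal sketch of how a task runner could translate an action failure into a "success" for the scheduler. All names here (`ActionResult`, `TaskOutcome`, `outcomeForTaskManager`, `runActionTask`) are hypothetical and are not Kibana's actual interfaces. The sketch assumes, based on the commit notes, that a failure is only surfaced as a failure when the action asks for a retry.

```typescript
// Hypothetical types for illustration only; not Kibana's real interfaces.
type ActionResult =
  | { status: 'ok' }
  | { status: 'error'; message: string; retry?: boolean | Date };

type TaskOutcome = 'success' | 'failure';

// Decide what to report to the (hypothetical) scheduler.
// A failed action is reported as a success unless it requested a retry,
// so the scheduler removes the task document instead of keeping it as failed.
function outcomeForTaskManager(result: ActionResult): TaskOutcome {
  if (result.status === 'ok') {
    return 'success';
  }
  const wantsRetry = result.retry !== undefined && result.retry !== false;
  return wantsRetry ? 'failure' : 'success';
}

// Run the executor and convert a thrown exception into an error result.
async function safeExecute(
  execute: () => Promise<ActionResult>
): Promise<ActionResult> {
  try {
    return await execute();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { status: 'error', message };
  }
}

// Execute an action task and return the outcome the scheduler should see.
async function runActionTask(
  execute: () => Promise<ActionResult>,
  log: (message: string) => void
): Promise<TaskOutcome> {
  const result = await safeExecute(execute);
  if (result.status === 'error') {
    // The failure is still logged so it is not silently lost.
    log(`action failed: ${result.message}`);
  }
  return outcomeForTaskManager(result);
}
```

With this shape, a scheduler that only sees `TaskOutcome` cannot tell a failed action from a successful one unless a retry was requested, which is the trade-off noted above for the health API.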
Testing
You need to create a scenario where actions are failing and ensure that failed actions aren't persisted in TM: