Ability to fire actions when an alert instance is resolved #49405

mikecote · 2019-10-25T20:41:33Z

The has been a lot of requests for users wanting to be notified whenever an alert instance is no longer considered "active". The determination of this is done whenever the event log logs a resolved-instance event here.

This issue should add a static action group called resolved that is available to all alert types and allows alert actions to attach themselves to it. This action group should fire using the same logic used to log resolved-instance and should leverage the UI work done in #64077 to add a "Resolved" option to the group dropdown.

NOTE: Some UI work will be blocked until #64077 is finished, the server side work can be done in parallel.

Steps:

1. Static resolved action group to fire when an alert instance is resolved Static resolved action group to fire when an alert instance is resolved #82366
2. Ability to assign alert actions to resolved action group in UI Ability to assign alert actions to resolved action group in UI #82367

Original description

One option is to create some sort of "resolved" action group. Alert actions can assign themselves to this new action group. The new action group would fire when an alert instance stops firing (considered resolved when nothing to alert on anymore).

Another option may be to give the alert type executor the tools it needs to manually mark alert instances as resolved.

Note for docs

User guides will need to be updated for this new feature. Maybe the PagerDuty connector docs should also be updated to indicate how to "resolve" a PagerDuty incident when the alert has recovered.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2019-10-25T20:41:44Z

Pinging @elastic/kibana-stack-services (Team:Stack Services)

pmuellr · 2020-04-10T12:51:55Z

Seems like the easiest thing to do is to handle it as a a "built-in" action group. An alert executor wouldn't have to do anything explicit. When an instanceId which had scheduled actions on it's previous turn did not schedule actions on it's current turn, we would treat this instanceId as resolved, and schedule the actions for the "resolved" action group. Every alert would have this action group available, for free.

Raises the following questions:

what is the name of this action group? Let's say it's "resolved" for the subsequent questions
can an alert type define an action group with this name? What other "reserved" action group names do we think we might have in the future?
can an alert type specifically schedule actions for this group itself?
should there be more knobs and dials, like being able to specify how many alert executions must not have scheduled actions before firing the "resolved" action group? Or perhaps time-based?

chrisronline · 2020-04-16T14:54:23Z

Stack Monitoring is doing this right now, but it's not the solution we want long term.

We are detecting two separate scenarios and firing actions for both:

We call scheduleActions with different data based on the above two scenarios.

Ideally, we don't have to worry about this and the alerting solution provides this for us.

It feels like something every solution will want, as it helps close the loop on the fixing the underlying issue causing the alert.

Zacqary · 2020-05-05T18:44:34Z

@chrisronline Did y'all figure out how to handle recovery alerts when a renotify period is specified? I'm trying to implement these for Metrics right now, and I can only get it to work without a renotify period. When I do specify one:

scheduleActions does not schedule any actions when it's called between renotify intervals
alertInstance.replaceState still behaves as expected, persisting the new state or clearing it based on whether scheduleActions is called
If The alert is not firing, but was firing the last time the alert ran occurs during the blackout period between renotifications, the user won't get the recovery alert
The next alert they receive will probably be another The alert is firing message, and we're back to square one of having no recovery alerts configured

I'm not sure how to get around this without full support for multiple action groups, which I believe is tracked in #64077

Zacqary · 2020-05-07T18:36:58Z

Spoke to @chrisronline and apparently they don't have a solution.

Right now a single action group has to do double-duty for both Fired and Recovered alerts, and the alerts can get swallowed if throttling is turned on.

We definitely need core support from the plugin for this.

mikecote · 2020-05-07T18:56:07Z

Thanks for the input @Zacqary. We're aiming to have a built-in solution in 7.9, stay tuned!

Zacqary · 2020-06-15T16:40:47Z

Any update on this? Still have 7.9 work in Metrics UI blocked by this so I'm wondering if y'all have a timetable

mikecote · 2020-06-16T19:05:37Z

@Zacqary The best update I have is aiming for late 7.9 or 7.10 at this point.

shahzad31 · 2020-09-04T14:07:21Z

We have received lots of uptime alerts users requests for this, i appreciate this being prioritised.

justinfiore · 2020-10-26T12:45:29Z

Any update on this issue? Did it make 7.9? Or will it be 7.10? Or later.
An alerting system that can't fire actions on resolution is of limited utility to our company.
This is why we use Grafana alerts on our Elastic data (which is not ideal for many reasons).
If this feature was implemented, we would definitely switch over to Kibana Alerting.

mikecote · 2020-10-26T12:49:21Z

Hi @justinfiore, this is planned to be part of the 7.11 release and to be worked on starting soon. Stay tuned!

This issue will rely on the UI work done in #64077 and add a "resolved" group to all types of alerts.

mikecote · 2020-11-05T15:26:44Z

I'm wondering if some types of alerts would like to opt-out of this functionality if ever it doesn't make sense to them?

Example:
If the tracking alert uses two action groups like inside / contained and outside a polygon. Would there be a use case for the resolved action group? Or would it be better to use inside / contained and resolved (to represent outside a polygon)?

cc @YulNaumenko

pmuellr · 2020-11-05T22:52:59Z

response from some questions I asked here #49405 (comment), as it relates to an initial implementation:

can an alert type specifically schedule actions for this group itself?

I think, no, for now. I think we could prevent this, but if we can't, that's fine for now as well.

should there be more knobs and dials, like being able to specify how many alert executions must not have scheduled actions before firing the "resolved" action group? Or perhaps time-based?

Not now.

And a response to a question from @mikecote in #49405 (comment)

I'm wondering if some types of alerts would like to opt-out of this functionality if ever it doesn't make sense to them?

I raised a question in PR #82645 (comment), regarding what events should be generated when an alert "switches" action groups while it's active. One option, is that it might "resolve" with the old group and then new-instance on the new group. I think maybe we need to answer that before wondering whether opt-out is something to consider.

But in any case, it would be opt-out at the alert type, you're thinking, right? So seems like it would be easy to add this later if we needed to.

pmuellr · 2020-11-05T23:56:15Z

@YulNaumenko were just chatting about what "context" variables would be available for the resolved action group. Answer (today): none (I think).

I hope we can make all the other variables available, except the context ones; they're set here:

kibana/x-pack/plugins/alerts/server/task_runner/transform_action_params.ts

Lines 45 to 56 in 26f79a6

    
             const variables = { 
        
               alertId, 
        
               alertName, 
        
               spaceId, 
        
               tags, 
        
               alertInstanceId, 
        
               context, 
        
               state, 
        
               params: alertParams, 
        
             }; 
        
             return Mustache.render(value, variables); 
        
           });

I'd hate to have to do more GETs to get the variable values for the bits listed there (including the state), since we presumably have all that info at some point during the call to the alert type executor, but it might end up being difficult to make it available where we need it. A couple more GETs for a resolved action execution would probably be worth it though, for now, if it made the code simpler.

In theory, we could probably arrange for the alert executors to make the context variables available in such a way that we could get them for the resolved action execution, but it would mean a change to the way the context is set in the alert type, so I'm hesitant to make a change like that now. Eg, a new API in the services object passed in to set the context variables, independent of scheduling actions.

mikecote · 2020-11-06T00:44:15Z

But in any case, it would be opt-out at the alert type, you're thinking, right? So seems like it would be easy to add this later if we needed to.

That's correct and agree it's something we can handle later. Looks like we can discuss it in #82792 🙂

I hope we can make all the other variables available, except the context ones; they're set here

I think I'm +1 on skipping context for now and we can revisit later.

YulNaumenko · 2020-11-06T15:43:24Z

I have a Draft PR where introduced the solution without the possibility to use the alert execution context. If we are OK with the approach, then we need to handle available variables at least in the UI, because it can raise a bunch of questions and SDHs.
Still need to figure out how to be with the alert types like tracking alert, at least need to make it works as by design with the current changes.

pmuellr · 2020-11-06T16:33:11Z

I have a Draft PR where introduced the solution without the possibility to use the alert execution context. If we are OK with the approach, then we need to handle available variables at least in the UI, because it can raise a bunch of questions and SDHs.

Yeah, I'm fine with this approach for now, don't see how we can get the context vars in there.

In terms of the UI, we don't actually have anything that prevents people from entering invalid variable names now, right? But we do have the list of variables that pops up, that can be inserted via selecting them. Can we have the UI identify that it's editing an action in the "resolved" group, and not list those variables in the pop-up? Absolute worst-case, since I think the plan is to somehow provide a link to doc for the available mustache variables per alert, we could note in that doc that the context.* variables are not available in the resolved action group.

Still need to figure out how to be with the alert types like tracking alert, at least need to make it works as by design with the current changes.

The geo-tracking alert? My understanding is that it will NEVER be in a resolved state, it would always be in an "inside" or "outside" action group. But they could perhaps change the design so that the customer selects if they want to run actions on "inside" or "outside", and when the alert "crosses the line" from the customer selected "side" to the other one, we'd fire the resolved group. I'm not seeing any issues with this alert regarding the resolved action group, but maybe I'm missing something ...

mukeshelastic · 2020-12-18T00:54:08Z

@arisonl In order for the auto-closing a PD incident when an alert is resolved, do users needs to do something in rule configuration? If not, great. If so, is that documented?

mikecote · 2020-12-18T12:24:33Z

@mukeshelastic users will need to do something in the rule configuration in order to make PD incidents close when an alert recovers. We've made the steps a few less clicks with this PR where the user just needs to add another PD action that "Runs When: Recovered" and the form settings will be populated with default values that resolve the incident.

The PR has a needs_docs label so we'll definitely clarify this in the documentation with the necessary steps.

arisonl · 2020-12-18T13:24:29Z

@mukeshelastic As Mike mentions, users will need to specify on the UI that they want the PD incident to be resolved when the alert recovers. This will be documented but we also plan to go beyond documentation by releasing additional "how-to" content around actions and integrations.

mikecote added Feature:Alerting Team:Stack Services labels Oct 25, 2019

mikecote mentioned this issue Nov 25, 2019

Resolved alert action group #47712

Closed

bmcconaghy added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Stack Services labels Dec 12, 2019

Zacqary mentioned this issue May 4, 2020

[Metrics Alerts] Send a recovery alert when an alert returns to an OK state #64351

Closed

sgrodzicki added this to the Metrics UI 7.9 milestone May 25, 2020

mikecote mentioned this issue Jun 2, 2020

Dependencies on Kibana Alerting #67992

Open

59 tasks

pmuellr mentioned this issue Jun 30, 2020

[Alerting] Send alert when Uptime determines monitor is back up #70165

Closed

sgrodzicki removed this from the Metrics UI 7.9 milestone Jul 14, 2020

mukeshelastic mentioned this issue Jul 14, 2020

[Logs Alerting] Send recovery alert notification when an alert condition is no longer true for a previusly triggered alert #71685

Closed

mikecote added the enhancement New value added to drive a business result label Aug 19, 2020

chrisronline mentioned this issue Sep 4, 2020

[Monitoring] Disk usage alerting #75419

Merged

gmmorris mentioned this issue Sep 17, 2020

Alerts don't group their incidents by a dedupKey / Object ID #77772

Closed

igoristic mentioned this issue Sep 29, 2020

[Monitoring][Alerting] Investigate a solution to avoid always creating new instances with replaceState #78724

Closed

mikecote mentioned this issue Oct 30, 2020

[Meta] UI ability to assign alert actions per action group #64077

Closed

4 tasks

mikecote mentioned this issue Nov 2, 2020

Static resolved action group to fire when an alert instance is resolved #82366

Closed

mikecote added the Meta label Nov 2, 2020

mikecote mentioned this issue Nov 2, 2020

Ability to assign alert actions to resolved action group in UI #82367

Closed

YulNaumenko mentioned this issue Nov 6, 2020

Added ability to fire actions when an alert instance is resolved #82799

Merged

YulNaumenko closed this as completed in #82799 Nov 14, 2020

YulNaumenko self-assigned this Nov 15, 2020

mikecote mentioned this issue Nov 16, 2020

Should we rename the resolved action group? #83464

Closed

mikecote mentioned this issue Nov 23, 2020

Ability to resolve ServiceNow, IBM Resilient and Jira incidents #83221

Open

mikecote added the needs_docs label Dec 8, 2020

mikecote mentioned this issue Dec 30, 2020

[Alerting] Enforces typing of Alert's ActionGroups #86761

Merged

8 tasks

gmmorris mentioned this issue Dec 30, 2020

RuleTypes can't provide an AlertContext on recovery #87048

Closed

3 tasks

kobelb added the needs-team Issues missing a team label label Jan 31, 2022

botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ability to fire actions when an alert instance is resolved #49405

Ability to fire actions when an alert instance is resolved #49405

mikecote commented Oct 25, 2019 •

edited

Loading

elasticmachine commented Oct 25, 2019

pmuellr commented Apr 10, 2020

chrisronline commented Apr 16, 2020

Zacqary commented May 5, 2020

Zacqary commented May 7, 2020

mikecote commented May 7, 2020

Zacqary commented Jun 15, 2020

mikecote commented Jun 16, 2020 •

edited

Loading

shahzad31 commented Sep 4, 2020

justinfiore commented Oct 26, 2020

mikecote commented Oct 26, 2020

mikecote commented Nov 5, 2020

pmuellr commented Nov 5, 2020

pmuellr commented Nov 5, 2020

mikecote commented Nov 6, 2020

YulNaumenko commented Nov 6, 2020

pmuellr commented Nov 6, 2020

mukeshelastic commented Dec 18, 2020

mikecote commented Dec 18, 2020 •

edited

Loading

arisonl commented Dec 18, 2020

Ability to fire actions when an alert instance is resolved #49405

Ability to fire actions when an alert instance is resolved #49405

Comments

mikecote commented Oct 25, 2019 • edited Loading

Note for docs

elasticmachine commented Oct 25, 2019

pmuellr commented Apr 10, 2020

chrisronline commented Apr 16, 2020

Zacqary commented May 5, 2020

Zacqary commented May 7, 2020

mikecote commented May 7, 2020

Zacqary commented Jun 15, 2020

mikecote commented Jun 16, 2020 • edited Loading

shahzad31 commented Sep 4, 2020

justinfiore commented Oct 26, 2020

mikecote commented Oct 26, 2020

mikecote commented Nov 5, 2020

pmuellr commented Nov 5, 2020

pmuellr commented Nov 5, 2020

mikecote commented Nov 6, 2020

YulNaumenko commented Nov 6, 2020

pmuellr commented Nov 6, 2020

mukeshelastic commented Dec 18, 2020

mikecote commented Dec 18, 2020 • edited Loading

arisonl commented Dec 18, 2020

mikecote commented Oct 25, 2019 •

edited

Loading

mikecote commented Jun 16, 2020 •

edited

Loading

mikecote commented Dec 18, 2020 •

edited

Loading