Display Action Group in Alert Details page #82275

Closed
gmmorris opened this issue Nov 2, 2020 · 16 comments · Fixed by #82645

Comments

@gmmorris
Contributor

gmmorris commented Nov 2, 2020

part of #64077

Addresses:

  • Including the Action Group in Event Log when Instances are activated
  • Display Action Group on Alert Instances in Alert Details page

see Meta issue for details

@gmmorris gmmorris added enhancement New value added to drive a business result Feature:Actions Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Nov 2, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@ymao1 ymao1 self-assigned this Nov 2, 2020
@ymao1
Contributor

ymao1 commented Nov 3, 2020

A few questions:

  1. Currently, if an alert instance is muted, the actions don't run when the instance becomes active. The logic is skipped altogether so we won't know which action group would have run. Do we want to change that so that we do know what would have been run without actually running the action? Or skip showing the action group for muted alert instances?

  2. Will there ever be a case where an alert instance becomes active and sets off two action groups? I.e., will we be allowing overlapping conditions for action group buckets?

@mikecote
Contributor

mikecote commented Nov 3, 2020

  1. Currently, if an alert instance is muted, the actions don't run when the instance becomes active. The logic is skipped altogether so we won't know which action group would have run. Do we want to change that so that we do know what would have been run without actually running the action? Or skip showing the action group for muted alert instances?

My guess would be the former where we do track in some way while not executing the actions. From what I recall, this is the primary difference between muting and disabling (we still log and track what group the alert would be part of yet don't execute the actions). Though I'm sure @pmuellr has put more thought on this than I have.

  2. Will there ever be a case where an alert instance becomes active and sets off two action groups? I.e., will we be allowing overlapping conditions for action group buckets?

Not at this time. We have safeguards in place to prevent that here: https://github.com/elastic/kibana/blob/master/x-pack/plugins/alerts/server/alert_instance/alert_instance.ts#L69.
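
For readers who don't follow the link, here is a minimal sketch of what such a safeguard could look like: an instance refuses a second call to schedule actions within one execution, so only one action group can be set. The class name, method names, and error message are hypothetical stand-ins, not the code at the linked line.

```typescript
// Hypothetical, simplified stand-in for an alert instance; not the Kibana class.
class SketchAlertInstance {
  private scheduledGroup?: string;

  // Records the action group to run and refuses a second call,
  // so one instance can never be assigned two action groups.
  scheduleActions(actionGroup: string): this {
    if (this.scheduledGroup !== undefined) {
      throw new Error('actions already scheduled for this instance; cannot schedule twice');
    }
    this.scheduledGroup = actionGroup;
    return this;
  }

  // Returns the scheduled group, or undefined if nothing was scheduled.
  getScheduledGroup(): string | undefined {
    return this.scheduledGroup;
  }
}
```

Throwing rather than silently overwriting makes an overlapping-condition bug in an alert type visible at execution time.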

@pmuellr
Member

pmuellr commented Nov 3, 2020

  1. Currently, if an alert instance is muted, the actions don't run when the instance becomes active. The logic is skipped altogether so we won't know which action group would have run. Do we want to change that so that we do know what would have been run without actually running the action? Or skip showing the action group for muted alert instances?

My guess would be the former where we do track in some way while not executing the actions. From what I recall, this is the primary difference between muting and disabling (we still log and track what group the alert would be part of yet don't execute the actions). Though I'm sure @pmuellr has put more thought on this than I have.

I agree. Would be super nice to know what would be getting triggered, during a mute. No idea how we might make this happen tho. I think an active-instance event log record should be available, which should eventually have the action group in there (doesn't today). That doesn't sound like fun though, looking for those ... Another thought would be expecting this info in the alert state instead, so we wouldn't have to troll through the event log. Not sure the action group is in there, or if it even makes sense for it to go in there.

@pmuellr
Member

pmuellr commented Nov 3, 2020

Ya, it does look like maybe it's in the alert state today. This is from an alert with one active instance:

$ curl -k $KBN_URLBASE/api/alerts/alert/fedea4aa-dbed-4274-a92e-2b5af3f5053f/state | json
{
  "alertInstances": {
    "host-1": {
      "state": {},
      "meta": {
        "lastScheduledActions": {
          "group": "threshold met",
          "date": "2020-11-03T18:17:27.360Z"
        }
      }
    }
  },
  "previousStartedAt": "2020-11-03T18:17:27.256Z"
}

I think lastScheduledActions implies "this is the state the last time the executor ran", vs "here's a list of every instance (eg, host-1) and the info about the last time it scheduled an action".

But of course, we aren't scheduling an action, so this field likely isn't set when muted or throttled, but ???. We might want to add a property as a peer of lastScheduledActions which would be lastActiveGroup or something, and have the same info. Or perhaps we could add a boolean to the current { group, date } object, which would indicate whether the action was scheduled or not (not implying muting or throttling).
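
To make the shape of the state payload concrete, here is a rough sketch of how a consumer might read the last scheduled group per instance. The interfaces model only the fields visible in the JSON above, and `lastScheduledGroups` is a hypothetical helper, not an existing Kibana function. An instance with no `lastScheduledActions` maps to `undefined`.

```typescript
// Shapes mirror the JSON shown above; only the fields used here are modelled.
interface LastScheduledActions {
  group: string;
  date: string;
}

interface InstanceSnapshot {
  state: Record<string, unknown>;
  meta: { lastScheduledActions?: LastScheduledActions };
}

interface AlertStateSnapshot {
  alertInstances?: Record<string, InstanceSnapshot>;
  previousStartedAt?: string;
}

// Hypothetical helper: maps instance id -> last scheduled group, or undefined
// when no action has been recorded as scheduled for that instance.
function lastScheduledGroups(state: AlertStateSnapshot): Record<string, string | undefined> {
  const result: Record<string, string | undefined> = {};
  for (const [id, instance] of Object.entries(state.alertInstances ?? {})) {
    result[id] = instance.meta.lastScheduledActions?.group;
  }
  return result;
}
```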

@ymao1
Contributor

ymao1 commented Nov 3, 2020

We might want to add a property as a peer of lastScheduledActions which would be lastActiveGroup or something, and have the same info. Or perhaps we could add a boolean to the current { group, date } object, which would indicate whether the action was scheduled or not (not implying muting or throttling).

I like the idea of adding a lastActiveGroup property to keep track of the actions that would have been executed if the alert hadn't been muted. That way we could still use lastScheduledActions to know when the last scheduled action execution was, which may be useful to know.

@pmuellr
Member

pmuellr commented Nov 3, 2020

I like the idea of adding a lastActiveGroup property to keep track of the actions that would have been executed if the alert hadn't been muted. That way we could still use lastScheduledActions to know when the last scheduled action execution was, which may be useful to know.

Ya, we'll have to take a closer look at lastScheduledActions, and see if it's doing what it says on the tin (only noting last time it was scheduled, vs last time it was active) - it may already be doing what we want, and is just not named well. But I'd guess that's not the case, and a set of peer objects for instances that are active but not scheduled would make sense.

@pmuellr
Member

pmuellr commented Nov 3, 2020

hahahahaha - issue #57173 (but different naming concern)

took a quick look at the code, looks like the name is correct in that it only captures info on instances with scheduled actions, not muted or throttled ones:

if (!muteAll) {
  const mutedInstanceIdsSet = new Set(mutedInstanceIds);
  await Promise.all(
    Object.entries(instancesWithScheduledActions)
      .filter(
        ([alertInstanceName, alertInstance]: [string, AlertInstance]) =>
          !alertInstance.isThrottled(throttle) && !mutedInstanceIdsSet.has(alertInstanceName)
      )
      .map(([id, alertInstance]: [string, AlertInstance]) =>
        this.executeAlertInstance(id, alertInstance, executionHandler)
      )
  );
}

If you look at executeAlertInstance(), it's the only code that updates the lastScheduledActions via alertInstance.updateLastScheduledActions()
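
The effect of that filter can be sketched in isolation. The snippet below is a simplified model, not the task runner code: `SketchInstance`, `Partition`, and `partitionInstances` are hypothetical names, and throttling is reduced to a boolean. It shows which instance ids would reach `executeAlertInstance()` and which would be skipped, leaving their `lastScheduledActions` untouched.

```typescript
// Hypothetical, simplified stand-in; the real AlertInstance carries more state.
interface SketchInstance {
  throttled: boolean;
}

interface Partition {
  scheduled: string[]; // ids that would reach executeAlertInstance()
  skipped: string[]; // muted or throttled ids; lastScheduledActions not updated
}

function partitionInstances(
  instances: Record<string, SketchInstance>,
  muteAll: boolean,
  mutedInstanceIds: string[]
): Partition {
  const muted = new Set(mutedInstanceIds);
  const out: Partition = { scheduled: [], skipped: [] };
  for (const [id, instance] of Object.entries(instances)) {
    // Same three conditions as the excerpt: alert not muted, instance not
    // throttled, instance not individually muted.
    const runs = !muteAll && !instance.throttled && !muted.has(id);
    (runs ? out.scheduled : out.skipped).push(id);
  }
  return out;
}
```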

@pmuellr
Member

pmuellr commented Nov 3, 2020

Another thought on the implementation - while a "peer" object of lastScheduledActions sounds appealing, it's going to be a bit of a PITA I'd think trying to combine or pull the data from two separate objects, for use in the UI. I wonder if we should put all the instances in there, and add some additional properties to the object, indicating if the instance didn't get scheduled because the alert was throttled, because the alert was muted, or because the alert instance was muted. We could make these "default" to false, so in practice there would only be one of those properties actually in the object, and it would have a value of true. Or we could have a property which indicated it wasn't scheduled, with a "reason" for why it wasn't scheduled, eg,

notScheduled: 'alert-throttled' | 'alert-muted' | 'instance-muted'
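
As a rough sketch of how that single "reason" property might be computed, assuming the three inputs the task runner already has (alert-level mute, per-instance mute list, throttle check). `ScheduleContext` and `notScheduledReason` are hypothetical names, and the precedence among the three reasons is an assumption the thread does not settle.

```typescript
type NotScheduledReason = 'alert-throttled' | 'alert-muted' | 'instance-muted';

// Hypothetical inputs for one instance in one execution.
interface ScheduleContext {
  muteAll: boolean;
  mutedInstanceIds: ReadonlySet<string>;
  throttled: boolean;
}

// Returns undefined when the instance's actions would be scheduled.
// Precedence (alert mute, then instance mute, then throttle) is an assumption.
function notScheduledReason(
  instanceId: string,
  ctx: ScheduleContext
): NotScheduledReason | undefined {
  if (ctx.muteAll) return 'alert-muted';
  if (ctx.mutedInstanceIds.has(instanceId)) return 'instance-muted';
  if (ctx.throttled) return 'alert-throttled';
  return undefined;
}
```

A single optional property keeps the stored object small: scheduled instances carry nothing extra, and the UI only needs one field to explain a skipped one.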

@ymao1
Contributor

ymao1 commented Nov 3, 2020

I was thinking we should try to get the action group id into the event log when event.action = active-instance, vs using the alert state directly in the alert details (which is what I think you're suggesting?). The reason is that the alert state is a snapshot of the alert at the current point in time, whereas if we eventually add time filtering to the alert details view, we'll want to pull historical action group id info for when alert instances went active during a past time period. Am I thinking about this correctly?
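
A minimal sketch of the kind of lookup this would enable, working over an in-memory list of events rather than a real event log query. The `InstanceEvent` shape is hypothetical and flattened, and `actionGroupId` is the field proposed in this thread, not one the event log is known to carry today.

```typescript
// Hypothetical flattened event shape for illustration only.
interface InstanceEvent {
  timestamp: string; // ISO 8601
  action: string; // e.g. 'active-instance'
  instanceId: string;
  actionGroupId?: string; // proposed field, assumed here
}

// For each instance, returns the action group of its most recent
// active-instance event inside the half-open range [start, end).
function activeGroupsInRange(
  events: InstanceEvent[],
  start: Date,
  end: Date
): Map<string, string> {
  const latest = new Map<string, string>();
  const sorted = [...events].sort(
    (a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp)
  );
  for (const e of sorted) {
    if (e.action !== 'active-instance' || e.actionGroupId === undefined) continue;
    const t = Date.parse(e.timestamp);
    if (t < start.getTime() || t >= end.getTime()) continue;
    latest.set(e.instanceId, e.actionGroupId); // later events overwrite earlier ones
  }
  return latest;
}
```

The alert state alone could not answer this for a past window, since it only holds the latest values.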

@ymao1
Contributor

ymao1 commented Nov 3, 2020

Do you think it would be clearer to have a lastActiveActionGroup?

"instance-name": {
    "state": {},
    "meta": {
        "lastActiveActionGroup": {
            "group": "threshold met",
            "date": "2020-11-03T18:17:27.360Z",
            "status": 'scheduled' | 'alert-throttled' | 'alert-muted' | 'instance-muted'
        }
    }
}

@pmuellr
Member

pmuellr commented Nov 3, 2020

I was thinking we should try to get the action group id into the event log when event.action = active-instance vs using the alert state directly in the alerts details (which is what I think you're suggesting?)

Ah, hmmm. Maybe ...

So, then the next question is, do we write active-instance event log docs for muted / throttled instances ...

It seems like we should, if we aren't. So ... yeah, I think any instance that shows up on this page will be from one of the ***-instance event log docs, and so if they have an action group in them, seems like this would work ...

There's an advantage to keeping this stuff in the alert state itself, as we don't have to query and process the event log to get the info. But since we're already doing that on this page, shouldn't impact performance / latency from the current situation. And we could add it to the alert state later, if we wanted "instant" access to it.

@pmuellr
Member

pmuellr commented Nov 3, 2020

Do you think it would be clearer to have a lastActiveActionGroup?

Let's see if we can plumb the data we need just through the event log for now, will be the least disruptive change I think. And we def want the action group in the event log docs anyway. If we do need to put something in the state, we'll probably want to evaluate a couple of different shapes ...

@pmuellr
Member

pmuellr commented Nov 3, 2020

Another thought, but probably another issue/PR. We probably don't show muted state per instance in the alert details right now, but should. Longer term, we talked about throttling instances, and also "snoozing" (probably alerts and instances, where snooze is like a timed mute, different than throttle as throttle will "disengage" when the alert (/instance) status changes or the action group it's scheduling changes).

@ymao1
Contributor

ymao1 commented Nov 4, 2020

Currently, if an alert instance is muted, the actions don't run when the instance becomes active. The logic is skipped altogether so we won't know which action group would have run.

I was incorrect when I said this. We do have the information about which action group would have run for muted alerts, and muted active alerts are already being written to the event log.

@mikecote
Contributor

mikecote commented Nov 4, 2020

I just caught up with the thread, I'm +1 on exploring the event log path to store the information.

My gut feeling is telling me once we have event log UIs, it could be useful history information to display to users.

We could develop a bar chart of the active instances in the alert details page and group the count by action group (2 hrs ago: 3 severe, 1 warning; 1 hr ago: 1 severe, 2 warning; etc.). I think this piece was part of some original designs: #56280 (comment).
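
The aggregation behind such a chart could be sketched as below: bucket active-instance events by UTC hour and count them per action group. The `ActiveEvent` shape and `countByHourAndGroup` are hypothetical, and in practice this would more likely be an aggregation in the event log query than client-side code.

```typescript
// Hypothetical minimal event shape; actionGroupId is assumed to be present.
interface ActiveEvent {
  timestamp: string; // ISO 8601
  actionGroupId: string;
}

// Returns a map from the ISO timestamp of each hour's start (UTC) to the
// number of active-instance events per action group in that hour.
function countByHourAndGroup(events: ActiveEvent[]): Map<string, Record<string, number>> {
  const buckets = new Map<string, Record<string, number>>();
  for (const e of events) {
    const hour = new Date(e.timestamp);
    hour.setUTCMinutes(0, 0, 0); // truncate to the start of the hour
    const key = hour.toISOString();
    const counts = buckets.get(key) ?? {};
    counts[e.actionGroupId] = (counts[e.actionGroupId] ?? 0) + 1;
    buckets.set(key, counts);
  }
  return buckets;
}
```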

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022