[discuss] extending event log for faster/easier access to active instance date information #93704

pmuellr · 2021-03-04T22:49:19Z

Currently in alerting, we generate events for the event log as follows:

execute - when the alert executes
active-instance - for every active instance, after alert execution
new-instance - for every newly found instance, after alert execution
recovered-instance - for all previously active instances, which are no longer active, after the alert execution

These are all "stateless" documents, so to get the duration of an active instance, you need to get the new-instance document to figure out when it started. Likewise, if you wanted to know the duration of the last time an instance was active, you'd have to start with its recovered-instance document, and then search back in time to find the new-instance document. We've built some aggs to do this, but ... it's complicated - see alerts_instance_summary_from_event_log.ts in PR #89681.
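To make the "search back in time" concrete, here is a minimal in-memory sketch of the lookup the stateless documents force on a reader. The `InstanceEvent` shape and the `lastActiveDuration` helper are hypothetical and only for illustration; the real implementation uses Elasticsearch aggregations rather than an array scan.

```typescript
// Hypothetical, simplified event shape; not the actual event log schema.
type InstanceAction = 'new-instance' | 'active-instance' | 'recovered-instance';

interface InstanceEvent {
  timestamp: string; // ISO 8601 timestamp of the event
  action: InstanceAction;
  instanceId: string;
}

// Returns the length (in milliseconds) of the most recent completed active
// span for an instance, or undefined if no such span can be reconstructed.
function lastActiveDuration(events: InstanceEvent[], instanceId: string): number | undefined {
  const own = events
    .filter((e) => e.instanceId === instanceId)
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));

  // Step 1: find the most recent recovered-instance document.
  let recoveredIndex = -1;
  for (let i = own.length - 1; i >= 0; i--) {
    if (own[i].action === 'recovered-instance') {
      recoveredIndex = i;
      break;
    }
  }
  if (recoveredIndex < 0) return undefined;

  // Step 2: search back in time for the new-instance document that started it.
  for (let i = recoveredIndex - 1; i >= 0; i--) {
    if (own[i].action === 'new-instance') {
      return Date.parse(own[recoveredIndex].timestamp) - Date.parse(own[i].timestamp);
    }
  }
  return undefined;
}
```

Every duration needs two correlated lookups over separate documents, which is what makes the aggregation version hard to follow.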

The "stateless" events were easy to implement, as none of the *-instance documents had to know anything about the previous state, they just wrote what they knew at the time. However, this greatly complicates trying to calculate these date ranges, which we show in the alert details page. We optimized writing the documents, and made it really hard to pull useful information back out.

It struck me that we are not using the event fields start, end, and duration for these *-instance events, since they are point-in-time events. The action and alert execute events, however, do use those fields to document the execution time of the alert and action type executors.

Maybe we should start using those fields for the *-instance events as well?

The new-instance event would not use those fields, but the active-instance and recovered-instance could. Both active-instance and recovered-instance could store the timestamp of the new-instance event in start, and then duration would essentially be currentTime - start. The recovered-instance event could certainly store the end date as well, but not sure about putting that in active-instance, since the active state has not really "ended" yet - but that's just semantics. Would it be weird to have start and duration but no end? It may also be confusing to have the duration in active-instance, since it's really just the "duration relative to the event's timestamp", and so would be changing for every subsequent active-instance document. But would be very useful to have.
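As a rough sketch of the proposal above, the timing fields for each *-instance event could be derived like this. The `instanceEventTiming` function and the `InstanceEventTiming` type are hypothetical names, and the duration is in milliseconds purely for readability; the unit actually stored in the event log is not specified here.

```typescript
// Hypothetical types for illustration only.
type InstanceEventAction = 'new-instance' | 'active-instance' | 'recovered-instance';

interface InstanceEventTiming {
  start?: string;    // timestamp of the new-instance event
  end?: string;      // only set once the instance has recovered
  duration?: number; // currentTime - start, in milliseconds (assumed unit)
}

function instanceEventTiming(
  action: InstanceEventAction,
  startedAt: Date | undefined, // undefined when the start time is unknown
  currentTime: Date
): InstanceEventTiming {
  // new-instance events stay point-in-time; unknown starts yield no fields.
  if (action === 'new-instance' || startedAt === undefined) {
    return {};
  }

  const timing: InstanceEventTiming = {
    start: startedAt.toISOString(),
    duration: currentTime.getTime() - startedAt.getTime(),
  };

  // Only a recovered instance has really "ended".
  if (action === 'recovered-instance') {
    timing.end = currentTime.toISOString();
  }
  return timing;
}
```

In this sketch an active-instance event carries start and duration but no end, which is the open question raised above; its duration is relative to that event's own timestamp and grows with each execution.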

It seems like this would make calculation of the data for the alert details page a lot easier, since it wouldn't involve having to do searches over the new-instance documents at all.

It would also be more useful when accessing the event log via Discover or Lens, since the duration is available for the interesting events, without having to search for earlier new-instance events.

I think this would involve storing the new-instance timestamp in the instance state, which I believe is typed here: alert_instance.ts. That seems straightforward. We would need to deal with migration issues - older events and older instance state won't have these fields, so we can't rely on them ALWAYS being there.
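A minimal sketch of what carrying the start time in instance state might look like, including the migration concern. The `startedAt` field name and both helpers are assumptions for illustration; they are not taken from alert_instance.ts.

```typescript
// Hypothetical instance state shape; `startedAt` is an assumed field name.
interface TimedInstanceState {
  startedAt?: string; // optional: state written by older versions lacks it
  [key: string]: unknown;
}

// Called once per execution for each active instance.
function nextInstanceState(
  previous: TimedInstanceState | undefined,
  currentTime: Date
): TimedInstanceState {
  // No previous state means this is a new instance: record when it started.
  if (previous === undefined) {
    return { startedAt: currentTime.toISOString() };
  }
  // Otherwise carry the state forward unchanged. Older state without
  // startedAt stays without it rather than getting a made-up start time.
  return { ...previous };
}

// Defensive read: returns undefined for missing or unparseable values.
function readStartedAt(state: TimedInstanceState): Date | undefined {
  if (typeof state.startedAt !== 'string') return undefined;
  const parsed = new Date(state.startedAt);
  return Number.isNaN(parsed.getTime()) ? undefined : parsed;
}
```

Readers of the state and of the event log would treat a missing start as "unknown" rather than as an error, so documents written before the change remain usable.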

elasticmachine · 2021-03-04T22:49:21Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

pmuellr · 2021-03-30T11:30:55Z

One of the thoughts from RAC is to be able to uniquely identify a "span" of active-instance events - not clear what the id should be, probably a UUID. Although it could be as simple as instanceId + date, which should be unique. Presumably this would be an additional "field" we'd make available in the context and alert instance state.
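For the simpler of the two options, a span id built from instanceId + date could look like the sketch below. The `activeSpanId` name and the `:` separator are assumptions; a randomly generated UUID stored in the instance state would serve the same purpose.

```typescript
// Hypothetical helper: derives a span id from the instance id and the
// timestamp at which the instance became active.
function activeSpanId(instanceId: string, startedAt: Date): string {
  return `${instanceId}:${startedAt.toISOString()}`;
}
```

Because the id depends only on the instance and its start time, every active-instance event in the same span gets the same value, and a later span for the same instance gets a different one.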

pmuellr added the discuss, Team:ResponseOps, and Feature:EventLog labels Mar 4, 2021

pmuellr mentioned this issue Mar 16, 2021

[Alerting] getAlertStatus() should exhaustively search through event log #74860

Open

pmuellr mentioned this issue Mar 31, 2021

[RAC] Rule registry plugin #95903

Merged

2 tasks

This was referenced Apr 23, 2021

[Discuss] Alerting - determining time length of firing actions #55734

Closed

[Discuss] Alert only after the metrics threshold is met X times - Customer request #89152

Closed

pmuellr mentioned this issue May 4, 2021

[RAC][Epic] Observability of the alerting framework phase 1 #98902

Closed

ymao1 self-assigned this Jun 1, 2021

ymao1 mentioned this issue Jun 7, 2021

[Alerting][Event log] Persisting duration information for active alerts in event log #101387

Merged

1 task

ymao1 mentioned this issue Jun 9, 2021

[Alerting][Event Log] Consider adding uuid to active alert spans #101749

Closed

ymao1 closed this as completed in #101387 Jun 9, 2021

kobelb added the needs-team label Jan 31, 2022

botelastic bot removed the needs-team label Jan 31, 2022
