[Alert Summaries] Add a data provider so the rule registry can provide alerts over a time span (or past execution) #143374

ersin-erdal · 2022-10-14T13:46:16Z

To add the alerts (as an output of the Alert Summaries feature) to the action context, we need to fetch them from the alerts-as-data index. Since the rule registry already writes the alerts-as-data documents, we can add a new provider there to read the data as well.

We need a function attached to the rule type definition (automated by rule registry) that provides a search function to gather a series of alerts created on the last run or over a certain time span. The search function should be able to return new, ongoing and recovered alerts for a given time (last run or date range).

See #143200 for terminology of new, ongoing and recovered to help build queries.

mikecote · 2022-10-18T10:58:23Z

I'm hoping we have the necessary date fields in the alerts-as-data indices to make these queries:

New: based on the date created
Ongoing: based on the last detected
Recovered: based on the recovery date

ymao1 · 2022-10-18T15:44:17Z

The rule registry already provides a RuleDataClient with a getReader() function that returns a search() function that takes a query. The search is pre-scoped to the alerts as data indices associated with the rule data client. I would like to find a way to expose this search function to the alerting framework so we don't have to recreate the logic of figuring out what index name to search for a specific rule type. Then we could have some query builder functions to build specific queries to be passed into the search function. WDYT of this approach @ersin-erdal @mikecote

ymao1 · 2022-10-18T15:55:36Z

It looks like the @timestamp for an alert document is set/updated to the task.startedAt time each time it is written out. So new alerts will set @timestamp: task.startedAt and then if the alert is active in the next run (ongoing), the @timestamp will get updated.

New alerts set kibana.alert.start to the @timestamp value.
Ongoing alerts keep the kibana.alert.start field the same
Recovered alerts set a kibana.alert.end to the @timestamp value

With this combination of timestamps, I think we have enough to query for what we need.

I think for the previous run, this should be pretty straightforward.

For an arbitrary time range, we should be able to query the @timestamp using that range to get all alerts created/updated during that time range. Then based on the kibana.alert.start time, we can determine whether the alert was new or ongoing during that time range. Using the kibana.alert.end time, we can determine whether the alert recovered during that time range.

Does that align with the requirements?
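The @timestamp range query described above could be built roughly as follows. This is only a sketch: `TimeRange` and `buildTimeRangeQuery` are hypothetical names, ISO-8601 strings are assumed for the bounds, and the result is a plain Elasticsearch `bool`/`range` clause that would still need to be passed to whatever search function ends up being exposed.

```typescript
// Hypothetical time window; ISO-8601 strings are an assumption of this sketch.
interface TimeRange {
  start: string;
  end: string;
}

// Builds a query clause matching alert documents whose @timestamp
// (set or updated on each rule run) falls inside the window, inclusive.
function buildTimeRangeQuery(range: TimeRange) {
  return {
    bool: {
      filter: [
        { range: { '@timestamp': { gte: range.start, lte: range.end } } },
      ],
    },
  };
}

// Example: everything written or updated during the last 24 hours.
const windowEnd = new Date();
const windowStart = new Date(windowEnd.getTime() - 24 * 60 * 60 * 1000);
const lastDayQuery = buildTimeRangeQuery({
  start: windowStart.toISOString(),
  end: windowEnd.toISOString(),
});
console.log(JSON.stringify(lastDayQuery, null, 2));
```

The documents returned by such a query would then be split into new, ongoing and recovered using kibana.alert.start and kibana.alert.end.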

mikecote · 2022-10-18T16:11:55Z

> The rule registry already provides a RuleDataClient with a getReader() function that returns a search() function that takes a query. The search is pre-scoped to the alerts as data indices associated with the rule data client. I would like to find a way to expose this search function to the alerting framework so we don't have to recreate the logic of figuring out what index name to search for a specific rule type. Then we could have some query builder functions to build specific queries to be passed into the search function. WDYT of this approach @ersin-erdal @mikecote

I'm good with whatever is simple. When the FAAD is in place, we'll be removing this code, so whatever is easier will help us. As for finding a way to expose this search to the framework, I'll post my idea but don't feel obligated to use it:

- Add an optional function to the TypeScript rule type definition (ex: getSummarizedAlerts(startTime, endTime))
- In the rule registry, define the function to get the summarized alerts based on a given time range (or other necessary params) and apply it to all rule types (persistent and lifecycle). This approach allows us to keep the RuleDataClient / search / query building within the rule registry for now.
- Alert summaries only work for rule types that provide this function for now (i.e. those using rule registry)
- The function returns three arrays (new, ongoing, recovered) where:
  - new has a kibana.alert.start within the range
  - ongoing has a kibana.alert.start before the startTime and no value for kibana.alert.end
  - recovered has a kibana.alert.end within the range
  - *duplicates among them are OK (ex: alert is in the "new" and "recovered" arrays)

*endTime may not be necessary as a parameter since the functionality always looks back from "now"
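As a rough illustration of the three arrays described above, here is a minimal in-memory sketch. `AlertDoc`, `SummarizedAlerts`, `RuleTypeSketch` and `summarizeAlerts` are hypothetical names, not the actual rule registry types, and a real implementation would query Elasticsearch rather than filter an array.

```typescript
// Simplified alert document; only the fields discussed in this thread.
interface AlertDoc {
  'kibana.alert.uuid': string;
  'kibana.alert.start': string; // ISO-8601
  'kibana.alert.end'?: string; // only set once the alert has recovered
}

interface SummarizedAlerts {
  new: AlertDoc[];
  ongoing: AlertDoc[];
  recovered: AlertDoc[];
}

// Hypothetical shape of the optional function on a rule type definition.
interface RuleTypeSketch {
  id: string;
  getSummarizedAlerts?: (startTime: Date, endTime: Date) => Promise<SummarizedAlerts>;
}

// Applies the three criteria; an alert may land in more than one array.
function summarizeAlerts(alerts: AlertDoc[], startTime: Date, endTime: Date): SummarizedAlerts {
  const from = startTime.getTime();
  const to = endTime.getTime();
  const within = (iso: string | undefined): boolean => {
    if (iso === undefined) return false;
    const t = Date.parse(iso);
    return t >= from && t <= to;
  };
  return {
    // started inside the range
    new: alerts.filter((a) => within(a['kibana.alert.start'])),
    // started before the range and has not recovered
    ongoing: alerts.filter(
      (a) => Date.parse(a['kibana.alert.start']) < from && a['kibana.alert.end'] === undefined
    ),
    // recovered inside the range
    recovered: alerts.filter((a) => within(a['kibana.alert.end'])),
  };
}
```

An alert that starts and recovers inside the range shows up in both the new and recovered arrays, which matches the note that duplicates are acceptable.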

> For an arbitrary time range, we should be able to query the @timestamp using that range to get all alerts created/updated during that time range. Then based on the kibana.alert.start time, we can determine whether the alert was new or ongoing during that time range. Using the kibana.alert.end time, we can determine whether the alert recovered during that time range.
>
> Does that align with the requirements?

Makes sense to me, it matches the mental model I have.

ersin-erdal · 2022-10-19T15:08:51Z

Should we think about capping the data?
For example, don't report more than 2000 alerts. If so, we also need to return the actual number of alerts.

ymao1 · 2022-10-19T15:12:54Z

With the alert circuit breaker in place, which defaults to 1000, I believe the most we would return is 1000 new/active and 1000 recovered.

ymao1 · 2022-10-20T18:35:40Z

@ersin-erdal I wasn't thinking about the query per time range, which can return many more alerts.

Right now, I'm querying based on time range and then splitting up the results into new/ongoing/recovered. If we're capping the number of results returned but still want the actual number of new/ongoing/recovered, we would have to push the condition to ES via a scripted query. Are we ok with scripted queries? They tend to be less performant.

cc @mikecote

mikecote · 2022-10-20T20:55:55Z

Good catch, @ymao1! I think it would be valuable to get the total count while limiting how many new, ongoing and recovered alerts we return. You can also weigh multiple queries vs scripted queries if ever one is easier and more performant. We can also apply the 1000 limit to each array (new, ongoing, recovered).
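One way to picture the "multiple queries" option is three separate search bodies, each capped with `size` and each asking Elasticsearch for an exact total via `track_total_hits`. This is a sketch under assumptions, not the merged implementation: `buildSummaryQueries` and `MAX_ALERTS_PER_GROUP` are hypothetical names, and the conditions simply mirror the new/ongoing/recovered criteria listed earlier in the thread.

```typescript
// Assumed cap per group, taken from the 1000 limit discussed above.
const MAX_ALERTS_PER_GROUP = 1000;

type QueryClause = Record<string, unknown>;

interface SearchBody {
  size: number; // caps how many documents come back
  track_total_hits: boolean; // asks for the exact total, not a lower bound
  query: { bool: { filter: QueryClause[]; must_not?: QueryClause[] } };
}

function buildSummaryQueries(
  start: string,
  end: string,
  limit: number = MAX_ALERTS_PER_GROUP
): Record<'new' | 'ongoing' | 'recovered', SearchBody> {
  const body = (filter: QueryClause[], mustNot: QueryClause[] = []): SearchBody => ({
    size: limit,
    track_total_hits: true,
    query: {
      bool: { filter, ...(mustNot.length > 0 ? { must_not: mustNot } : {}) },
    },
  });
  return {
    // kibana.alert.start inside the range
    new: body([{ range: { 'kibana.alert.start': { gte: start, lte: end } } }]),
    // started before the range and has no kibana.alert.end yet
    ongoing: body(
      [{ range: { 'kibana.alert.start': { lt: start } } }],
      [{ exists: { field: 'kibana.alert.end' } }]
    ),
    // kibana.alert.end inside the range
    recovered: body([{ range: { 'kibana.alert.end': { gte: start, lte: end } } }]),
  };
}
```

With this shape, each response would carry at most `limit` documents while the total hit count still reflects every matching alert, at the cost of three round trips instead of one scripted query.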

ymao1 · 2022-10-26T18:09:29Z

@ersin-erdal @mikecote The lifecycle executors write out a different alert document for a single alert if it goes from active to recovered, then active again. Do we want to consider that one "new" alert?

For example, if we're querying against the last day of data and we have an alert for host-1 that became active, then recovered, then became active again in the day, we would be getting 2 new alerts (with different UUIDs) and 1 recovered alert for the host-1 alertId. If we only considered unique alertIds, we would only get 1 new and 1 recovered for that time range.

mikecote · 2022-10-27T09:25:38Z

> @ersin-erdal @mikecote The lifecycle executors write out a different alert document for a single alert if it goes from active to recovered, then active again. Do we want to consider that one "new" alert?

Yes, we should consider each alert separate and return both. I'm thinking this is best so the summary matches the changes that happened to the alert documents.
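The host-1 scenario can be made concrete with a small sketch comparing the two ways of counting. The `FlappingAlertDoc` shape and its property names are simplified stand-ins chosen for illustration, not the real alerts-as-data field names, and every document is assumed to have started inside the queried range.

```typescript
// Simplified stand-in for an alert document (hypothetical property names).
interface FlappingAlertDoc {
  uuid: string; // unique per alert document
  alertId: string; // stable across active -> recovered -> active again
  end?: string; // set once this document's alert has recovered
}

// Counts new and recovered alerts per document and per unique alertId.
// Assumes every document passed in started inside the queried range.
function countAlerts(docs: FlappingAlertDoc[]) {
  const recovered = docs.filter((d) => d.end !== undefined);
  const unique = (list: FlappingAlertDoc[]) => new Set(list.map((d) => d.alertId)).size;
  return {
    perDocument: { new: docs.length, recovered: recovered.length },
    perAlertId: { new: unique(docs), recovered: unique(recovered) },
  };
}

// host-1 became active, recovered, then became active again within the day.
const hostOneDocs: FlappingAlertDoc[] = [
  { uuid: 'uuid-1', alertId: 'host-1', end: '2022-10-26T09:00:00Z' },
  { uuid: 'uuid-2', alertId: 'host-1' },
];
console.log(countAlerts(hostOneDocs));
```

Counting per document yields 2 new and 1 recovered, which is the behaviour agreed on above; counting per unique alertId would collapse that to 1 new and 1 recovered.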

This was referenced Oct 14, 2022

[meta] Alert Summaries #143200

Open

[Alert Summaries] Add summary capabilities to the API and execution logic #143376

Closed

mikecote moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Oct 17, 2022

ymao1 self-assigned this Oct 17, 2022

ymao1 moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Oct 17, 2022

mikecote mentioned this issue Oct 20, 2022

Limit the alerts-as-data fields available for alert summaries #143741

Closed

ymao1 mentioned this issue Oct 31, 2022

[Response Ops][Rule Registry] Add data provider to retrieve new, ongoing and recovered alerts from alerts-as-data #143466

Merged

1 task

ymao1 moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Oct 31, 2022

ymao1 closed this as completed in #143466 Nov 7, 2022

Repository owner moved this from In Review to Done in AppEx: ResponseOps - Execution & Connectors Nov 7, 2022
