[Alert Summaries] Add a data provider so the rule registry can provide alerts over a time span (or past execution) #143374

Closed
Tracked by #143200
ersin-erdal opened this issue Oct 14, 2022 · 10 comments · Fixed by #143466
Labels: Team:ResponseOps (label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@ersin-erdal
Contributor

ersin-erdal commented Oct 14, 2022

Meta: #143200

To add the alerts (as an output of the Alert Summaries feature) to the action context, we need to fetch them from the alerts-as-data index. Because the rule registry already writes the alerts-as-data documents, we can add a new provider there to read them as well.

We need a function attached to the rule type definition (automated by rule registry) that provides a search function to gather a series of alerts created on the last run or over a certain time span. The search function should be able to return new, ongoing and recovered alerts for a given time (last run or date range).

See #143200 for terminology of new, ongoing and recovered to help build queries.

@mikecote
Contributor

I'm hoping we have the necessary date fields in the alerts-as-data indices to make these queries:

  • New: based on the creation date
  • Ongoing: based on the last-detected date
  • Recovered: based on the recovery date

@ymao1
Contributor

ymao1 commented Oct 18, 2022

The rule registry already provides a RuleDataClient with a getReader() function that returns a search() function that takes a query. The search is pre-scoped to the alerts as data indices associated with the rule data client. I would like to find a way to expose this search function to the alerting framework so we don't have to recreate the logic of figuring out what index name to search for a specific rule type. Then we could have some query builder functions to build specific queries to be passed into the search function. WDYT of this approach @ersin-erdal @mikecote
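
A minimal sketch of this approach, assuming a reader shaped loosely like the one returned by getReader(). The `AlertsReader` interface and the `buildTimeRangeQuery` and `fetchAlertsInRange` helpers are hypothetical names for illustration; the real `search()` signature in the rule registry may differ.

```typescript
// Hypothetical, simplified stand-in for the reader returned by getReader().
// The real rule registry types are richer; this models only what the sketch needs.
interface SearchRequest {
  size: number;
  query: Record<string, unknown>;
}

interface AlertsReader {
  search(request: SearchRequest): Promise<{ hits: Array<Record<string, unknown>> }>;
}

// Query builder: selects alert documents written or updated within [start, end].
function buildTimeRangeQuery(start: Date, end: Date, size: number = 100): SearchRequest {
  return {
    size,
    query: {
      bool: {
        filter: [
          { range: { "@timestamp": { gte: start.toISOString(), lte: end.toISOString() } } },
        ],
      },
    },
  };
}

// The framework only needs the reader; index resolution stays inside the rule registry.
async function fetchAlertsInRange(reader: AlertsReader, start: Date, end: Date) {
  const response = await reader.search(buildTimeRangeQuery(start, end));
  return response.hits;
}
```

Keeping the query builders separate from the search function means the alerting framework never has to know which index backs a given rule type.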

@ymao1
Contributor

ymao1 commented Oct 18, 2022

It looks like the @timestamp for an alert document is set/updated to the task.startedAt time each time it is written out. So new alerts will set @timestamp: task.startedAt and then if the alert is active in the next run (ongoing), the @timestamp will get updated.

  • New alerts set kibana.alert.start to the @timestamp value.
  • Ongoing alerts keep the kibana.alert.start field the same.
  • Recovered alerts set kibana.alert.end to the @timestamp value.

With this combination of timestamps, I think we have enough to query for what we need.

I think for the previous run, this should be pretty straightforward.

For an arbitrary time range, we should be able to query the @timestamp using that range to get all alerts created/updated during that time range. Then based on the kibana.alert.start time, we can determine whether the alert was new or ongoing during that time range. Using the kibana.alert.end time, we can determine whether the alert recovered during that time range.

Does that align with the requirements?
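
As a rough illustration of the time-range case, the sketch below classifies a document that has already matched the @timestamp range. The `AlertDoc` shape and the `classifyAlert` helper are hypothetical; only the three field names come from the discussion above.

```typescript
// Minimal shape of an alert document; only the fields discussed above.
interface AlertDoc {
  "@timestamp": string;
  "kibana.alert.start": string;
  "kibana.alert.end"?: string;
}

type AlertState = "new" | "ongoing" | "recovered";

// Classify one alert document relative to the queried range [rangeStart, rangeEnd].
// Assumes the document was already matched on @timestamp within that range.
function classifyAlert(doc: AlertDoc, rangeStart: Date, rangeEnd: Date): AlertState {
  const inRange = (iso: string): boolean => {
    const t = Date.parse(iso);
    return t >= rangeStart.getTime() && t <= rangeEnd.getTime();
  };
  const end = doc["kibana.alert.end"];
  if (end !== undefined && inRange(end)) return "recovered";
  if (inRange(doc["kibana.alert.start"])) return "new";
  return "ongoing";
}
```

In this simplified version, an alert that both started and recovered within the range is reported only as recovered; whether it should also count as new is a separate decision.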

@mikecote
Contributor

mikecote commented Oct 18, 2022

> The rule registry already provides a RuleDataClient with a getReader() function that returns a search() function that takes a query. The search is pre-scoped to the alerts as data indices associated with the rule data client. I would like to find a way to expose this search function to the alerting framework so we don't have to recreate the logic of figuring out what index name to search for a specific rule type. Then we could have some query builder functions to build specific queries to be passed into the search function. WDYT of this approach @ersin-erdal @mikecote

I'm good with whatever is simplest. We'll be removing this code once FAAD is in place, so the easier option helps us. As for exposing this search to the framework, here's my idea, but don't feel obligated to use it:

  • Add an optional function to the TypeScript rule type definition (ex: getSummarizedAlerts(startTime, endTime))
  • In the rule registry, define the function to get the summarized alerts based on a given time range (or other necessary params) and apply it to all rule types (persistent and lifecycle). This approach keeps the RuleDataClient / search / query building within the rule registry for now.
  • Alert summaries only work for rule types that provide this function for now (i.e. those using rule registry)
  • The function returns three arrays (new, ongoing, recovered) where:
    • new has a kibana.alert.start within the range
    • ongoing has a kibana.alert.start before the startTime and no value for kibana.alert.end
    • recovered has a kibana.alert.end within the range
    • Duplicates among them are OK (e.g. an alert appears in both the "new" and "recovered" arrays)
  • endTime may not be necessary as a parameter, since the functionality always looks back from "now"
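
The three buckets above could be expressed as Elasticsearch-style filter clauses, roughly as follows. This is a sketch under assumptions: `buildSummaryQueries` and the `SummaryQueries` type are hypothetical names, and `endTime` defaults to "now" to reflect the note that it may not be needed as a parameter.

```typescript
// Elasticsearch-style filter clauses for the three buckets described above.
type Filter = Record<string, unknown>;

interface SummaryQueries {
  new: Filter;
  ongoing: Filter;
  recovered: Filter;
}

function buildSummaryQueries(startTime: Date, endTime: Date = new Date()): SummaryQueries {
  const gte = startTime.toISOString();
  const lte = endTime.toISOString();
  return {
    // new: kibana.alert.start falls within the range
    new: { bool: { filter: [{ range: { "kibana.alert.start": { gte, lte } } }] } },
    // ongoing: started before startTime and has no kibana.alert.end
    ongoing: {
      bool: {
        filter: [{ range: { "kibana.alert.start": { lt: gte } } }],
        must_not: [{ exists: { field: "kibana.alert.end" } }],
      },
    },
    // recovered: kibana.alert.end falls within the range
    recovered: { bool: { filter: [{ range: { "kibana.alert.end": { gte, lte } } }] } },
  };
}
```

Because the "new" and "recovered" clauses are independent, an alert that starts and recovers inside the range matches both, which is consistent with allowing duplicates across the arrays.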

> For an arbitrary time range, we should be able to query the @timestamp using that range to get all alerts created/updated during that time range. Then based on the kibana.alert.start time, we can determine whether the alert was new or ongoing during that time range. Using the kibana.alert.end time, we can determine whether the alert recovered during that time range.
>
> Does that align with the requirements?

Makes sense to me, it matches the mental model I have.

@ersin-erdal
Contributor Author

Should we think about capping the data?
For example, don't report more than 2000 alerts. If so, we also need to return the real number of alerts.

@ymao1
Contributor

ymao1 commented Oct 19, 2022

With the alert circuit breaker in place, which defaults to 1000, I believe the most we should return is 1000 new/active and 1000 recovered.

@ymao1
Contributor

ymao1 commented Oct 20, 2022

@ersin-erdal I wasn't thinking about the query per time range, which can return many more alerts.

Right now, I'm querying based on time range and then splitting up the results into new/ongoing/recovered. If we're capping the number of results returned but still want the actual number of new/ongoing/recovered, we would have to push the condition to ES via a scripted query. Are we ok with scripted queries? They tend to be less performant.

cc @mikecote

@mikecote
Contributor

Good catch, @ymao1! I think it would be valuable to get the total count while limiting how many new, ongoing and recovered alerts we return. You could also weigh multiple queries against a scripted query, in case one turns out to be easier or more performant. We can also apply the 1000 limit to each array (new, ongoing, recovered).
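
One way to picture the multiple-queries option is sketched below: one query per bucket, each capped at the limit but reporting its real total. The `SearchFn` callback, `toBucket`, and `summarize` are hypothetical names, and the callback is assumed to return a total count alongside the truncated hits.

```typescript
const MAX_ALERTS_PER_BUCKET = 1000; // per-array limit suggested in this thread

interface Bucket<T> {
  count: number; // total number of matching alerts
  alerts: T[]; // at most `limit` of them
}

// Hypothetical search callback: resolves to the total match count and up to `size` hits.
type SearchFn<T> = (
  query: Record<string, unknown>,
  size: number
) => Promise<{ total: number; hits: T[] }>;

// Keep the real total but truncate the returned documents.
function toBucket<T>(total: number, hits: T[], limit: number = MAX_ALERTS_PER_BUCKET): Bucket<T> {
  return { count: total, alerts: hits.slice(0, limit) };
}

// One query per bucket, run in parallel, instead of a single scripted query.
async function summarize<T>(
  search: SearchFn<T>,
  queries: {
    new: Record<string, unknown>;
    ongoing: Record<string, unknown>;
    recovered: Record<string, unknown>;
  },
  limit: number = MAX_ALERTS_PER_BUCKET
) {
  const [n, o, r] = await Promise.all([
    search(queries.new, limit),
    search(queries.ongoing, limit),
    search(queries.recovered, limit),
  ]);
  return {
    new: toBucket(n.total, n.hits, limit),
    ongoing: toBucket(o.total, o.hits, limit),
    recovered: toBucket(r.total, r.hits, limit),
  };
}
```

The trade-off is three round trips instead of one, in exchange for plain filter queries and accurate per-bucket totals.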

@ymao1
Contributor

ymao1 commented Oct 26, 2022

@ersin-erdal @mikecote The lifecycle executors write out a different alert document for a single alert if it goes from active to recovered, then active again. Do we want to consider that one "new" alert?

For example, if we're querying against the last day of data and we have an alert for host-1 that became active, then recovered, then became active again in the day, we would be getting 2 new alerts (with different UUIDs) and 1 recovered alert for the host-1 alertId. If we only considered unique alertIds, we would only get 1 new and 1 recovered for that time range.

@mikecote
Contributor

> @ersin-erdal @mikecote The lifecycle executors write out a different alert document for a single alert if it goes from active to recovered, then active again. Do we want to consider that one "new" alert?

Yes, we should consider each alert separate and return both. I'm thinking this is best so the summary matches the changes that happened to the alert documents.
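
To make the host-1 scenario concrete, here is a small sketch with made-up documents. The field names `kibana.alert.uuid` and `kibana.alert.instance.id`, the UUID values, and the timestamps are assumptions for illustration only.

```typescript
interface LifecycleAlertDoc {
  "kibana.alert.uuid": string; // assumed field: unique per alert document
  "kibana.alert.instance.id": string; // assumed field: the rule's alert ID, e.g. "host-1"
  "kibana.alert.start": string;
  "kibana.alert.end"?: string;
}

// Two documents for the same alert ID within one day (illustrative values).
const docs: LifecycleAlertDoc[] = [
  {
    // host-1 became active, then recovered
    "kibana.alert.uuid": "uuid-a",
    "kibana.alert.instance.id": "host-1",
    "kibana.alert.start": "2022-10-26T01:00:00Z",
    "kibana.alert.end": "2022-10-26T05:00:00Z",
  },
  {
    // host-1 became active again: a second document with a different UUID
    "kibana.alert.uuid": "uuid-b",
    "kibana.alert.instance.id": "host-1",
    "kibana.alert.start": "2022-10-26T09:00:00Z",
  },
];

// Counting per document: 2 new alerts and 1 recovered alert.
const newAlerts = docs; // both started within the day
const recoveredAlerts = docs.filter((d) => d["kibana.alert.end"] !== undefined);

// Counting per alert ID would collapse both documents into a single entry.
const uniqueAlertIds = new Set(docs.map((d) => d["kibana.alert.instance.id"]));
```

Counting per document, as agreed above, keeps the summary aligned with the changes actually written to the alert documents.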

Projects
No open projects
3 participants