[Discuss] Alerting - determining time length of firing actions #55734

Closed
pmuellr opened this issue Jan 23, 2020 · 5 comments
Labels
Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@pmuellr
Member

pmuellr commented Jan 23, 2020

We've talked about being able to determine the length of time that an alert instance has been "firing". It's not quite clear how to do this, at least in my head. Part of the complication is that we only know an alert instance is "firing" because the alert type calls:

services.alertInstanceFactory(instance).scheduleActions(groupName, data) 

If the alert type does not call scheduleActions() on an alert instance (i.e., on services.alertInstanceFactory(instance)), then the alert instance is considered "not firing". But we don't currently make that state accessible externally.

So, if you end up looking in the event log for what instances are firing, you'll only see log entries for "firing" cases, and there's nothing in the log when they are "not firing". We'd have to notice a "gap", which is of course hard.
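
To make the "firing" rule concrete, here's a minimal sketch. SketchAlertInstance, hasScheduledActions() and firingIds() are stand-ins invented for illustration, not the actual alerting classes; the only thing taken from the description above is that an instance counts as firing if scheduleActions() was called on it during the run.

```typescript
// Hypothetical stand-in for an alert instance; all names here are illustrative.
class SketchAlertInstance {
  private scheduled?: { group: string; data: Record<string, unknown> };

  // Called by the alert type when the instance should fire.
  scheduleActions(group: string, data: Record<string, unknown> = {}): this {
    this.scheduled = { group, data };
    return this;
  }

  // "Firing" simply means scheduleActions() was called during this run.
  hasScheduledActions(): boolean {
    return this.scheduled !== undefined;
  }
}

// Returns the ids of instances that fired; the rest are implicitly "not firing".
function firingIds(all: Record<string, SketchAlertInstance>): string[] {
  return Object.keys(all).filter((id) => all[id].hasScheduledActions());
}

const sketchInstances: Record<string, SketchAlertInstance> = {
  'host-1': new SketchAlertInstance(),
  'host-2': new SketchAlertInstance(),
};
sketchInstances['host-1'].scheduleActions('default', { cpu: 0.97 });

const firing = firingIds(sketchInstances);
```

Here only host-1 ends up in firing; host-2 leaves no trace at all, which is the "gap" problem when reading the event log.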

@pmuellr
Member Author

pmuellr commented Jan 23, 2020

Some references; where alert instances get created:

export function createAlertInstanceFactory(alertInstances: Record<string, AlertInstance>) {
  return (id: string): AlertInstance => {
    if (!alertInstances[id]) {
      alertInstances[id] = new AlertInstance();
    }
    return alertInstances[id];
  };
}

Alert Instance shape: https://github.com/elastic/kibana/blob/master/x-pack/legacy/plugins/alerting/server/alert_instance/alert_instance.ts

One thought is that we can store the date when an alert instance is created in the alert instance metadata; it looks like the date we currently store is a last-updated date. Not sure whether we'd change that date if the scheduled action group changed (e.g., went from "warning" to "on fire!") or not.

That would make a concept of "how long has it been firing" available in the alert instance.
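
A rough sketch of that idea, modeled on the factory above. TimedAlertInstance, createTimedFactory and the injectable clock are hypothetical names for illustration, not the real AlertInstance shape. The sketch also assumes the creation date is never reset when the action group changes, which is exactly the open question.

```typescript
// Hypothetical instance that remembers when it was first created.
class TimedAlertInstance {
  constructor(public readonly createdAt: Date) {}

  // How long the instance has existed (a proxy for "firing") as of `now`.
  firingDurationMs(now: Date): number {
    return now.getTime() - this.createdAt.getTime();
  }
}

// Same shape as the factory above, plus an injectable clock for testing.
function createTimedFactory(
  tracked: Record<string, TimedAlertInstance>,
  clock: () => Date = () => new Date()
) {
  return (id: string): TimedAlertInstance => {
    if (!tracked[id]) {
      // Only stamped on first creation; later lookups keep the original date.
      tracked[id] = new TimedAlertInstance(clock());
    }
    return tracked[id];
  };
}

const trackedInstances: Record<string, TimedAlertInstance> = {};
const createdAt = new Date('2020-01-23T10:00:00Z');
const timedFactory = createTimedFactory(trackedInstances, () => createdAt);

const firstLookup = timedFactory('host-1');
const secondLookup = timedFactory('host-1');
const elapsedMs = secondLookup.firingDurationMs(new Date('2020-01-23T10:05:00Z'));
```

Both lookups return the same object, so the duration is measured from the first time the instance appeared.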

We probably want something in the event log as well, though. When we log an event for the scheduleActions() call, we should probably include the "start date" in that message as well, so every event log entry for a firing instance would include that date.

We could also log events for "this alert instance just started firing" and "this alert instance just stopped firing".
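
The started/stopped events could be derived by comparing the set of firing instance ids from the previous run with the current one. A minimal sketch, assuming we have both id lists at hand; TransitionEvent, the 'started-firing' / 'stopped-firing' names and diffFiring() are made up for illustration and are not existing event log actions.

```typescript
type Transition = 'started-firing' | 'stopped-firing';

// Hypothetical event shape; not the real event log schema.
interface TransitionEvent {
  instanceId: string;
  transition: Transition;
  timestamp: string;
}

// Compares the ids that fired in the previous run with the ids firing now.
function diffFiring(previousIds: string[], currentIds: string[], at: Date): TransitionEvent[] {
  const previous = new Set(previousIds);
  const current = new Set(currentIds);
  const timestamp = at.toISOString();

  // Firing now but not before: just started.
  const started = currentIds
    .filter((id) => !previous.has(id))
    .map((id): TransitionEvent => ({ instanceId: id, transition: 'started-firing', timestamp }));

  // Firing before but not now: just stopped.
  const stopped = previousIds
    .filter((id) => !current.has(id))
    .map((id): TransitionEvent => ({ instanceId: id, transition: 'stopped-firing', timestamp }));

  return [...started, ...stopped];
}

const transitions = diffFiring(
  ['host-1', 'host-2'],
  ['host-2', 'host-3'],
  new Date('2020-01-30T12:00:00Z')
);
```

With these inputs, host-3 gets a started event, host-1 gets a stopped event, and host-2 (firing in both runs) produces nothing.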

We could do all that :-) . Not sure what will end up working best for consumers of this data tho.

@pmuellr pmuellr added Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jan 23, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member Author

pmuellr commented Jan 30, 2020

Had a kind of rando thought on this, that ties in with the theoretic "resolved" state we'd like to provide to clients.

That is, provide a free "resolved" action group for every alertType. An alertType won't have to use this directly, we'd use it when we realize an alertInstance we're tracking had no other actions scheduled, and we're going to remove it from our internal list. We could also allow alertTypes to use this directly, and I think in that case we'd also want to remove the alertInstance that we're tracking.

So, once we have a resolved action group in the event log for an instance id, we should be able to determine the duration, since we'd be able to use that as the "end" time (or lack of "end" time - it's still triggering!). The "start" time would be the first non-"resolved" action group after the previous "resolved" actionGroup. Not sure how easy this will be to calculate, in general, or if ES could provide a handy search for such things.
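
For what it's worth, the client-side calculation could look something like the sketch below, assuming the event log entries for an instance have already been fetched. LogEntry, FiringSpan and firingSpans() are hypothetical names, and the 'resolved' group name is just the one proposed above. Whether ES can do this in a single query is still an open question; this only shows the pairing logic.

```typescript
// Hypothetical, simplified view of an event log entry.
interface LogEntry {
  instanceId: string;
  actionGroup: string;
  timestamp: string; // ISO 8601
}

// One continuous period of firing; `end` is absent while still firing.
interface FiringSpan {
  instanceId: string;
  start: string;
  end?: string;
}

function firingSpans(entries: LogEntry[], instanceId: string): FiringSpan[] {
  const sorted = entries
    .filter((e) => e.instanceId === instanceId)
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));

  const spans: FiringSpan[] = [];
  let open: FiringSpan | undefined;

  for (const entry of sorted) {
    if (entry.actionGroup === 'resolved') {
      // A "resolved" entry closes the currently open span, if any.
      if (open) {
        open.end = entry.timestamp;
        spans.push(open);
        open = undefined;
      }
    } else if (!open) {
      // First non-"resolved" entry after a "resolved" one starts a new span.
      open = { instanceId, start: entry.timestamp };
    }
  }
  if (open) {
    spans.push(open); // still firing: no end time yet
  }
  return spans;
}

const logEntries: LogEntry[] = [
  { instanceId: 'host-1', actionGroup: 'default', timestamp: '2020-01-30T10:00:00Z' },
  { instanceId: 'host-1', actionGroup: 'default', timestamp: '2020-01-30T10:01:00Z' },
  { instanceId: 'host-1', actionGroup: 'resolved', timestamp: '2020-01-30T10:02:00Z' },
  { instanceId: 'host-1', actionGroup: 'warning', timestamp: '2020-01-30T10:10:00Z' },
];

const spans = firingSpans(logEntries, 'host-1');
const firstSpanMs = Date.parse(spans[0].end as string) - Date.parse(spans[0].start);
```

The first span is closed by the "resolved" entry; the second has no end, meaning the instance is still triggering.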

@ymao1
Contributor

ymao1 commented Mar 4, 2021

@pmuellr Is this issue still relevant after the addition of the recovered action group?

@pmuellr
Member Author

pmuellr commented Apr 23, 2021

Ya, this isn't really relevant anymore, for the most part. But we can apply more precision than we currently are with the event log, by adding the running duration to each active-instance event log doc we write.

See issue #93704 for more details
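
As a rough illustration of the "running duration" idea, a doc written for an active instance could carry both the start date and the elapsed time. The field names and buildActiveInstanceDoc() below are assumptions for the sketch, not the actual event log schema.

```typescript
// Hypothetical doc shape; the real event log fields may differ.
interface ActiveInstanceDoc {
  instanceId: string;
  action: 'active-instance';
  timestamp: string;
  startedAt: string;
  durationMs: number;
}

// Builds a doc for an instance that has been active since `startedAt`.
function buildActiveInstanceDoc(instanceId: string, startedAt: Date, now: Date): ActiveInstanceDoc {
  return {
    instanceId,
    action: 'active-instance',
    timestamp: now.toISOString(),
    startedAt: startedAt.toISOString(),
    durationMs: now.getTime() - startedAt.getTime(),
  };
}

const activeDoc = buildActiveInstanceDoc(
  'host-1',
  new Date('2021-04-23T09:00:00Z'),
  new Date('2021-04-23T09:01:30Z')
);
```

With the duration on every doc, a consumer can read it straight off the latest entry instead of pairing start and end events.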

@pmuellr pmuellr closed this as completed Apr 23, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022

4 participants