
[Alerting][Event Log] Consider adding uuid to active alert spans #101749

Closed
ymao1 opened this issue Jun 9, 2021 · 7 comments
Labels
discuss, estimate:needs-research, Feature:EventLog, impact:high, insight, Team:ResponseOps

Comments

ymao1 (Contributor) commented Jun 9, 2021

For this issue, we added start/duration/end times to the *-instance actions in the event log and considered adding a uuid to identify unique active spans for an alert. We decided to hold off after reviewing what SIEM and RAC were doing for this and how they are using event.id.

Currently, the lifecycle rule type in the rule registry is doing something similar but storing it in the kibana.rac.alert.uuid field. SIEM is using event.id to store the original source document id when a source document is copied into the signals index. When the signal generated is an aggregate over multiple source documents, the event.id field is not populated.

Given these other usages, do we want to add a uuid field to identify active alert spans? If we do, should we use the event.id field to store it? Or consolidate it with a RAC field?
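To make the question concrete, the sketch below shows roughly where a span uuid could live on an active-instance event log document. It is illustrative only: instance_span_uuid is a made-up field name, and the real field could end up being event.id or a RAC field instead.

// Illustrative shape only; kibana.alerting.instance_span_uuid is a hypothetical field name.
const activeInstanceEvent = {
  event: {
    provider: 'alerting',
    action: 'active-instance',
    start: '2021-06-09T00:00:00.000Z',
    duration: 300000000000, // nanoseconds the alert has been active so far
  },
  kibana: {
    alerting: {
      instance_id: 'host-1',
      // One uuid per active span: minted when the alert first becomes active,
      // repeated on every execution until the alert recovers.
      instance_span_uuid: '4e4f9f16-93f2-45de-9afb-2597016a6b04',
    },
  },
};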

@ymao1 ymao1 added Feature:EventLog Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jun 9, 2021
elasticmachine (Contributor) commented:

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris gmmorris added Project:ObservabilityOfAlerting Alerting team project for observability of alerting. and removed Project:ObservabilityOfAlerting Alerting team project for observability of alerting. labels Jun 30, 2021
pmuellr (Member) commented Jul 20, 2021

I'm hesitant to use event.id for this, since I don't know its purpose and it seems fairly "global". I was thinking of something in rule. or kibana.alerting; a field in rule would be best, if we can agree on one there - but maybe there's no good fit.

Currently it feels like alerting should be creating the UUID for the new "span" of alerts and then making it available to the rule registry somehow, for its uses. Not quite sure yet how we'll thread the value through, but you can see where the changes would go for RAC, around the following code. This is where the rule executor is actually invoked, and that code will be calling scheduleActions() - the alert UUIDs should have been generated by the time the executor has returned, and be made available to the rule registry framework.

// Invoke the wrapped rule executor; lifecycleAlertServices wraps the alert
// services (including scheduleActions()) so the lifecycle wrapper can see
// which alerts were reported during this execution.
const nextWrappedState = await wrappedExecutor({
  ...options,
  state: state.wrapped != null ? state.wrapped : ({} as State),
  services: {
    ...options.services,
    ...lifecycleAlertServices,
  },
});

// Compare the alerts reported in this execution against the alerts already
// tracked in rule state to find which ones are new.
const currentAlertIds = Object.keys(currentAlerts);
const trackedAlertIds = Object.keys(state.trackedAlerts);
const newAlertIds = currentAlertIds.filter((alertId) => !trackedAlertIds.includes(alertId));
const allAlertIds = [...new Set(currentAlertIds.concat(trackedAlertIds))];
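Continuing directly from the snippet above, and assuming a hypothetical getInstances() method on the alert services (sketched further down in this thread; it does not exist today), the wrapper could then pick up the alerting-generated uuids instead of minting its own:

// Sketch only: getInstances() is a proposed method (alert instance id -> span uuid).
const instanceUuids: Map<string, string> = lifecycleAlertServices.getInstances();

const nextTrackedAlerts = Object.fromEntries(
  allAlertIds.map((alertId) => [
    alertId,
    {
      ...state.trackedAlerts[alertId],
      // Prefer the uuid alerting generated when the span started; keep the
      // previously tracked uuid for alerts that were already active.
      alertUuid: instanceUuids.get(alertId) ?? state.trackedAlerts[alertId]?.alertUuid,
    },
  ])
);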

@mikecote mikecote added the loe:needs-research This issue requires some research before it can be worked on or estimated label Jul 26, 2021
pmuellr (Member) commented Jul 27, 2021

Taking another peek at this. Looks like RAC creates the UUIDs for lifecycle alerts, here:

// From the rule registry's lifecycle executor: reuse the tracked alert's uuid
// if one exists, otherwise mint a new one (v4() comes from the uuid package).
const { alertUuid, started } = state.trackedAlerts[alertId] ?? {
  alertUuid: v4(),
  started: timestamp,
};

So it appears the UUIDs are created after running the executor, so I think we can create/manage the UUIDs when scheduleActions() is run (we'd need to deal with unscheduleActions() or any other mutators), and then arrange to return that data from a new method on AlertServices, which could be called from the RAC wrapper. For example, something like:

interface AlertServices {
  // ...
  getInstances(): Map<string, string>; // key: existing alert instance ID; value: new alert instance UUID
}
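As a minimal sketch of the alerting side of that idea (everything here other than the scheduleActions()/unscheduleActions() naming is illustrative, not existing code), the uuid could be minted the first time scheduleActions() runs for an instance and handed back through getInstances():

import { v4 as uuidv4 } from 'uuid';

// Illustrative only: a toy registry showing how alerting could mint one uuid per
// active span and expose the mapping through the proposed getInstances() method.
class AlertInstanceUuidRegistry {
  private readonly uuidsByInstanceId = new Map<string, string>();

  constructor(previouslyActive: Record<string, string> = {}) {
    // Instances continuing an existing span keep that span's uuid.
    for (const [instanceId, uuid] of Object.entries(previouslyActive)) {
      this.uuidsByInstanceId.set(instanceId, uuid);
    }
  }

  // Called whenever scheduleActions() runs for an instance: new instances get a
  // fresh uuid, instances already in an active span keep theirs.
  onScheduleActions(instanceId: string): string {
    let uuid = this.uuidsByInstanceId.get(instanceId);
    if (uuid == null) {
      uuid = uuidv4();
      this.uuidsByInstanceId.set(instanceId, uuid);
    }
    return uuid;
  }

  // Called if unscheduleActions() (or another mutator) removes the instance.
  onUnscheduleActions(instanceId: string): void {
    this.uuidsByInstanceId.delete(instanceId);
  }

  // The shape proposed above: key = alert instance id, value = span uuid.
  getInstances(): Map<string, string> {
    return new Map(this.uuidsByInstanceId);
  }
}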

pmuellr (Member) commented Jul 27, 2021

Happened to remember a similar issue we had open a while back: #64268

For that one, we realized that some rule types were already using UUIDs as their instance ids, so we thought we should add a new human-readable field to associate with an instance. I think that ship has sailed at this point, since we now have an "official" UUID - we should continue to aim to make the alert instance IDs human readable. But we may need to revisit that over time; perhaps adding an explicit "description" to these alert instances would make sense later.

@gmmorris gmmorris added insight Issues related to user insight into platform operations and resilience estimate:needs-research Estimated as too large and requires research to break down into workable issues labels Aug 13, 2021
@gmmorris gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021
gmmorris (Contributor) commented:

It's worth noting that without this there is actually no way of using the span as part of a dedup key in connectors such as PagerDuty.

This means that a customer can't set up actions on a rule so that they get a new incident whenever a specific alert ID reappears (so, for instance, get a new incident whenever the CPU exceeds 90% on Host #1, rather than reopening the incident from the last time it exceeded 90%).

This feels like a relatively basic missing feature.
What do you think @arisonl & @mikecote ?

@gmmorris gmmorris added the impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. label Oct 11, 2021
mikecote (Contributor) commented:

I agree; allowing access to some span ID would make it possible to mimic alerts-as-data on an external system and create new incidents whenever an alert comes back.

@arisonl should this even become the default dedup key? instead of {ruleId}:{alertId} it becomes {ruleId}:{spanId}?
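For illustration only (this helper is hypothetical, not existing connector or framework code), the difference between the two schemes amounts to:

// Hypothetical helper: builds a dedup key for a connector such as PagerDuty.
function buildDedupKey(ruleId: string, alertId: string, spanUuid?: string): string {
  // Current scheme: the same alert id always maps to the same incident, so a
  // recovered-and-refired alert reopens the previous incident.
  const perAlert = `${ruleId}:${alertId}`;

  // Proposed scheme: include the span uuid so each new active span of the same
  // alert id opens a fresh incident.
  return spanUuid != null ? `${ruleId}:${spanUuid}` : perAlert;
}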

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
pmuellr (Member) commented Jul 10, 2024

We have since added kibana.alert.uuid as a unique identifier of an alert from when it is created until it recovers.
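For example (the index pattern here is illustrative; the exact alerts index depends on the solution), all documents belonging to one alert's active span can be fetched by that field:

// Sketch: Elasticsearch query for the documents of a single active alert span.
const searchRequest = {
  index: '.alerts-*',
  query: {
    term: { 'kibana.alert.uuid': '4e4f9f16-93f2-45de-9afb-2597016a6b04' },
  },
};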

@pmuellr pmuellr closed this as completed Jul 10, 2024