
[Alerting] Write executionStatus property to kibana event log #79785

Closed
dhurley14 opened this issue Oct 6, 2020 · 6 comments · Fixed by #82401
Labels: Feature:Alerting, Team:ResponseOps

Comments

@dhurley14
Contributor

Describe the feature:

The executionStatus property on alerting saved objects (introduced here #75553) is a view into the current execution status of a Kibana alert. It would be nice if each executionStatus were written to the Kibana event log index .kibana-space-event-log-8.0.0 so we could query it for historical purposes.

Describe a specific use case for the feature:

The security solution currently keeps track of failures in a list-like structure of saved objects. With the addition of the executionStatus property to Kibana alerts, we now have to manage merging each executionStatus into our rule status failure tracking system. It would be nice to have a separate place to query for historical executions of Kibana alerts rather than having to pull it directly off of the alert.

@dhurley14 added the Team:ResponseOps label Oct 6, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Oct 6, 2020

This is a great idea. We'll need to find a place in ECS where we can put this, or add a new extension field. I think we'd want to support the status, error.reason, and error.message fields; the date is redundant, since the event doc is built at the same time as the execution status and the event doc already has a timestamp. But it might be easier to duplicate the entire structure. Not sure.
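
A rough sketch of the mapping being discussed, with illustrative (not final) field names and a simplified executionStatus type:

// Illustrative only: how the executionStatus fields named above might be
// copied onto an event log doc. The target field names are placeholders,
// not a final ECS mapping.
interface AlertExecutionStatus {
  status: 'ok' | 'active' | 'error';
  error?: { reason: string; message: string };
}

interface EventDoc {
  '@timestamp': string;
  event: { provider: string; action: string; outcome?: string; reason?: string };
  kibana?: { alerting?: { status?: string } };
  error?: { message?: string };
}

function applyExecutionStatus(event: EventDoc, executionStatus: AlertExecutionStatus): EventDoc {
  // copy the status; the executionStatus date is dropped because the event
  // doc already carries its own @timestamp
  event.kibana = { ...event.kibana, alerting: { status: executionStatus.status } };
  if (executionStatus.error) {
    event.event.reason = executionStatus.error.reason; // e.g. 'decrypt'
    event.error = { message: executionStatus.error.message };
  }
  return event;
}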

@mikecote assigned mikecote and pmuellr and unassigned mikecote Oct 13, 2020
@pmuellr
Member

pmuellr commented Oct 22, 2020

Here are the locations of some of the relevant spots in the code for this:

in run(), executionStatus is updated in the alert SO:

try {
  // persist the updated executionStatus on the alert saved object
  await partiallyUpdateAlert(client, alertId, attributes, {
    ignore404: true,
    namespace,
  });
} catch (err) {
  this.logger.error(
    `error updating alert execution status for ${this.alertType.id}:${alertId} ${err.message}`
  );
}

in executeAlertInstances(), the event for the alert execute action is logged:

eventLogger.stopTiming(event);
event.message = `alert executed: ${alertLabel}`;
event.event = event.event || {};
event.event.outcome = 'success';
eventLogger.logEvent(event);

and the call tree looks like:

  • run()
    • loadAlertAttributesAndRun()
      • validateAndExecuteAlert()
        • executeAlertInstances()

However, loadAlertAttributesAndRun() is called in run() (and thus the event doc is written) before the code in run() that calculates the execution status. So it will require refactoring some bits to get the execution status calculated before the event doc is written.
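
A very rough sketch of the ordering that refactor is after; none of these names are the real task runner functions:

// Hypothetical sketch only - the real run() / loadAlertAttributesAndRun()
// flow is structured differently, as the call tree above shows.
type ExecutionStatus = { status: 'ok' | 'active' | 'error'; reason?: string };

async function runSketch(
  executeAlert: () => Promise<{ activeInstanceCount: number }>,
  logEvent: (doc: Record<string, unknown>) => void,
  updateAlertSavedObject: (status: ExecutionStatus) => Promise<void>
) {
  const startedAt = new Date().toISOString();
  let executionStatus: ExecutionStatus;
  try {
    const { activeInstanceCount } = await executeAlert();
    executionStatus = { status: activeInstanceCount > 0 ? 'active' : 'ok' };
  } catch (err) {
    // failures before or during execution (e.g. a decrypt error) still
    // produce an execution status
    executionStatus = {
      status: 'error',
      reason: err instanceof Error ? err.message : String(err),
    };
  }

  // the event doc is only written once the execution status is known, so it
  // can carry the status / error reason
  logEvent({
    '@timestamp': startedAt,
    event: {
      action: 'execute',
      outcome: executionStatus.status === 'error' ? 'failure' : 'success',
    },
    kibana: { alerting: { status: executionStatus.status } },
  });

  // then persist the status onto the alert saved object, as run() does today
  await updateAlertSavedObject(executionStatus);
}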

@pmuellr
Member

pmuellr commented Oct 22, 2020

Ya, looking to see how to refactor to do this and ... "it's complicated". :-)

One thing that would be straightforward to do is to add an indication of "ok" | "active" - i.e., there are no active instances | there are active instances. But at that point, it might as well be a field indicating the number of active instances, which would be 0 for an alert status of ok and > 0 for an alert status of active. That provides more "precise" data.

Note that the other interesting case from the alert execution status is the error conditions, but errors will already be reported in the event anyway. The event won't have the reason (like decrypt) that the alert execution status has, but it's not clear to me that that's very important.

@dhurley14 thoughts? The use case described is to get failure information from the event log. I think some "errors" won't show up today in the event log; the alerting:execute event only gets logged when the executor actually runs, so on a decrypt error, I'm guessing there won't be an event log doc currently.

If that's the case, another option is to generate a new type of event that would basically be for "we wanted to run an alert, but before we could even try, there was an error, and this is what it was".

@dhurley14
Contributor Author

dhurley14 commented Oct 28, 2020

alerting:execute event only gets logged when the executor actually runs, so on a decrypt error, I'm guessing there won't be an event log doc currently.

Yeah, this is what we've noticed: the decrypt errors aren't showing up in the event log.

If that's the case, another option is to generate a new type of event that would basically be for "we wanted to run an alert, but before we could even try, there was an error, and this is what it was".

I think focusing on the "errors" piece of this is the more important part from the security solution perspective. Knowing via the "ok / active" statuses that there are long-running rules that never seem to complete would be great too, but I think the priority is to have some queryable log of failures for the rules. We keep track of the "last five failures" that occur within the functions we run in our alert executor, stored as a saved object separate from the rule; being able to integrate historical failures from the event log with our custom "last five failures" queue would be a nice-to-have.
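
For illustration, the integration we have in mind is roughly a merge like the following (types and field names are made up for the example):

// Illustrative only: merge our "last five failures" queue with failure docs
// pulled from the event log, newest first, capped at five entries.
interface FailureEntry {
  timestamp: string; // ISO 8601 date
  message: string;
}

function mergeFailures(
  lastFiveFailures: FailureEntry[],
  eventLogFailures: FailureEntry[],
  limit = 5
): FailureEntry[] {
  return [...lastFiveFailures, ...eventLogFailures]
    .sort((a, b) => b.timestamp.localeCompare(a.timestamp)) // newest first
    .slice(0, limit);
}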

@pmuellr
Member

pmuellr commented Oct 28, 2020

At this point I'm wondering if an easy route is a new event type (i.e., event action) for alerts called error, which we could use to indicate errors that aren't conveniently handled by things like the execute action. It would mean that looking for errors involves a more involved search in the event log: looking for both alerting:execute docs that have an error indicator AND alerting:error docs.

I feel like we'll need something like this eventually anyway - there are too many things outside of the execution of alerts that can have "problems" that we have no way of reporting via the event log, and this would be a way of getting them in.

I wanna take another look at getting this into the execute action though as well - it seems like we should be able to make this work somehow, and it is associated with the execution.
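
For illustration, that "more involved search" might look roughly like the query body below; the error action name is hypothetical at this point, and the event.provider / event.action / event.outcome fields are assumed from the existing event log ECS fields.

// Illustrative Elasticsearch query body: failed alerting:execute docs, plus
// docs for the proposed (hypothetical) alerting:error event type.
const failureQuery = {
  bool: {
    filter: [{ term: { 'event.provider': 'alerting' } }],
    should: [
      {
        bool: {
          filter: [
            { term: { 'event.action': 'execute' } },
            { term: { 'event.outcome': 'failure' } },
          ],
        },
      },
      { term: { 'event.action': 'error' } }, // the hypothetical new event type
    ],
    minimum_should_match: 1,
  },
};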

pmuellr added a commit to pmuellr/kibana that referenced this issue Nov 10, 2020
resolves elastic#79785

Until now, the execution status was not available in the event
log document for the execute action. In this PR we add it.

The event log is extended to add the following fields:

- `kibana.alerting.status` - from executionStatus.status
- `event.reason`           - from executionStatus.error.reason

The date from the executionStatus and start date in the event
log will be set to the same value.

Previously, errors encountered while trying to execute an
alert executor, eg decrypting the alert, would not end up
with an event doc generated.  Now they will.

In addition, there were a few places where events that could
have had the action group in them did not, and one where the
instance id was undefined - those were fixed up.
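
For illustration, an execute event for a failed run would then carry something like the following (values are invented; only kibana.alerting.status and event.reason are the fields this PR adds):

// Invented example document - not copied from a real event log index.
const exampleExecuteEvent = {
  '@timestamp': '2020-11-10T12:00:00.000Z',
  event: {
    provider: 'alerting',
    action: 'execute',
    outcome: 'failure',
    reason: 'decrypt', // from executionStatus.error.reason
  },
  kibana: {
    alerting: {
      status: 'error', // from executionStatus.status
    },
  },
  message: 'example: alert execution failed',
};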
pmuellr added a commit that referenced this issue Nov 12, 2020, with the same commit message as above.

pmuellr added a commit to pmuellr/kibana that referenced this issue Nov 12, 2020, with the same commit message as above.

pmuellr added a commit that referenced this issue Nov 12, 2020 (#83289), with the same commit message as above.
@kobelb added the needs-team label Jan 31, 2022
@botelastic bot removed the needs-team label Jan 31, 2022