
Alert statuses #51099

Closed
mikecote opened this issue Nov 19, 2019 · 17 comments · Fixed by #75553
Labels: discuss, Feature:Alerting, Team:ResponseOps

Comments

@mikecote
Contributor

To enrich the user experience within the alerts table (under the Kibana Management section), we should display the status of each alert.

To make sure we're on the same page about what alert statuses we should have, I've opened this issue for discussion. The UI would display the status as a column within the alerts table, and there would be a filter for it. The statuses would be calculated on read, based on the results of a few queries (activity log, alert instances, etc.).

As a starting point, the mockups contain four potential statuses:

  • Active: The alert is actively firing
  • OK: The alert is running periodically and not firing anything
  • Error: The alert is throwing errors during execution
  • No Data: I'm thinking this is when the alert hasn't run yet?

Is there any proposal for different statuses?
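Purely as an illustration (not a committed API), the four statuses from the mockups could be modeled as a simple TypeScript union with display labels:

```ts
// Illustrative only: the four mockup statuses as a union, plus display labels.
type AlertStatus = 'active' | 'ok' | 'error' | 'no-data';

const statusLabels: Record<AlertStatus, string> = {
  active: 'Active: the alert is actively firing',
  ok: 'OK: the alert is running periodically and not firing',
  error: 'Error: the alert is throwing errors during execution',
  'no-data': 'No Data: the alert has not run yet (exact meaning under discussion)',
};
```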

cc @elastic/kibana-stack-services @alexfrancoeur @peterschretlen

@peterschretlen
Contributor

I think those 4 are sufficient and having a small number is preferable.

No Data will depend on the alert type, but I think for a timeseries metric it would mean there are no data points in the period being checked (which can happen if a Beat is removed or stops sending data, for example).

Say for a CPU usage alert: if none of my Metricbeat instances have sent data for an hour, and my alert is "when avg CPU is above 90% over the last 5 minutes", there'd be no documents in Elasticsearch and I would expect this to show the "No Data" state.

@pmuellr
Member

pmuellr commented Nov 20, 2019

No Data sounds like:

  • from @mikecote: the alert type function has not yet run for this alert
  • from @peterschretlen: the meaning depends on the alert type; the alert type function may have run, but not done anything "semantically" because it hasn't gotten enough data yet

Both are actually interesting, but as far as I'm aware we don't have a mechanism that allows an alert type to return a "No Data" condition in Peter's sense.

I'd say get rid of No Data for now, or change to something like has not run yet (Mike's definition). No Data sounds a bit confusing and vague to me.

For the remaining, how do we determine these values - the last state when the alert function ran? It either threw an error (Error), ran but scheduled no actions (OK), or ran and scheduled actions (Active). Just the last state seen? If so, perhaps storing that in the alert itself would be appropriate.

Presumably things like muted and throttled show up in a separate column/icon/property indicating those states, so that kind of state isn't appropriate for this "status".
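A rough sketch of that mapping, from the outcome of the last run to a status value (the shapes and names here are hypothetical, just to make the rule concrete):

```ts
// Hypothetical summary of an alert's last run.
interface LastRunOutcome {
  threwError: boolean;          // the executor threw
  scheduledActionCount: number; // how many actions the run scheduled
}

type ExecutionStatus = 'ok' | 'active' | 'error';

// Last run threw -> Error; ran and scheduled actions -> Active; otherwise OK.
function statusFromLastRun(run: LastRunOutcome): ExecutionStatus {
  if (run.threwError) return 'error';
  return run.scheduledActionCount > 0 ? 'active' : 'ok';
}
```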

@peterschretlen
Contributor

If someone is authoring an alert, what do we expect them to do in the case where they don't have enough data to evaluate the condition? Throw an error? Return and treat it as normal?

No data / missing data is a pretty common scenario and I think it's an important cue. Data often arrives late, and it's not really an error, but I wouldn't consider it OK either. Some systems will also let you notify on no data. A few examples:

If we don't treat it as a state here, we need to account for it somewhere. I understand if we don't have a mechanism for it, but we could create one. It could be an expected type of error, for example, thrown by an alert execution?

@mikecote
Contributor Author

One option I can see for adding a mechanism to handle the "no data" scenario is to change the return structure of the alert type executor.

Currently it returns something like this:

```ts
return {
  // my updated alert level state
};
```

and we could change it to something like this:

```ts
return {
  noData: true,
  state: {
    // my updated alert level state
  },
};
```

This should be fairly straightforward to do, and it would be more future-proof if we ever want to return more attributes than state from the executor.

Other options instead of noData: true could be status: 'no-data' or something like that.
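A minimal sketch of what that executor contract could look like (the type names and fields here are illustrative, not the framework's actual API):

```ts
// Illustrative only: a possible return shape for the alert type executor.
interface AlertExecutorResult<State> {
  noData?: boolean; // or alternatively: status?: 'no-data'
  state: State;     // the updated alert-level state
}

// Hypothetical executor for an alert type that can detect "no data".
async function executor(): Promise<AlertExecutorResult<{ lastValue?: number }>> {
  const values = await fetchValuesForWindow(); // placeholder for the alert type's query
  if (values.length === 0) {
    return { noData: true, state: {} };
  }
  return { state: { lastValue: values[0] } };
}

// Placeholder data fetch so the sketch is self-contained.
async function fetchValuesForWindow(): Promise<number[]> {
  return [];
}
```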

@mikecote
Contributor Author

mikecote commented Nov 21, 2019

For the remaining, how do we determine these values - the last state when the alert function ran?

The way I see it, yes, it would be based on the last execution / interval.

If so, perhaps storing that in the alert itself would be appropriate.

I think since we'll have a filter in the UI for statuses, it would make sense to store the status with the alert for searchability. After each execution, we would do an update on the alert document to update its status.
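As a rough sketch of that update (the saved object type and attribute names are assumed for illustration, not the actual schema):

```ts
// Minimal stand-in for the saved objects client; only what this sketch needs.
interface SavedObjectsClientLike {
  update(type: string, id: string, attributes: Record<string, unknown>): Promise<unknown>;
}

// After each run, persist the computed status on the alert saved object so the
// UI can show it as a column and filter on it.
async function persistAlertStatus(
  client: SavedObjectsClientLike,
  alertId: string,
  status: 'ok' | 'active' | 'error'
): Promise<void> {
  await client.update('alert', alertId, {
    status,
    statusDate: new Date().toISOString(), // assumed field name
  });
}
```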

@pmuellr
Member

pmuellr commented Nov 21, 2019

re: the "no data" status

It sounds like this could just be treated as an action group, for alert types that are sensitive to this. Eg, if they didn't have enough data, they'd schedule the action group "no-data", and could have whatever actions they wanted associated with that.

That would at least make that state "actionable", but it wouldn't let it show up as a "status" value without some kind of API change, such as what Mike suggested.

If we end up making this part of the API signature and the alert status, it feels like "not enough data" is probably better phrasing for this than "no data". Maybe something in the vein of "inconclusive" or such ...
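A sketch of the action-group idea from an alert type's executor, assuming a 'no-data' action group is registered for the alert type (the service shapes below are simplified stand-ins, not the real executor services):

```ts
// Simplified stand-ins for the executor services, for illustration only.
interface AlertInstanceLike {
  scheduleActions(actionGroup: string, context?: Record<string, unknown>): void;
}
interface ExecutorServicesLike {
  alertInstanceFactory(id: string): AlertInstanceLike;
}

async function executor({ services }: { services: ExecutorServicesLike }) {
  const docs = await fetchDocsForWindow(); // placeholder for the alert type's query
  if (docs.length === 0) {
    // Not enough data to evaluate the condition: fire the 'no-data' group so
    // users can attach whatever actions they want to that case.
    services.alertInstanceFactory('no-data').scheduleActions('no-data');
    return {};
  }
  // ...normal evaluation would go here...
  return {};
}

// Placeholder data fetch so the sketch is self-contained.
async function fetchDocsForWindow(): Promise<unknown[]> {
  return [];
}
```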

@pmuellr
Member

pmuellr commented Nov 21, 2019

After each execution, we would do an update on the alert document to update its status

Ya, what I was thinking. Hopefully we can piggy-back this on top of an existing update, like the scheduling of the next run.

This also means we won't need the event log to determine that status ...

@mdefazio
Contributor

mdefazio commented Dec 6, 2019

Posting this question here instead of Slack:

If an alert is disabled, is the status then also disabled or is it the last status before it was disabled?

Perhaps just No data?

@mdefazio
Contributor

mdefazio commented Dec 6, 2019

Also, is the warning level a status?

@peterschretlen
Contributor

Also, is the warning level a status?

I think the status would be active (active = has one or more alert instances)?

If an alert is disabled, is the status then also disabled or is it the last status before it was disabled?

A disabled alert has no status - could it be blank? If we need a value to filter on, then I think disabled as a state is OK. "No data" has a special meaning; I don't think it works for a disabled alert.

@peterschretlen
Contributor

Repeating a comment from #58366 (comment): we should be able to filter alerts by their status if possible.

@pmuellr
Member

pmuellr commented Jul 13, 2020

One thing not mentioned yet is "alert instance status". It seems like an alert instance can have most of the status values of the alert itself, except perhaps "error", since "error" indicates the alert executor ran into some problem. Note this specifically includes "no data", as some alert types may know the possible domain of their instances and be able to determine if an instance has not produced data. But not all alerts will be able to do this - the index threshold alert, for instance, doesn't know the domain of the possible groupings it uses for its instance IDs.

@pmuellr
Member

pmuellr commented Aug 18, 2020

It just occurred to me in a chat with Mike: we'll have the opportunity to "migrate" old alerts to contain data in this new executionStatus object, but how could we possibly get data to put in it? Presumably we could get some parts of it from the alert state, but I don't think you can access other SOs during a migration (it seems horribly complicated!).

I think we're only talking about the status and date fields - the error field can always be null.

And it's not really important what's in the SO itself, but what we return from alertClient methods and HTTP requests. So, do we want these to be optional? What a PITA that would be, when the only possible time they could be null is right after a migration, up until the alert function is executed for the first time after the migration.

Thinking we can have another status value of "unknown" that we can use in a case like this; it may come in handy later as well. We'll want to add a release note about this if it ends up showing up in the UI - not sure whether it will or not.

I don't think we will, looking at the current web UI. But that made me realize we probably want this new status field in the alerts table view:

[screenshot of the alerts table view]
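A sketch of that migration idea: old alert documents get an executionStatus placeholder with status 'unknown', which the framework overwrites on the next run (the document shape below is simplified, not the actual migration code):

```ts
// Simplified saved-object document shape, for illustration only.
interface AlertDocLike {
  attributes: Record<string, unknown>;
}

// Would be registered as a migration for the version that adds executionStatus.
function addExecutionStatusMigration(doc: AlertDocLike): AlertDocLike {
  return {
    ...doc,
    attributes: {
      ...doc.attributes,
      executionStatus: {
        status: 'unknown',                           // real status is unknowable at migration time
        lastExecutionDate: new Date().toISOString(), // best available value
        error: null,
      },
    },
  };
}
```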

pmuellr added a commit to pmuellr/kibana that referenced this issue Sep 4, 2020
…d object

resolves elastic#51099

This formalizes the concept of "alert status", in terms of its execution, with
some new fields in the alert saved object and types used with the alert client
and http APIs.

These fields are read-only from the client point-of-view; they are provided in
the alert structures, but are only updated by the alerting framework itself.
The values will be updated after each run of the alert type executor.

interim commits:

calculate the execution status, some refactoring
write the execution status to the alert after execution
use real date in execution status on create
add an await to an async fn
comment out status update to see if SIEM FT succeeds
fix SIEM FT alert deletion issue
use partial updates and retries in alerts clients to avoid conflicts
fix jest tests
clean up conflict-fixin code
moar conflict-prevention fixing
fix type error with find result
add reasons to alert execution errors
add some jest tests
add some function tests
fix status update to use alert namespace
fix function test
pmuellr added a commit to pmuellr/kibana that referenced this issue Sep 29, 2020
…d object

resolves elastic#51099

This formalizes the concept of "alert status", in terms of its execution, with
some new fields in the alert saved object and types used with the alert client
and http APIs.

These fields are read-only from the client point-of-view; they are provided in
the alert structures, but are only updated by the alerting framework itself.
The values will be updated after each run of the alert type executor.

The data is added to the alert as the `executionStatus` field, with the
following shape:

```ts
interface AlertExecutionStatus {
  status: 'ok' | 'active' | 'error' | 'unknown';
  date: Date;
  error?: {
    reason: 'read' | 'decrypt' | 'execute' | 'unknown';
    message: string;
  };
}
```

interim commits:

calculate the execution status, some refactoring
write the execution status to the alert after execution
use real date in execution status on create
add an await to an async fn
comment out status update to see if SIEM FT succeeds
fix SIEM FT alert deletion issue
use partial updates and retries in alerts clients to avoid conflicts
fix jest tests
clean up conflict-fixin code
moar conflict-prevention fixing
fix type error with find result
add reasons to alert execution errors
add some jest tests
add some function tests
fix status update to use alert namespace
fix function test
finish function tests
more fixes after rebase
fix type checks and jest tests after rebase
add migration and find functional tests
fix relative import
pmuellr added a commit that referenced this issue Oct 1, 2020
…d object (#75553)

resolves #51099

This formalizes the concept of "alert status", in terms of its execution, with
some new fields in the alert saved object and types used with the alert client
and http APIs.

These fields are read-only from the client point-of-view; they are provided in
the alert structures, but are only updated by the alerting framework itself.
The values will be updated after each run of the alert type executor.

The data is added to the alert as the `executionStatus` field, with the
following shape:

```ts
interface AlertExecutionStatus {
  status: 'ok' | 'active' | 'error' | 'pending' | 'unknown';
  lastExecutionDate: Date;
  error?: {
    reason: 'read' | 'decrypt' | 'execute' | 'unknown';
    message: string;
  };
}
```
pmuellr added a commit to pmuellr/kibana that referenced this issue Oct 1, 2020
…d object (elastic#75553)

resolves elastic#51099

This formalizes the concept of "alert status", in terms of its execution, with
some new fields in the alert saved object and types used with the alert client
and http APIs.

These fields are read-only from the client point-of-view; they are provided in
the alert structures, but are only updated by the alerting framework itself.
The values will be updated after each run of the alert type executor.

The data is added to the alert as the `executionStatus` field, with the
following shape:

```ts
interface AlertExecutionStatus {
  status: 'ok' | 'active' | 'error' | 'pending' | 'unknown';
  lastExecutionDate: Date;
  error?: {
    reason: 'read' | 'decrypt' | 'execute' | 'unknown';
    message: string;
  };
}
```
pmuellr added a commit that referenced this issue Oct 2, 2020
…d object (#75553) (#79227)

resolves #51099

This formalizes the concept of "alert status", in terms of its execution, with
some new fields in the alert saved object and types used with the alert client
and http APIs.

These fields are read-only from the client point-of-view; they are provided in
the alert structures, but are only updated by the alerting framework itself.
The values will be updated after each run of the alert type executor.

The data is added to the alert as the `executionStatus` field, with the
following shape:

```ts
interface AlertExecutionStatus {
  status: 'ok' | 'active' | 'error' | 'pending' | 'unknown';
  lastExecutionDate: Date;
  error?: {
    reason: 'read' | 'decrypt' | 'execute' | 'unknown';
    message: string;
  };
}
```
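For context, a minimal sketch of how a consumer of the alerts client or HTTP API might read the new field (the surrounding alert shape is assumed from the interface above):

```ts
// Assumed alert shape containing the executionStatus field described above.
interface AlertWithExecutionStatus {
  id: string;
  executionStatus: {
    status: 'ok' | 'active' | 'error' | 'pending' | 'unknown';
    lastExecutionDate: Date;
    error?: { reason: 'read' | 'decrypt' | 'execute' | 'unknown'; message: string };
  };
}

// Render a one-line summary of an alert's execution status, e.g. for a table cell.
function describeExecutionStatus(alert: AlertWithExecutionStatus): string {
  const { status, lastExecutionDate, error } = alert.executionStatus;
  if (status === 'error' && error) {
    return `error (${error.reason}): ${error.message}`;
  }
  return `${status} as of ${lastExecutionDate.toISOString()}`;
}
```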
@kobelb added the needs-team label on Jan 31, 2022
@botelastic removed the needs-team label on Jan 31, 2022