
Action on status-change failing in Logs alerts #86507

Closed
arisonl opened this issue Dec 18, 2020 · 7 comments · Fixed by #87369
@arisonl
Contributor

arisonl commented Dec 18, 2020

Kibana, Elasticsearch, Filebeat version: 7.11 BC1

Describe the bug: Setting up a phrase match log alert with "Notify every" set to "Run only on status change" causes the notification to be sent repeatedly after the status change occurs.

Steps to reproduce:

  1. Set up a log alert that counts phrase matches within a time window. The alert is triggered when the phrase match count exceeds a threshold.
  2. Configure the alert to notify on status change, and add a Slack action with "Run when" = "Fired" and another Slack action with "Run when" = "Recovered" (a configuration sketch follows this list).
  3. Trigger the alert by producing matching logs above the count threshold, then let it recover by stopping the matching logs so the count falls below the threshold.
  4. The notification will not stop firing even long after the alert has recovered.
  5. After a while, the term that triggers the phrase match is no longer present in the logs within the time window the alert is evaluating, yet the action keeps firing.
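
For context, here is a minimal sketch of the setup from steps 1–2, expressed as a call to Kibana's 7.x alerts HTTP API. The `alertTypeId`, action group ids, and the `params` shape for the Logs threshold rule are assumptions for illustration and may not match the exact schema:

```ts
// Sketch only: field names below approximate the UI settings described in
// the reproduction steps and are not guaranteed to match the real schema.
const alertBody = {
  name: 'error-phrase-count',
  alertTypeId: 'logs.alert.document.count', // Logs threshold rule type (assumed id)
  consumer: 'logs',                         // assumed consumer
  schedule: { interval: '1m' },             // how often the rule is checked
  notifyWhen: 'onActionGroupChange',        // "Notify every" = "Run only on status change"
  params: {
    // "phrase match count exceeds 10 within the last 5 minutes" (assumed field names)
    count: { comparator: 'more than', value: 10 },
    timeSize: 5,
    timeUnit: 'm',
    criteria: [{ field: 'message', comparator: 'matches phrase', value: 'connection refused' }],
  },
  actions: [
    // Slack action with "Run when" = "Fired" (assumed group id)
    { group: 'logs.threshold.fired', id: '<slack-connector-id>', params: { message: 'Log threshold alert fired' } },
    // Slack action with "Run when" = "Recovered" (the built-in recovered group)
    { group: 'recovered', id: '<slack-connector-id>', params: { message: 'Log threshold alert recovered' } },
  ],
};

// Create the rule via the 7.x alerts API.
await fetch('/api/alerts/alert', {
  method: 'POST',
  headers: { 'kbn-xsrf': 'true', 'Content-Type': 'application/json' },
  body: JSON.stringify(alertBody),
});
```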

Expected behavior: One Slack Recovery notification once the alert recovers. No subsequent notifications until the alert is triggered again.

Screenshots (if relevant): In the following capture, I have muted the alert in order to demo it. Once it is unmuted, the Recovery notification won't stop triggering.

[Screen capture: recovered alert repeatedly triggering the Recovery notification]

Any additional context: Muting the alert is irrelevant to the problem; I used it to pause notifications so that the problem is easier to show.

@arisonl added the bug, Team:Infra Monitoring UI, Team:Observability, Team:ResponseOps, and Project:Alerting labels on Dec 18, 2020
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@ymao1
Contributor

ymao1 commented Dec 18, 2020

This also seems to happen if you set a throttle value (even a larger value like 1h): you still get Recovered messages at the alert schedule interval.

I do not see the same behavior for the Stack Alert Index Threshold Alert.

I think this is related to the recent addition of the built-in recovered action group (PR). I see in the executor function of the Logs threshold alert that there is separate handling for when an alert goes into an OK state. Maybe that needs to be updated now that there is a built-in recovered action group?
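
To make the suspected pattern concrete, here is a simplified sketch (not the actual Kibana source; `evaluateLogCount` and the group ids are hypothetical): if the executor explicitly schedules actions for a recovered group whenever the alert is in an OK state, the framework sees the alert instance as active on every execution, so the Recovered action is re-sent at the schedule interval regardless of `notifyWhen` or throttle.

```ts
// Simplified illustration of the suspected executor pattern — not the real
// Logs threshold executor. alertInstanceFactory/scheduleActions are the 7.x
// alerting framework APIs; everything else here is made up for illustration.
declare function evaluateLogCount(params: any): Promise<number>; // hypothetical query helper

async function executor({ services, params }: { services: any; params: any }) {
  const count = await evaluateLogCount(params);
  const instance = services.alertInstanceFactory('*');

  if (count > params.count.value) {
    instance.scheduleActions('logs.threshold.fired', { count });
  } else {
    // Problematic: manually firing a "recovered" group keeps the instance
    // active on every run, so the Recovered message repeats at the rule's
    // schedule interval instead of firing once on the status change.
    instance.scheduleActions('recovered', { count });
  }
}
```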

@gmmorris
Contributor

gmmorris commented Dec 30, 2020

This is happening because O11y are manually scheduling recovery of the alert instance - something we do not yet support.
We have a follow-up PR that will make this impossible (until we can support this properly), but they kind of beat us to it, and we weren't clear enough in the docs that this shouldn't be done.

I don't think we have a straightforward small solution we can fit into 7.11... my advice would be for @elastic/logs-metrics-ui to roll back that usage until we implement #87048, but it's up to them whether this is a blocker or not.
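
Roughly, the rollback being suggested amounts to dropping the manual OK-state branch and letting the framework's built-in recovered action group fire on its own (a sketch under the same assumptions as the previous snippet, with the same hypothetical helper):

```ts
// Sketch of the rolled-back executor: only the "fired" group is scheduled.
// Once the instance stops being scheduled, the framework itself runs the
// actions attached to the built-in recovered group a single time, which is
// the behavior the report expects.
declare function evaluateLogCount(params: any): Promise<number>; // hypothetical query helper

async function executor({ services, params }: { services: any; params: any }) {
  const count = await evaluateLogCount(params);
  if (count > params.count.value) {
    services.alertInstanceFactory('*').scheduleActions('logs.threshold.fired', { count });
  }
  // No else branch: recovery is detected and notified by the framework.
}
```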

@Zacqary added the Feature:Logs UI label on Dec 30, 2020
@Zacqary added this to the Logs UI 7.11 milestone on Dec 30, 2020
@gmmorris
Contributor

gmmorris commented Dec 31, 2020

I've added an item to Monday's sync, but after a chat with @Zacqary it sounds like O11y can live with removing the manual recovery for the 7.11 release, which should address this bug.

I'll keep the discussion point in the sync to make sure we're aligned, but I suggest we:

  1. Remove this item from Make it Action, as it'll be handled by @elastic/logs-metrics-ui in their project
  2. Prioritize "RuleTypes can't provide an AlertContext on recovery" (#87048) for 7.12/7.13

Any thoughts?

@Kerry350
Contributor

Kerry350 commented Jan 4, 2021

@gmmorris I'm just catching up on things as I've been away for a few weeks. It looks like @Zacqary has spoken with you already, but from the logs side we can roll back this usage for now; I'll get a PR up for that tomorrow.

@gmmorris
Contributor

gmmorris commented Jan 4, 2021

@Kerry350 yup, all good 👍
