
Action on status-change failing in Logs alerts #86507

Closed
arisonl opened this issue Dec 18, 2020 · 7 comments · Fixed by #87369
@arisonl
Contributor

arisonl commented Dec 18, 2020

Kibana, Elasticsearch, Filebeat version: 7.11 BC1

Describe the bug: Setting up a phrase match log alert with "Notify every" set to "Run only on status change" causes the notification to be sent repeatedly after the status change occurs.

Steps to reproduce:

  1. Set up a log alert that counts phrase matches within a time window. The alert is triggered when the phrase match count exceeds a threshold.
  2. Configure the alert to notify on status change, and add a Slack action with "Run when" = "Fired" and another Slack action with "Run when" = "Recovered" (a configuration sketch follows this list).
  3. Trigger the alert by producing matching logs above the count threshold, then let it recover by stopping the matching logs so the count falls below the threshold.
  4. The notification will not stop firing even long after the alert has recovered.
  5. After a while, the term that triggers the phrase match is no longer present in the logs within the time window the alert is evaluating, yet the action keeps firing.
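
For context, here is a minimal sketch of the setup from steps 1–2, expressed as a call to Kibana's 7.x alerts HTTP API. The `alertTypeId`, action group ids, and the `params` shape for the Logs threshold rule are assumptions for illustration and may not match the exact schema:

```ts
// Sketch only: field names below approximate the UI settings described in
// the reproduction steps and are not guaranteed to match the real schema.
const alertBody = {
  name: 'error-phrase-count',
  alertTypeId: 'logs.alert.document.count', // Logs threshold rule type (assumed id)
  consumer: 'logs',                         // assumed consumer
  schedule: { interval: '1m' },             // how often the rule is checked
  notifyWhen: 'onActionGroupChange',        // "Notify every" = "Run only on status change"
  params: {
    // "phrase match count exceeds 10 within the last 5 minutes" (assumed field names)
    count: { comparator: 'more than', value: 10 },
    timeSize: 5,
    timeUnit: 'm',
    criteria: [{ field: 'message', comparator: 'matches phrase', value: 'connection refused' }],
  },
  actions: [
    // Slack action with "Run when" = "Fired" (assumed group id)
    { group: 'logs.threshold.fired', id: '<slack-connector-id>', params: { message: 'Log threshold alert fired' } },
    // Slack action with "Run when" = "Recovered" (the built-in recovered group)
    { group: 'recovered', id: '<slack-connector-id>', params: { message: 'Log threshold alert recovered' } },
  ],
};

// Create the rule via the 7.x alerts API.
await fetch('/api/alerts/alert', {
  method: 'POST',
  headers: { 'kbn-xsrf': 'true', 'Content-Type': 'application/json' },
  body: JSON.stringify(alertBody),
});
```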

Expected behavior: One Slack Recovery notification once the alert recovers. No subsequent notifications until the alert is triggered again.

Screenshots (if relevant): In the following capture, I have muted the alert in order to demo it. Once it is unmuted, the Recovery notification won't stop triggering.

[Screen capture: recovered alert repeatedly triggering the Recovery notification]

Any additional context: Muting the alert is irrelevant to the problem; I used it to pause notifications so that the problem is easier to show.

@arisonl added the bug, Team:Infra Monitoring UI, Team:Observability, Team:ResponseOps, and Project:Alerting labels on Dec 18, 2020
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@ymao1
Contributor

ymao1 commented Dec 18, 2020

This also seems to happen if you set a throttle value (even a larger value like 1h): you still get Recovered messages at the alert schedule interval.

I do not see the same behavior for the Stack Alert Index Threshold Alert.

I think this is related to the recent addition of the built-in recovered action group (PR). I see in the executor function of the Logs threshold alert that there is separate handling for when an alert goes into an OK state. Maybe that needs to be updated now that there is a built-in recovered action group?
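
To make the suspected pattern concrete, here is a simplified sketch (not the actual Kibana source; `evaluateLogCount` and the group ids are hypothetical): if the executor explicitly schedules actions for a recovered group whenever the alert is in an OK state, the framework sees the alert instance as active on every execution, so the Recovered action is re-sent at the schedule interval regardless of `notifyWhen` or throttle.

```ts
// Simplified illustration of the suspected executor pattern — not the real
// Logs threshold executor. alertInstanceFactory/scheduleActions are the 7.x
// alerting framework APIs; everything else here is made up for illustration.
declare function evaluateLogCount(params: any): Promise<number>; // hypothetical query helper

async function executor({ services, params }: { services: any; params: any }) {
  const count = await evaluateLogCount(params);
  const instance = services.alertInstanceFactory('*');

  if (count > params.count.value) {
    instance.scheduleActions('logs.threshold.fired', { count });
  } else {
    // Problematic: manually firing a "recovered" group keeps the instance
    // active on every run, so the Recovered message repeats at the rule's
    // schedule interval instead of firing once on the status change.
    instance.scheduleActions('recovered', { count });
  }
}
```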

@gmmorris
Contributor

gmmorris commented Dec 30, 2020

This is happening because O11y are manually scheduling recovery of the alert instance - something we do not yet support.
We have a follow-up PR that will make this impossible (until we can support this properly), but they kind of beat us to it, and we weren't clear enough in the docs that this shouldn't be done.

I don't think we have a straightforward small solution we can fit into 7.11... my advice would be for @elastic/logs-metrics-ui to roll back that usage until we implement #87048, but it's up to them whether this is a blocker or not.
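
Roughly, the rollback being suggested amounts to dropping the manual OK-state branch and letting the framework's built-in recovered action group fire on its own (a sketch under the same assumptions as the previous snippet, with the same hypothetical helper):

```ts
// Sketch of the rolled-back executor: only the "fired" group is scheduled.
// Once the instance stops being scheduled, the framework itself runs the
// actions attached to the built-in recovered group a single time, which is
// the behavior the report expects.
declare function evaluateLogCount(params: any): Promise<number>; // hypothetical query helper

async function executor({ services, params }: { services: any; params: any }) {
  const count = await evaluateLogCount(params);
  if (count > params.count.value) {
    services.alertInstanceFactory('*').scheduleActions('logs.threshold.fired', { count });
  }
  // No else branch: recovery is detected and notified by the framework.
}
```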

@Zacqary added the Feature:Logs UI label on Dec 30, 2020
@Zacqary added this to the Logs UI 7.11 milestone on Dec 30, 2020
@gmmorris
Contributor

gmmorris commented Dec 31, 2020

I've added an item to Monday's sync, but after a chat with @Zacqary it sounds like O11y can live with removing the manual recovery for the 7.11 release, which should address this bug.

I'll keep the discussion point in the sync to make sure we're aligned, but I suggest we:

  1. Remove this item from Make it Action, as it'll be handled by @elastic/logs-metrics-ui in their project
  2. Prioritize "RuleTypes can't provide an AlertContext on recovery" (#87048) for 7.12/7.13

Any thoughts?

@Kerry350
Contributor

Kerry350 commented Jan 4, 2021

@gmmorris I'm just catching up on things as I've been away for a few weeks. It looks like @Zacqary has spoken with you already, but from the logs side we can roll back this usage for now; I'll get a PR up for that tomorrow.

@gmmorris
Contributor

gmmorris commented Jan 4, 2021

@Kerry350 yup, all good 👍
