[Alerting] Index threshold: Actions not fired as expected #84335

Closed
felix-lessoer opened this issue Nov 25, 2020 · 16 comments
Assignees: ymao1
Labels: bug, Feature:Alerting, Team:ResponseOps

Comments

@felix-lessoer

Kibana version:
7.9.3

Elasticsearch version:
7.9.3

Original install method (e.g. download page, yum, from source, etc.):
on Prem

Describe the bug:
The user has set up an alert that should always fire, but the re-notification is not consistent, and the alert has also been marked as resolved even though nothing was actually resolved.

(screenshot: alert configuration)

Steps to reproduce:

  1. Create an index threshold Kibana alert whose condition is always true
  2. Set the notify ("notify every") configuration
  3. Check the results, e.g. in the .kibana-event-log-* index (see the query sketch below)
    The timing is not correct: sometimes it takes longer than 1h to get re-notified
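
One way to pull these timings is to query the event log index directly. The sketch below assumes the 7.x @elastic/elasticsearch client and the default .kibana-event-log-* index pattern; the alert-id filter on kibana.saved_objects.id is an assumption and may need adjusting (e.g. a nested query) depending on how that field is mapped in your version.

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// List the alerting "execute" events for one alert so the gaps between runs
// can be compared against the configured interval.
async function listAlertExecutions(alertId: string): Promise<void> {
  const { body } = await client.search({
    index: '.kibana-event-log-*',
    size: 100,
    body: {
      sort: [{ '@timestamp': { order: 'asc' } }],
      query: {
        bool: {
          filter: [
            { term: { 'event.provider': 'alerting' } },
            { term: { 'event.action': 'execute' } },
            // Assumption: adjust this clause if kibana.saved_objects is mapped as `nested`.
            { term: { 'kibana.saved_objects.id': alertId } },
          ],
        },
      },
    },
  });
  for (const hit of body.hits.hits) {
    console.log(hit._source['@timestamp'], hit._source.event.outcome, hit._source.message);
  }
}

Dropping the event.action filter should also surface the other provider:alerting records (new-instance / resolved-instance) that come up later in this thread.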

Expected behavior:
The re-notification happens every hour, at the same time each hour.

Screenshots (if relevant):
(screenshot: event log results)

@mikecote added the Team:ResponseOps label Nov 25, 2020
@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote (Contributor)

Hi @felix-lessoer, I have a couple of questions to help debug the issue:

  • What is the "check every" and "notify every" setting for this alert?
  • Can you clarify what you mean by "alert has been marked as resolved"? Where is this seen?
  • How many alerts is the deployment running?

@felix-lessoer (Author) commented Nov 25, 2020 via email

@mikecote (Contributor)

The Alert Details page showed that there was a new alert instance at about 13:xx, and the duration at that time was only 10 min. So it has created a new instance, but the condition was always true.

We've merged a fix in 7.10 to display the true time the alert instance was created (PR: #68437). Prior to 7.10, it only shows the time from the previous execution.

notifying is not consistent

It does seem strange that the log entries are not 1 hr apart. Do the queries take a while to run (say, longer than 5 minutes)? Are there any logs in Kibana that could help find the reason for this (either alerting- or task-manager-related)?

@pmuellr (Member) commented Dec 1, 2020

7.9.x still has the "zombie" tasks issue, but I'm guessing this wouldn't be a problem with only a single alert in the system - though I don't know how many other tasks we have running; I think it's still under 10. If it were over 10, it's potentially a zombie task case.

The Alert Details page showed that there was a new alert instance at about 13:xx And duration was during that time only at 10 min

That doesn't seem right with the "FOR THE LAST" set to 24 hours - lots of overlap with previous runs. Sorry for the late request, but any chance we could see more of the event log records besides the provider:alerting action:execute ones? The other provider:alerting records would indicate when new/resolved instances were computed.

It's possible that it's considered "new" because the event log search didn't go back far enough, but I thought that search would go back some multiplier on the alert interval (so at least two hours).

Also interesting that the execution times seem to be 1 hour + some multiple (0-3) of five minutes; 5 minutes is our error retry? I believe errors encountered before the alert execution starts will NOT be logged in the event log, until 7.11. But there could be something in the Kibana log about these ...

@felix-lessoer (Author)

Thanks. I will try to collect more info.

@felix-lessoer (Author)

That doesn't seem right with the "FOR THE LAST" set to 24 hours - lots of overlap with previous runs. Sorry for the late request, but any chance we could see more of the event log records besides the provider:alerting action:execute ones? The other provider:alerting records would indicate when new/resolved instances were computed.

This is the full event log. There was no other entry.

@mikecote added the bug label Dec 1, 2020
@ymao1 self-assigned this Dec 2, 2020
@ymao1 (Contributor) commented Dec 3, 2020

Also interesting that the execution times seem to be 1 hour + some multiple (0-3) of five minutes; 5 minutes is our error retry? I believe errors encountered before the alert execution starts will NOT be logged in the event log, until 7.11. But there could be something in the Kibana log about these ...

Good call @pmuellr. After some local investigation with 7.9.3, it looks like an executor failure within the alert will be handled and logged appropriately in the event log, and the alert will be rescheduled for its next scheduled interval. That doesn't look like what's happening here. The alert will be retried in 5 minutes when there is an error decrypting attributes, fetching an API key, or getting services with user permissions. Unfortunately, it doesn't look like we're doing any logging to narrow down which one of these functions is failing. This is the PR that added the fallback.

This is the relevant function:

async loadAlertAttributesAndRun(): Promise<Resultable<AlertTaskRunResult, Error>> {
  const {
    params: { alertId, spaceId },
  } = this.taskInstance;
  const apiKey = await this.getApiKeyForAlertPermissions(alertId, spaceId);
  const services = await this.getServicesWithSpaceLevelPermissions(spaceId, apiKey);
  // Ensure API key is still valid and user has access
  const { attributes, references } = await services.savedObjectsClient.get<RawAlert>(
    'alert',
    alertId
  );
  return {
    state: await promiseResult<AlertTaskState, Error>(
      this.validateAndExecuteAlert(services, apiKey, attributes, references)
    ),
    runAt: asOk(
      getNextRunAt(
        new Date(this.taskInstance.startedAt!),
        // we do not currently have a good way of returning the type
        // from SavedObjectsClient, and as we currently require a schedule
        // and we only support `interval`, we can cast this safely
        attributes.schedule as IntervalSchedule
      )
    ),
  };
}

Do you know if there is one of these functions that could potentially fail one or two times and then succeed?
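
For the timing math, here is a hypothetical simplification of the fallback described above (illustrative only, not the actual task runner code; the function names are made up). With this shape, N consecutive pre-execution failures produce an observed gap of roughly interval + N × 5 minutes, which matches the "1 hour + some multiple of five minutes" pattern noted earlier.

const RETRY_DELAY_MS = 5 * 60 * 1000; // the 5-minute retry mentioned in this thread

async function scheduleNextRun(
  loadPrereqs: () => Promise<void>,  // decrypt attributes, fetch API key, get services
  executeAlert: () => Promise<void>, // run the alert type executor
  intervalMs: number                 // the alert's "check every" interval
): Promise<Date> {
  try {
    await loadPrereqs();
  } catch (err) {
    // Pre-execution failure: retried in 5 minutes, and in 7.9.x nothing is
    // logged that says which of these steps failed.
    return new Date(Date.now() + RETRY_DELAY_MS);
  }
  try {
    await executeAlert();
  } catch (err) {
    // Executor failure: handled and written to the event log; normal reschedule.
  }
  return new Date(Date.now() + intervalMs);
}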

@pmuellr (Member) commented Dec 4, 2020

After some local investigation with 7.9.3 ... The alert will be retried in 5 minutes when there is an error decrypting attributes, fetching an api key, or getting services with user permissions. Unfortunately, it doesn't look like we're doing any logging to narrow down which one of these functions are failing.

We will be logging these in the event log anyway, in 7.11, via #82401

I wonder if we should log these in the Kibana log as well ...

Do you know if there is one of these functions that could potentially fail one or two times and then succeed?

Getting 502/504s from Kibana + ES requests, from cloud, is pretty common. Presumably a retry could work. I suspect that if we look at the sum total of all the requests taking place during alert execution, there are going to be a lot of them. Running with APM instrumentation is one way to see them.

Anyhoo, I'm wondering if we're hitting some of these during the "inner" requests being made, and there's no retry logic happening - I'm not sure if there's supposed to be or not, TBH. I think the "new" ES client library has built-in support for retries, but I'm not sure what it uses to determine whether something is retry-able - I would hope a 502/504 would be.
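
For reference, the client does expose retry settings at construction time; a minimal sketch, assuming the 7.x @elastic/elasticsearch client (whether a 502/504 response is actually treated as retry-able by the transport, as opposed to only connection/timeout errors, is exactly the open question here):

import { Client } from '@elastic/elasticsearch';

// maxRetries defaults to 3 in the 7.x client; bumping it only helps if the
// failure mode is one the transport considers retry-able.
const client = new Client({
  node: 'http://localhost:9200',
  maxRetries: 5,
  requestTimeout: 30000,
});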

In this particular case, it's not cloud but on prem, so I'd think the chances of 502/504s are going to be zero. However, it's still possible that we could be overloading something and causing similar sorts of effects.

@felix-lessoer did you have a repro of this that we could try on the main branch (what we'll be shipping in 7.11)?

@felix-lessoer (Author)

Well, the test was:

1.) Load data with Beats, e.g. Metricbeat
2.) Create an alert that is always true, like count of events < 9999999 (see the sketch after this list)
3.) Check how often and when it gets fired over at least 24h
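
A rough sketch of step 2 as an API call, assuming a 7.9/7.10 stack where alerts are created via POST /api/alerts/alert with the .index-threshold alert type; the index pattern, consumer, interval, and Kibana URL below are placeholders, and a fetch-capable runtime (or a polyfill) is assumed.

// "Always true" index threshold alert: count of events < 9999999 over the
// last 24h, checked every hour and throttled ("notify every") to 1h.
const alwaysTrueAlert = {
  name: 'always-true index threshold',
  alertTypeId: '.index-threshold',
  consumer: 'alerts',
  tags: [],
  schedule: { interval: '1h' },
  throttle: '1h',
  actions: [],
  params: {
    index: ['metricbeat-*'],   // placeholder index pattern
    timeField: '@timestamp',
    aggType: 'count',
    groupBy: 'all',
    thresholdComparator: '<',
    threshold: [9999999],
    timeWindowSize: 24,
    timeWindowUnit: 'h',
  },
};

async function createAlert(kibanaUrl: string): Promise<void> {
  const res = await fetch(`${kibanaUrl}/api/alerts/alert`, {
    method: 'POST',
    // kbn-xsrf header is required for Kibana API calls
    headers: { 'kbn-xsrf': 'true', 'Content-Type': 'application/json' },
    body: JSON.stringify(alwaysTrueAlert),
  });
  console.log(res.status);
}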

@pmuellr (Member) commented Dec 4, 2020

Thanks @felix-lessoer.

@mikecote another thing to add to our perf tests is to capture error counts. If the sort of thing Felix is seeing is happening at that relatively slow scale, it seems like the perf tests might also see it (I'd expect even more, actually, due to the sheer number of requests taking place). I don't believe we are tracking any errors from the event log today in the test runner.

@mikecote (Contributor) commented Dec 4, 2020

@mikecote another thing to add to our perf tests is to capture error counts. If the sort of thing Felix is seeing is happening at that relatively slow scale, it seems like the perf tests might also see it (I'd expect even more, actually, due to the sheer number of requests taking place). I don't believe we are tracking any errors from the event log today in the test runner.

👌 Good point, I've added an item in #40264 to capture this information.

@ymao1 (Contributor) commented Dec 21, 2020

@felix-lessoer Any luck getting the event and Kibana logs for this?

@felix-lessoer (Author)

I asked the customer, but it looks like the issue no longer exists after upgrading to 7.10.1, so we also don't have fresh logs.

@ymao1 (Contributor) commented Dec 21, 2020

Are we good to close this issue then? If it crops up again, we can open a new one.

@felix-lessoer (Author) commented Dec 21, 2020 via email

@ymao1 closed this as completed Dec 22, 2020
@kobelb added the needs-team label Jan 31, 2022
@botelastic removed the needs-team label Jan 31, 2022