[Alerting] Index threshold: Actions not fired as expected #84335

felix-lessoer · 2020-11-25T14:44:27Z

Kibana version:
7.9.3

Elasticsearch version:
7.9.3

Original install method (e.g. download page, yum, from source, etc.):
on Prem

Describe the bug:
User has set up an alert that should always fire.
But the re notifying is not consitent and also the alert has been marked as resolved but there was no resolution.

Steps to reproduce:

Create an index threshold Kibana alert that always is true
Set the notify config
Check the results, e.g. in .kibana-events-log
The timing is not correct. Sometimes it takes longer than 1h to get re notified

Expected behavior:
The Re notify happens every hour at the same time

Screenshots (if relevant):

elasticmachine · 2020-11-25T14:58:32Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

mikecote · 2020-11-25T15:09:03Z

Hi @felix-lessoer I have a couple of questions to help debug the issue:

what is the "check every" and "notify every" setting for this alert?
can you clarify what you mean by "alert has been marked as resolved"? where is this seen?
how many alerts is the deployment running?

felix-lessoer · 2020-11-25T15:48:24Z

Check every is set to 1h Notify every is set to 1h The Alert Details page showed that there was a new alert instance at about 13:xx And duration was during that time only at 10 min. So it hast created a new instance. But the condition was always true. The user is only running 1 Kibana alert but several Watchers.

mikecote · 2020-11-25T16:33:11Z

The Alert Details page showed that there was a new alert instance at about
13:xx And duration was during that time only at 10 min. So it hast created
a new instance. But the condition was always true.

We've merged a fix in 7.10 to display the true time the alert instance was created (PR: #68437). Prior to 7.10, it only shows the time from the previous execution.

notifying is not consitent

It does seem strange to see the logs not 1hr apart. Do the queries take a while to run? (say longer than 5 minutes?). Is there any logs in Kibana that could help find out the reason for this? (either alerting or task manager related).

pmuellr · 2020-12-01T14:15:59Z

7.9.x still has the "zombie" tasks issue, but I'm guessing this wouldn't be a problem with only a single alert in the system - though I don't know how many other tasks we have running - I think it's still under 10. If it was over 10, it's potentially a zombie task case.

The Alert Details page showed that there was a new alert instance at about 13:xx And duration was during that time only at 10 min

That doesn't seem right, with the "FOR THE LAST" set to 24 hours - lots of overlap with previous runs. Sorry for the late request, but any chance we could more of the event log records besides provider:alerting action:execute ones? The other provider:alerting would indicate when new/resolved instances were computed.

It's possible that it's considered "new" because the event log search didn't go back far enough, but I thought that search would go back some multiplier on the alert interval (so at least two hours).

Also interesting that the execution times seem to be 1 hour + some multiple (0-3) of five minutes; 5 minutes is our error retry? I believe errors encountered before the alert execution starts will NOT be logged in the event log, until 7.11. But there could be something in the Kibana log about these ...

felix-lessoer · 2020-12-01T14:34:36Z

Thanks. I will try to collect more info.

felix-lessoer · 2020-12-01T14:40:25Z

That doesn't seem right, with the "FOR THE LAST" set to 24 hours - lots of overlap with previous runs. Sorry for the late request, but any chance we could more of the event log records besides provider:alerting action:execute ones? The other provider:alerting would indicate when new/resolved instances were computed.

This is the full event log. There was no other entry.

ymao1 · 2020-12-03T19:38:25Z

Also interesting that the execution times seem to be 1 hour + some multiple (0-3) of five minutes; 5 minutes is our error retry? I believe errors encountered before the alert execution starts will NOT be logged in the event log, until 7.11. But there could be something in the Kibana log about these ...

Good call @pmuellr. After some local investigation with 7.9.3, it looks like an executor failure with the alert will be handled and logged appropriately in the event log and the alert will reschedule for its next scheduled interval. That doesn't look like what's happening here. The alert will be retried in 5 minutes when there is an error decrypting attributes, fetching an api key, or getting services with user permissions. Unfortunately, it doesn't look like we're doing any logging to narrow down which one of these functions are failing. This is the PR that added the fallback.

This is the relevant function:

kibana/x-pack/plugins/alerts/server/task_runner/task_runner.ts

Lines 295 to 323 in 941c66f

    
           async loadAlertAttributesAndRun(): Promise<Resultable<AlertTaskRunResult, Error>> { 
        
             const { 
        
               params: { alertId, spaceId }, 
        
             } = this.taskInstance; 
        
             const apiKey = await this.getApiKeyForAlertPermissions(alertId, spaceId); 
        
             const services = await this.getServicesWithSpaceLevelPermissions(spaceId, apiKey); 
        
             // Ensure API key is still valid and user has access 
        
             const { attributes, references } = await services.savedObjectsClient.get<RawAlert>( 
        
               'alert', 
        
               alertId 
        
             ); 
        
             return { 
        
               state: await promiseResult<AlertTaskState, Error>( 
        
                 this.validateAndExecuteAlert(services, apiKey, attributes, references) 
        
               ), 
        
               runAt: asOk( 
        
                 getNextRunAt( 
        
                   new Date(this.taskInstance.startedAt!), 
        
                   // we do not currently have a good way of returning the type 
        
                   // from SavedObjectsClient, and as we currenrtly require a schedule 
        
                   // and we only support `interval`, we can cast this safely 
        
                   attributes.schedule as IntervalSchedule 
        
                 ) 
        
               ), 
        
             }; 
        
           }

Do you know if there is one of these functions that could potentially fail one or two times and then succeed?

pmuellr · 2020-12-04T14:35:18Z

After some local investigation with 7.9.3 ... The alert will be retried in 5 minutes when there is an error decrypting attributes, fetching an api key, or getting services with user permissions. Unfortunately, it doesn't look like we're doing any logging to narrow down which one of these functions are failing.

We will be logging these in the event log anyway, in 7.11, via #82401

I wonder if we should log these in the Kibana log as well ...

Do you know if there is one of these functions that could potentially fail one or two times and then succeed?

Getting 502/504's from Kibana + ES requests, from cloud, is pretty common. Presumably a retry could work. I suspect if we look at the sum total of all the requests taking place during alert execution, there's going to be a lot of them. Running with APM instrumented is one way to see them.

Anyhoo, I'm wondering if we're hitting some of these during the "inner" requests being made, and there's no retry logic happening - I'm not sure if there's supposed to be or not, TBH. I think the "new" ES client library has built-in support for retries, but not sure what it determines whether something is retry-able - I would hope a 502/504 would be.

In this particular case, it's not cloud, but on prem, so I'd think the chances of 502/504's is going to be zero. However, it's still possible that we could be overloading something and causing similar sorts of effects.

@felix-lessoer did you have a repro of this that we could try on the main branch (what we'll be shipping in 7.11?)

felix-lessoer · 2020-12-04T15:05:58Z

Well the test was:

1.) Loading data with Beats, e.g. Metricbeat
2.) Create an alert that is always true like count of events < 9999999
3.) Check how often and when it is getting fired for at least 24h

pmuellr · 2020-12-04T19:54:22Z

Thanks @felix-lessoer.

@mikecote another thing to add to our perf tests is to capture error counts. If the sort of thing Felix is seeing is happening at that relatively slow scale, it seems like the perf tests might also see it (I'd expect even more, actually, do to the sheer number of requests taking place. I don't believe we are tracking any errors from the event log today, in the test runner.

mikecote · 2020-12-04T20:01:33Z

@mikecote another thing to add to our perf tests is to capture error counts. If the sort of thing Felix is seeing is happening at that relatively slow scale, it seems like the perf tests might also see it (I'd expect even more, actually, do to the sheer number of requests taking place. I don't believe we are tracking any errors from the event log today, in the test runner.

👌 Good point, I've added an item in #40264 to capture this information.

ymao1 · 2020-12-21T17:45:25Z

@felix-lessoer Any luck getting the event and Kibana logs for this?

felix-lessoer · 2020-12-21T18:16:39Z

I was asking the customer but it looks like after upgrading to 7.10.1 the issue do not exist anymore. So we also dont have fresh logs.

ymao1 · 2020-12-21T18:49:50Z

Are we good to close this issue then? If it crops up again, we can open a new issue?

felix-lessoer · 2020-12-21T21:22:37Z

Yes lets close it ymao1 <notifications@github.com> schrieb am Mo., 21. Dez. 2020, 19:50:

…

Are we good to close this issue then? If it crops up again, we can open a new issue? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#84335 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AILR4F7OZOU5E4JNGSQSIIDSV6KF3ANCNFSM4UCPWR7A> .

mikecote added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Nov 25, 2020

mikecote added the Feature:Alerting label Nov 25, 2020

mikecote added the bug Fixes for quality problems that affect the customer experience label Dec 1, 2020

ymao1 self-assigned this Dec 2, 2020

ymao1 mentioned this issue Dec 7, 2020

[Alerting & Actions ] More debug logging #85149

Merged

ymao1 closed this as completed Dec 22, 2020

kobelb added the needs-team Issues missing a team label label Jan 31, 2022

botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Alerting] Index threshold: Actions not fired as expected #84335

[Alerting] Index threshold: Actions not fired as expected #84335

felix-lessoer commented Nov 25, 2020

elasticmachine commented Nov 25, 2020

mikecote commented Nov 25, 2020

felix-lessoer commented Nov 25, 2020 via email

mikecote commented Nov 25, 2020

pmuellr commented Dec 1, 2020

felix-lessoer commented Dec 1, 2020

felix-lessoer commented Dec 1, 2020

ymao1 commented Dec 3, 2020 •

edited

Loading

pmuellr commented Dec 4, 2020

felix-lessoer commented Dec 4, 2020

pmuellr commented Dec 4, 2020

mikecote commented Dec 4, 2020

ymao1 commented Dec 21, 2020

felix-lessoer commented Dec 21, 2020

ymao1 commented Dec 21, 2020

felix-lessoer commented Dec 21, 2020 via email

[Alerting] Index threshold: Actions not fired as expected #84335

[Alerting] Index threshold: Actions not fired as expected #84335

Comments

felix-lessoer commented Nov 25, 2020

elasticmachine commented Nov 25, 2020

mikecote commented Nov 25, 2020

felix-lessoer commented Nov 25, 2020 via email

mikecote commented Nov 25, 2020

pmuellr commented Dec 1, 2020

felix-lessoer commented Dec 1, 2020

felix-lessoer commented Dec 1, 2020

ymao1 commented Dec 3, 2020 • edited Loading

pmuellr commented Dec 4, 2020

felix-lessoer commented Dec 4, 2020

pmuellr commented Dec 4, 2020

mikecote commented Dec 4, 2020

ymao1 commented Dec 21, 2020

felix-lessoer commented Dec 21, 2020

ymao1 commented Dec 21, 2020

felix-lessoer commented Dec 21, 2020 via email

ymao1 commented Dec 3, 2020 •

edited

Loading