
[Alerting] Should we retry alerting tasks that fail with Saved object not found errors #100764

Closed
ymao1 opened this issue May 27, 2021 · 9 comments
Labels: Feature:Alerting/RulesFramework, Feature:Alerting, research, Team:ResponseOps

@ymao1 (Contributor) commented May 27, 2021

Currently, if a task fails with a Saved object not found error, it is considered a non-recoverable error and the task is not rescheduled.

schedule: resolveErr<IntervalSchedule | undefined, Error>(schedule, (error) => {
  if (isAlertSavedObjectNotFoundError(error, alertId)) {
    throwUnrecoverableError(error);
  }
  return { interval: taskSchedule?.interval ?? FALLBACK_RETRY_INTERVAL };
}),

Recently, we've had a case where these alerting saved object not found errors were seen alongside other saved object not found errors (indicating a wider problem than just alerting), and the alerting SO did, in fact, exist. Disabling and re-enabling the alert reset the unrecoverable status and the alert started running again. Given that this can happen, should we still consider this error an unrecoverable one?
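For reference, one shape a "recoverable" treatment could take is sketched below. It reuses the names from the snippet above; the logger call and the decision to always fall back to the regular schedule are illustrative assumptions, not existing behavior.

```ts
// Sketch only: instead of permanently stopping the task via throwUnrecoverableError,
// log the failure and keep the task on its normal schedule so a later run can
// retry once the SO (or the cluster) is reachable again.
schedule: resolveErr<IntervalSchedule | undefined, Error>(schedule, (error) => {
  if (isAlertSavedObjectNotFoundError(error, alertId)) {
    // Previously: throwUnrecoverableError(error);
    logger.warn(`Saved object for alert ${alertId} not found; retrying on the next scheduled run`);
  }
  return { interval: taskSchedule?.interval ?? FALLBACK_RETRY_INTERVAL };
}),
```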

@botelastic bot added the needs-team label May 27, 2021
@ymao1 added Feature:Alerting and Team:ResponseOps and removed the needs-team label May 27, 2021
@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr (Member) commented Jun 2, 2021

Heh, I was just poking through the alerting code, wondering in what cases we disable alerts when trying to execute them, because I think we're starting to see more of these in heavily stressed environments where transient networking errors occur.

So, yeah, I'm in favor of just retrying these. In fact, I'm not sure what cases we have where essentially disabling an alert would be the right course of action.

Note: I suspect the specific requests (get SO, or others) are already being retried internally in the ES client, so I don't think it makes a lot of sense to immediately retry. If an alert is going off every 5 minutes, I think it would be fine to skip this execution (it may well fail again within that 5-minute period). But if it's a once-a-day alert, I don't think we want to wait for the next execution. So I'm wondering if we need some logic to determine whether we should retry THIS execution or skip it and just wait for the next one, presumably based on the interval.
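A rough sketch of what that interval-based decision could look like (all names, the 10-minute threshold, and the 5m near-term retry value are illustrative assumptions, not existing alerting code):

```ts
// Hypothetical decision: short-interval rules just wait for the next scheduled run;
// long-interval rules (e.g. once a day) get a near-term retry instead.
const RETRY_THRESHOLD_MS = 10 * 60 * 1000; // intervals longer than this get a near-term retry
const NEAR_TERM_RETRY = '5m';

function intervalToMs(interval: string): number {
  // supports the simple '5m' / '1h' / '1d' style used by rule schedules
  const value = parseInt(interval.slice(0, -1), 10);
  const unit = interval.slice(-1);
  const msPerUnit: Record<string, number> = { s: 1000, m: 60_000, h: 3_600_000, d: 86_400_000 };
  return value * (msPerUnit[unit] ?? 0);
}

function scheduleAfterSoNotFound(ruleInterval: string): { interval: string } {
  return intervalToMs(ruleInterval) > RETRY_THRESHOLD_MS
    ? { interval: NEAR_TERM_RETRY } // long-interval rule: don't wait a whole day
    : { interval: ruleInterval };   // short-interval rule: just wait for the next run
}
```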

@pmuellr (Member) commented Jun 2, 2021

I added a research tag, since we're not sure how to tell apart cases where we should retry because of transient networking issues from cases where the alert SO is really gone.

In the meantime, as a means to help diagnose this during support, I've opened #101227 to see if we can do some better logging when these situations occur.

@YulNaumenko self-assigned this Jun 28, 2021
@gmmorris (Contributor)

I noticed this issue wasn't linked to this one: #102353

In relation to my comment over there, are we thinking of taking an approach whereby "Saved Object wasn't found" does not necessarily mean the SO is missing, since it might be a networking issue?

@YulNaumenko (Contributor)

> I noticed this issue wasn't linked to this one: #102353

> In relation to my comment over there, are we thinking of taking an approach whereby "Saved Object wasn't found" does not necessarily mean the SO is missing, since it might be a networking issue?

Yes, this issue should cover the case where we receive false "not found" errors and delete tasks when we shouldn't.
We have a proposal to fix it in two ways:

  • by retrying getting/updating the alert SO a few times, where we currently only try once (see the sketch after this list). This could resolve some socket hang up / ECONNRESET issues and delays that occur during migrations, ES or Kibana restarts, etc.
  • by changing the alerting code to not delete the task record until the maximum number of retries has been reached. This is more complicated and needs more research into its impact on diagnostic logging size.
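A minimal sketch of the first option (the function name, attempt count, and backoff are assumptions for illustration; the eventual fix landed in Kibana core via #107301 rather than in alerting code):

```ts
// Retry a saved-object fetch/update a few times before treating the failure as real.
const MAX_ATTEMPTS = 3;
const RETRY_DELAY_MS = 1000;

async function getAlertSoWithRetry<T>(fetch: () => Promise<T>): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      return await fetch();
    } catch (error) {
      lastError = error;
      // Transient failures (socket hang up, ECONNRESET, restarts mid-migration)
      // may succeed on a later attempt, so back off briefly and try again.
      await new Promise((resolve) => setTimeout(resolve, RETRY_DELAY_MS * attempt));
    }
  }
  throw lastError;
}
```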

@YulNaumenko (Contributor)

Solved by the Kibana Core PR #107301

@TinaHeiligers (Contributor) commented Aug 10, 2021

@YulNaumenko FYI: We're backporting #107301 to 7.15 but decided to hold off on the 7.14.1 backport until #105557 is done.

@chrisronline (Contributor)

Do we need to do anything on our side to handle this change? I see we're using SavedObjectsErrorHelpers.isNotFoundError and I'm hoping that will automatically distinguish between an actual not found versus this new ES not available one?

@TinaHeiligers (Contributor)

> I see we're using SavedObjectsErrorHelpers.isNotFoundError and I'm hoping that will automatically distinguish between an actual not found versus this new ES not available one?

@chrisronline After the changes introduced from #107301 (with a small bug fix merged today) and #108749, we throw a 503 if we can't be sure that ES is available, so using SavedObjectsErrorHelpers.isNotFoundError should be more reliable.
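Schematically, a consumer-side check after those changes could look like the sketch below. The helpers come from Kibana core's SavedObjectsErrorHelpers; the classify function, the import path, and the exact handling are illustrative assumptions, not alerting's actual implementation.

```ts
// Import path varies by plugin location; shown schematically.
import { SavedObjectsErrorHelpers } from 'src/core/server';

function classifySoError(error: Error): 'definitely-missing' | 'es-unavailable' | 'other' {
  if (SavedObjectsErrorHelpers.isEsUnavailableError(error)) {
    // Post-#107301: a 503 means core could not confirm ES availability,
    // so the SO may still exist -- safe to retry rather than unschedule the task.
    return 'es-unavailable';
  }
  if (SavedObjectsErrorHelpers.isNotFoundError(error)) {
    // A genuine 404: the saved object really is gone.
    return 'definitely-missing';
  }
  return 'other';
}
```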

@kobelb added the needs-team label Jan 31, 2022
@botelastic bot removed the needs-team label Jan 31, 2022