
[Alerting] Should we retry alerting tasks that fail with Saved object not found errors #100764

Closed
ymao1 opened this issue May 27, 2021 · 9 comments
Labels: Feature:Alerting/RulesFramework, Feature:Alerting, research, Team:ResponseOps

@ymao1 (Contributor) commented May 27, 2021

Currently, if a task fails with a Saved object not found error, it is considered a non-recoverable error and the task is not rescheduled.

schedule: resolveErr<IntervalSchedule | undefined, Error>(schedule, (error) => {
  if (isAlertSavedObjectNotFoundError(error, alertId)) {
    throwUnrecoverableError(error);
  }
  return { interval: taskSchedule?.interval ?? FALLBACK_RETRY_INTERVAL };
}),

Recently, we've had a case where these alerting saved object not found errors were seen alongside other saved object not found errors (indicating a wider problem than just alerting), and the alerting SO did, in fact, exist. Disabling and re-enabling the alert reset the unrecoverable status and the alert started running again. Given that this can happen, should we still consider this error an unrecoverable one?
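For reference, one shape a "recoverable" treatment could take is sketched below. It reuses the names from the snippet above; the logger call and the decision to always fall back to the regular schedule are illustrative assumptions, not existing behavior.

```ts
// Sketch only: instead of permanently stopping the task via throwUnrecoverableError,
// log the failure and keep the task on its normal schedule so a later run can
// retry once the SO (or the cluster) is reachable again.
schedule: resolveErr<IntervalSchedule | undefined, Error>(schedule, (error) => {
  if (isAlertSavedObjectNotFoundError(error, alertId)) {
    // Previously: throwUnrecoverableError(error);
    logger.warn(`Saved object for alert ${alertId} not found; retrying on the next scheduled run`);
  }
  return { interval: taskSchedule?.interval ?? FALLBACK_RETRY_INTERVAL };
}),
```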

@botelastic bot added the needs-team label May 27, 2021
@ymao1 added Feature:Alerting and Team:ResponseOps and removed the needs-team label May 27, 2021
@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr (Member) commented Jun 2, 2021

Heh, I was just poking through the alerting code, wondering in what cases we disable alerts when trying to execute them, because I think we're starting to see more of these in heavily stressed environments where transient networking errors occur.

So, yeah, I'm in favor of just retrying these. In fact, I'm not sure what cases we have where essentially disabling an alert would be the right course of action.

Note: I suspect the specific requests (get SO, or others) are already being retried internally in the ES client, so I don't think it makes a lot of sense to immediately retry. If an alert is going off every 5 minutes, I think it would be fine to skip this execution (it may well fail again within that 5-minute period). But if it's a once-a-day alert, I don't think we want to wait for the next execution. So I'm wondering if we need some logic to determine whether we should retry THIS execution or skip it and just wait for the next one, presumably based on the interval.
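A rough sketch of what that interval-based decision could look like (all names, the 10-minute threshold, and the 5m near-term retry value are illustrative assumptions, not existing alerting code):

```ts
// Hypothetical decision: short-interval rules just wait for the next scheduled run;
// long-interval rules (e.g. once a day) get a near-term retry instead.
const RETRY_THRESHOLD_MS = 10 * 60 * 1000; // intervals longer than this get a near-term retry
const NEAR_TERM_RETRY = '5m';

function intervalToMs(interval: string): number {
  // supports the simple '5m' / '1h' / '1d' style used by rule schedules
  const value = parseInt(interval.slice(0, -1), 10);
  const unit = interval.slice(-1);
  const msPerUnit: Record<string, number> = { s: 1000, m: 60_000, h: 3_600_000, d: 86_400_000 };
  return value * (msPerUnit[unit] ?? 0);
}

function scheduleAfterSoNotFound(ruleInterval: string): { interval: string } {
  return intervalToMs(ruleInterval) > RETRY_THRESHOLD_MS
    ? { interval: NEAR_TERM_RETRY } // long-interval rule: don't wait a whole day
    : { interval: ruleInterval };   // short-interval rule: just wait for the next run
}
```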

@pmuellr (Member) commented Jun 2, 2021

I added a research tag, since we're not sure how to tell apart cases where we should retry because of transient networking issues from cases where the alert SO is really gone.

In the meantime, as a means to help diagnose this during support, I've opened #101227 to see if we can do some better logging when these situations occur.

@YulNaumenko self-assigned this Jun 28, 2021
@gmmorris (Contributor)

I noticed this issue wasn't linked to this one: #102353

In relation to my comment over there, are we thinking of taking an approach whereby "Saved Object wasn't found" does not necessarily mean the SO is missing, since it might be a networking issue?

@YulNaumenko (Contributor)

> I noticed this issue wasn't linked to this one: #102353

> In relation to my comment over there, are we thinking of taking an approach whereby "Saved Object wasn't found" does not necessarily mean the SO is missing, since it might be a networking issue?

Yes, this issue should cover the case where we receive false "not found" errors and delete tasks when we shouldn't.
We have a proposal to fix it in two ways:

  • by retrying getting/updating the alert SO a few times, where we currently only try once (see the sketch after this list). This could resolve some socket hang up / ECONNRESET issues and delays that occur during migrations, ES or Kibana restarts, etc.
  • by changing the alerting code to not delete the task record until the maximum number of retries has been reached. This is more complicated and needs more research into its impact on diagnostic logging size.
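A minimal sketch of the first option (the function name, attempt count, and backoff are assumptions for illustration; the eventual fix landed in Kibana core via #107301 rather than in alerting code):

```ts
// Retry a saved-object fetch/update a few times before treating the failure as real.
const MAX_ATTEMPTS = 3;
const RETRY_DELAY_MS = 1000;

async function getAlertSoWithRetry<T>(fetch: () => Promise<T>): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      return await fetch();
    } catch (error) {
      lastError = error;
      // Transient failures (socket hang up, ECONNRESET, restarts mid-migration)
      // may succeed on a later attempt, so back off briefly and try again.
      await new Promise((resolve) => setTimeout(resolve, RETRY_DELAY_MS * attempt));
    }
  }
  throw lastError;
}
```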

@YulNaumenko (Contributor)

Solved by the Kibana Core PR #107301

@TinaHeiligers (Contributor) commented Aug 10, 2021

@YulNaumenko FYI: We're backporting #107301 to 7.15 but decided to hold off on the 7.14.1 backport until #105557 is done.

@chrisronline (Contributor)

Do we need to do anything on our side to handle this change? I see we're using SavedObjectsErrorHelpers.isNotFoundError and I'm hoping that will automatically distinguish between an actual not found versus this new ES not available one?

@TinaHeiligers (Contributor)

> I see we're using SavedObjectsErrorHelpers.isNotFoundError and I'm hoping that will automatically distinguish between an actual not found versus this new ES not available one?

@chrisronline After the changes introduced from #107301 (with a small bug fix merged today) and #108749, we throw a 503 if we can't be sure that ES is available, so using SavedObjectsErrorHelpers.isNotFoundError should be more reliable.
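Schematically, a consumer-side check after those changes could look like the sketch below. The helpers come from Kibana core's SavedObjectsErrorHelpers; the classify function, the import path, and the exact handling are illustrative assumptions, not alerting's actual implementation.

```ts
// Import path varies by plugin location; shown schematically.
import { SavedObjectsErrorHelpers } from 'src/core/server';

function classifySoError(error: Error): 'definitely-missing' | 'es-unavailable' | 'other' {
  if (SavedObjectsErrorHelpers.isEsUnavailableError(error)) {
    // Post-#107301: a 503 means core could not confirm ES availability,
    // so the SO may still exist -- safe to retry rather than unschedule the task.
    return 'es-unavailable';
  }
  if (SavedObjectsErrorHelpers.isNotFoundError(error)) {
    // A genuine 404: the saved object really is gone.
    return 'definitely-missing';
  }
  return 'other';
}
```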

@kobelb added the needs-team label Jan 31, 2022
@botelastic bot removed the needs-team label Jan 31, 2022