[Alerting] Should we retry alerting tasks that fail with Saved object not found errors #100764
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
Heh, I was just poking through the alerting code, wondering in what cases we disable alerts when trying to execute them. Because I think we're starting to see more of these in heavily stressed environments, where transient networking errors occur. So, yeah, I'm in favor of just retrying these. In fact, I'm not sure what cases we have where essentially disabling an alert would be the right course of action. Note, I suspect the specific requests (get SO, or others) are being retried internally in the es client, so I don't think it makes a lot of sense to immediately retry. If an alert is going off every 5 minutes, I think it would be fine to skip this execution (it may well fail again within that 5 minute period). But if it's a once-a-day alert, I don't think we want to wait for the next execution. So wondering if we need some logic to determine whether we should retry THIS execution, or skip it and just wait for the next, presumably based on the interval. |
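The interval-based idea in the comment above could be sketched roughly as follows. This is only an illustration, not Kibana code: `decideOnTransientFailure`, `RetryDecision`, and both constants are hypothetical names, and the 15-minute threshold and 5-minute retry delay are arbitrary assumptions chosen to match the "every 5 minutes" versus "once a day" examples.

```typescript
// Hypothetical sketch: none of these names exist in the Kibana code base.
type RetryDecision =
  | { action: "retry"; delayMs: number } // re-run THIS execution after a delay
  | { action: "skip" };                  // wait for the next scheduled run

// Assumed cut-off: rules scheduled at or below this interval just wait for
// their next regular execution.
const RETRY_INTERVAL_THRESHOLD_MS = 15 * 60 * 1000;

// Assumed delay before retrying a long-interval rule, giving a transient
// networking problem some time to clear.
const RETRY_DELAY_MS = 5 * 60 * 1000;

function decideOnTransientFailure(
  intervalMs: number,
  thresholdMs: number = RETRY_INTERVAL_THRESHOLD_MS,
  retryDelayMs: number = RETRY_DELAY_MS
): RetryDecision {
  // Frequent rules: the next scheduled run is close enough, so skip.
  if (intervalMs <= thresholdMs) {
    return { action: "skip" };
  }
  // Infrequent rules: retry this execution, never later than the next run.
  return { action: "retry", delayMs: Math.min(retryDelayMs, intervalMs) };
}
```

Under these assumptions, a rule running every 5 minutes would skip the failed execution, while a once-a-day rule would be retried after 5 minutes instead of waiting a full day.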
I added a In the meantime, as a means to help diagnose this during support, I've opened #101227 to see if we can do some better logging when these situations occur. |
I noticed this issue wasn't linked to this one: #102353. In relation to my comment over there, are we thinking of taking an approach whereby "Saved Object wasn't found" does not necessarily mean the SO is missing, since it might be caused by networking issues? |
Yes, the current issue should cover the case where we receive false not found errors and delete tasks when we shouldn't.
|
Solved by the Kibana Core PR #107301 |
@YulNaumenko FYI: We're backporting #107301 to 7.15 but decided to hold off on the 7.14.1 backport until #105557 is done. |
Do we need to do anything on our side to handle this change? I see we're using |
@chrisronline After the changes introduced from #107301 (with a small bug fix merged today) and #108749, we throw a |
Currently, if a task fails with a Saved object not found error, it is considered a non-recoverable error and the task is not rescheduled.
kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts
Lines 580 to 584 in 77452e6
Recently, we've had a case where these alerting saved object not found errors were seen alongside other saved object not found errors (indicating a wider problem than just alerting), and the alerting SO did, in fact, exist. Disabling and re-enabling the alert reset the unrecoverable status, and the alert started running again. Given that this can happen, should we still consider this error an unrecoverable one?