Retry reads when ES unavailable #883
Conversation
💚 Build Succeeded
@lykkin and I discussed this in Slack. Going to implement the application-level retries here and defer the Bulk engine retry logic until we have a better use case.
internal/pkg/bulk/retry.go
Outdated
type retryActionT func() respT

func RetryWithBackoff(action retryActionT, requestRetryLimit int, shouldRetry RequestRetryPredicateT, requestRetryDelayIncrement time.Duration) respT {
Do we need a requestRetryLimit? What happens after the limit is reached?
As mentioned above, we are going to solve this problem higher up for now. In the above code, when the retry limit is hit the function will error out, and you would still have to handle it upstream.
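For readers following the thread, here is a minimal, self-contained sketch of what a helper with the RetryWithBackoff signature above could look like. The respT and RequestRetryPredicateT definitions below are placeholders invented for illustration, not the PR's actual types; the behaviour after the limit matches the description above (the last response, error included, is handed back to the caller).

```go
package main

import (
	"fmt"
	"time"
)

// Placeholder types for this sketch; the real definitions in
// internal/pkg/bulk/retry.go are not reproduced here.
type respT struct {
	body []byte
	err  error
}

type retryActionT func() respT
type RequestRetryPredicateT func(respT) bool

// RetryWithBackoff runs action and, while shouldRetry reports the response as
// retryable, waits attempt*requestRetryDelayIncrement and tries again, up to
// requestRetryLimit attempts. Once the limit is hit, the last response
// (including its error) is returned for the caller to handle upstream.
func RetryWithBackoff(action retryActionT, requestRetryLimit int, shouldRetry RequestRetryPredicateT, requestRetryDelayIncrement time.Duration) respT {
	var resp respT
	for attempt := 1; attempt <= requestRetryLimit; attempt++ {
		resp = action()
		if !shouldRetry(resp) {
			return resp
		}
		// Linear backoff: the delay grows by one increment per attempt.
		time.Sleep(time.Duration(attempt) * requestRetryDelayIncrement)
	}
	return resp
}

func main() {
	calls := 0
	resp := RetryWithBackoff(
		func() respT {
			calls++
			if calls < 3 {
				return respT{err: fmt.Errorf("elasticsearch unavailable")}
			}
			return respT{body: []byte("ok")}
		},
		5,
		func(r respT) bool { return r.err != nil },
		10*time.Millisecond,
	)
	fmt.Println(string(resp.body), resp.err)
}
```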
force-pushed from 888bf4a to 7fb0138 (compare)
I want to add here some more general thoughts around error handling in fleet-server. This PR might not solve all of it, but we should make sure it is heading in this direction. fleet-server is a service; in case of errors it should be able to handle them and eventually recover. There are two types of errors: the known ones and the unknown ones. For the known ones we can have special handling in place to improve the situation, but this will not always be possible. For the unknown ones there are two options: keep waiting and retrying, or stop the service. Stopping the service is radical and assumes elastic-agent will start a fresh instance of fleet-server. Likely the better option is to keep waiting and logging what the error is so it can be fixed. The state we should never end up in is that fleet-server is in a deadlock and the only way out is the user restarting it.
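To make the "keep waiting and logging" option concrete, here is a minimal sketch (illustrative only, not fleet-server code) of a run loop that logs failures and retries until the work succeeds or the context is cancelled, so the service can never end up deadlocked on an unknown error:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// runWithRecovery keeps retrying the given work: failures are logged and the
// work is attempted again after a pause. The only ways out of the loop are
// success or context cancellation (i.e. an orderly shutdown).
func runWithRecovery(ctx context.Context, retryDelay time.Duration, work func(context.Context) error) error {
	for {
		err := work(ctx)
		if err == nil {
			return nil
		}
		if ctx.Err() != nil {
			// Shutdown was requested; stop instead of retrying.
			return ctx.Err()
		}
		log.Printf("work failed, will retry in %s: %v", retryDelay, err)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(retryDelay):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	err := runWithRecovery(ctx, time.Second, func(ctx context.Context) error {
		// Stand-in for a read against Elasticsearch that keeps failing.
		return errors.New("elasticsearch unavailable")
	})
	log.Printf("stopped: %v", err)
}
```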
force-pushed from 62853c6 to 2c75552 (compare)
…added additional logging
Overall change LGTM but left some comments related to logging. In the two error cases, we now continue and keep retrying, which I think is good.
As a follow-up (separate PR), there is one case I stumbled over on line 114:
// Ensure leadership on startup
err = m.ensureLeadership(ctx)
if err != nil {
return err
}
It seems we could also end up in a limbo state here? Should we fully bail out here? Let's first get in the changes you made and then discuss the above as a follow-up.
@@ -157,6 +158,7 @@ func (m *monitorT) handlePolicies(ctx context.Context, hits []es.HitT) error {
		var policy model.Policy
		err := hit.Unmarshal(&policy)
		if err != nil {
+			m.log.Debug().Err(err).Msg("Failed to deserialize policy json")
What is the reason you put this on Debug level? Do you expect this to happen often?
internal/pkg/coordinator/monitor.go
Outdated
if err != nil {
	m.log.Debug().Err(err).Str("eshost", m.hostMetadata.Name).Msg("Failed to ")
	return err
If this error happens, an error at the info level will be logged anyway. Why not directly return the error with the information from the debug log message inside, so that we also get the Elasticsearch host information at the info level? With this you can remove the Debug log call.
I figured additional context could be attached at specific error points; returning a formatted error would achieve the same thing, though. I'll update the PR.
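For what the suggestion amounts to, here is a small self-contained example of wrapping the error with %w so the host information travels with it and the caller's single info-level log line carries the context. The names below are stand-ins invented for this example, not fleet-server's APIs:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errIndexNotFound and checkPolicyLeaders are invented stand-ins for
// es.ErrIndexNotFound and the dl.SearchPolicyLeaders call.
var errIndexNotFound = errors.New("index not found")

func checkPolicyLeaders(host string) error {
	// Wrap the underlying error with %w so the host travels with it.
	return fmt.Errorf("failed to fetch policy leaders from %s: %w", host, errIndexNotFound)
}

func main() {
	err := checkPolicyLeaders("es-node-1:9200")
	// errors.Is still sees through the wrapping...
	if !errors.Is(err, errIndexNotFound) {
		log.Fatal("unexpected error type")
	}
	// ...and a single info-level log line already carries the host context.
	log.Printf("coordinator error: %v", err)
}
```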
internal/pkg/coordinator/monitor.go
Outdated
@@ -204,6 +209,7 @@ func (m *monitorT) ensureLeadership(ctx context.Context) error {
		m.log.Debug().Str("index", m.policiesIndex).Msg(es.ErrIndexNotFound.Error())
		return nil
	}
+	m.log.Debug().Err(err).Msg("Encountered error while querying policies")
Same as above: it will be logged anyway, so you can skip the debug log message and enhance the error itself.
internal/pkg/coordinator/monitor.go
Outdated
@@ -214,6 +220,7 @@ func (m *monitorT) ensureLeadership(ctx context.Context) error {
	leaders, err = dl.SearchPolicyLeaders(ctx, m.bulker, ids, dl.WithIndexName(m.leadersIndex))
	if err != nil {
		if !errors.Is(err, es.ErrIndexNotFound) {
+			m.log.Debug().Err(err).Msg("Encountered error while fetching policy leaders")
See above.
We don't have a good place yet to document expectations around services run by Elastic Agent. I'm starting this document to have a place to add more content to it, but I expect long term we need to figure out a better place. This guideline comes out of recent issues we had in Cloud and local setups of fleet-server (elastic/fleet-server#883). We never set clear guidelines on what the expectation is of a service run by Elastic Agent and of Elastic Agent itself. This PR is to kick off the discussion.
The block at line 114 is done outside the polling loop (i.e. on startup). It seems like bailing out on startup would be reasonable, since a failure there is more likely to be a misconfiguration.
@lykkin Let's take the discussion around …
Change LGTM. Let's get this into master and 8.0, do some manual testing with the full build, and then backport to 7.16.
@EricDavisX As soon as this gets in, it would be great if the team could do some tests with it. The way to confirm this works as expected is to set up an Elastic Agent with fleet-server, temporarily cut the connection to Elasticsearch, and then bring it back up again. Everything should go back to normal afterwards.
Agree on testing and merging backports after we've seen it working in 8.0 (ideally). @dikshachauhan-qasource @amolnater-qasource can you track this and hit it with the next snapshot after merge? To expand on Nicolas' test info: I'd say we should do it on cloud as well as on-prem. And in both cases, validate the less-than-5-minute outage and reconnection as well as the greater-than-5-minutes offline reconnect.
* keep trucking on ES availability errors; more tests to come (cherry picked from commit 7fb0138)
* don't attempt to distinguish between errors, just keep retrying (cherry picked from commit 2c75552)
* move error blackholing up the stack so the monitor will never crash, added additional logging (cherry picked from commit f5fead9)
* pr feedback (cherry picked from commit 1886dc5)
* upped logging level, properly wrapped errors (cherry picked from commit 97524dc)
Co-authored-by: bryan <bclement01@gmail.com>
Hi @EricDavisX
Sure, we will run this test whenever the new 8.0 Snapshot build is available.
Thanks
It is testable now on 8.0 and 7.16 - Please give it a shot!
Hi @ruflin @EricDavisX
For detailed logs, please find attached below the debug-level logs for the self-managed test. We have done this test for cloud Kibana too.
Further, we restarted the agent and restarted the machine; however, we assume that we didn't get the expected error logs. Query: Is there any way other than restarting the deployment to stop Elasticsearch on a cloud build? Logs for the cloud build elastic-agent are shared below. Build details: 7.16.0 self-managed and cloud Kibana. Please let us know if we are missing anything.
@amolnater-qasource I tested the above on-prem today myself. I'm wondering how long you waited for it to become available again. I had the interesting case that the Elastic Agents without fleet-server took about 5 min to show up again and the one with fleet-server took around 10 min.
@lykkin could you chime in please?
Hi @ruflin Further details for the fleet-server test (described above) are shared below:
Please let us know if anything else is required from our end.
@amolnater-qasource Not sure if there is a misunderstanding. When you state above "revalidate issue", do you mean it still exists or that it is resolved? My initial understanding was that it still exists even though the fix made it in. But now in the table you show after which time fleet-server is back to healthy, so does that mean it recovers as expected?
Hi @ruflin
We meant to re-test this ticket, and we found that fleet-server comes back healthy after stopping and starting Elasticsearch for the shared durations.
We were not sure what kind of retry error logs we should expect from fleet-server while Elasticsearch is stopped.
Yes, fleet-server came back healthy and is working as expected. Thanks
What is the problem this PR solves?
Fleet server will now gracefully handle situations where it can't issue reads to Elasticsearch. Issue detailed in https://github.com/elastic/obs-dc-team/issues/627.
How does this PR solve the problem?
Wraps the calls responsible for reading from ES in logic that detects expected availability errors (specifically by parsing the error message with a regex) and retries up to a maximum number of times with a linearly growing backoff.
Currently the backoff grows in 30-second increments, up to a 5-minute cap, and it will attempt up to 20 times before passing the error back to the caller.
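As a concrete reading of that description (the 30-second increment, 5-minute cap, and 20-attempt limit come from the text above; the error-message pattern is an assumed example, not the actual regex used in the PR), the retry predicate and delay schedule might look roughly like this:

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

const (
	retryLimit     = 20               // attempts before the error is passed back to the caller
	delayIncrement = 30 * time.Second // linear growth step
	delayCap       = 5 * time.Minute  // ceiling for the delay
)

// availabilityErrPattern is an assumed example; the PR matches expected
// Elasticsearch availability errors by regex, but its exact pattern is not
// reproduced here.
var availabilityErrPattern = regexp.MustCompile(`(?i)connection refused|no such host|unavailable`)

// shouldRetry reports whether an error looks like a transient availability problem.
func shouldRetry(err error) bool {
	return err != nil && availabilityErrPattern.MatchString(err.Error())
}

// backoff returns the linearly growing delay for a given attempt, capped at delayCap.
func backoff(attempt int) time.Duration {
	d := time.Duration(attempt) * delayIncrement
	if d > delayCap {
		d = delayCap
	}
	return d
}

func main() {
	fmt.Println(shouldRetry(fmt.Errorf("dial tcp 127.0.0.1:9200: connect: connection refused")))
	for attempt := 1; attempt <= retryLimit; attempt++ {
		fmt.Printf("attempt %2d -> wait %s\n", attempt, backoff(attempt))
	}
}
```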
How to test this PR locally
There are unit tests for the new code.
To run a manual integration test: run an ES instance locally, and wire the fleet server up to it. After things have settled, kill the ES instance; log lines explaining the retry actions should appear in fleet server.
Checklist
Related issues
Closes https://github.com/elastic/obs-dc-team/issues/627