Retry reads when ES unavailable #883

Merged — 5 commits merged into elastic:master on Nov 30, 2021

Conversation

@lykkin (Contributor) commented Nov 16, 2021

What is the problem this PR solves?

Fleet server will now gracefully handle situations where it can't issue reads to Elasticsearch. Issue detailed in https://github.com/elastic/obs-dc-team/issues/627.

How does this PR solve the problem?

Wraps the calls responsible for reading from ES in logic that detects expected availability errors (specifically, by parsing the error message with a regex) and retries up to a maximum number of times with a linearly growing backoff.

Currently the backoff grows in 30 second increments, up to 5 minutes, and it will attempt 20 times before passing the error back to the caller.
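For illustration, a minimal sketch (in Go, not the actual fleet-server code) of a retry wrapper with a linearly growing, capped backoff and a regex-based availability check; the names retryRead and esUnavailableRe, the regex pattern, and the demo values in main are assumptions made for this sketch:

package main

import (
	"errors"
	"fmt"
	"regexp"
	"time"
)

// esUnavailableRe is a hypothetical pattern for availability-style errors; the
// real change matches the specific messages returned when ES cannot be reached.
var esUnavailableRe = regexp.MustCompile(`(?i)connection refused|no such host|i/o timeout`)

func isRetryable(err error) bool {
	return err != nil && esUnavailableRe.MatchString(err.Error())
}

// retryRead retries action with a linearly growing backoff: the delay grows by
// increment on every attempt and is capped at maxDelay. Once maxAttempts is
// reached, the last error is passed back to the caller.
func retryRead(action func() error, maxAttempts int, increment, maxDelay time.Duration) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = action(); err == nil || !isRetryable(err) {
			return err
		}
		delay := time.Duration(attempt) * increment
		if delay > maxDelay {
			delay = maxDelay
		}
		fmt.Printf("ES read failed (attempt %d/%d), retrying in %s: %v\n", attempt, maxAttempts, delay, err)
		time.Sleep(delay)
	}
	return err
}

func main() {
	// The PR uses 30-second increments capped at 5 minutes with 20 attempts;
	// the tiny durations here just keep the demo fast.
	calls := 0
	err := retryRead(func() error {
		calls++
		if calls < 3 {
			return errors.New("dial tcp 127.0.0.1:9200: connect: connection refused")
		}
		return nil
	}, 20, 10*time.Millisecond, 50*time.Millisecond)
	fmt.Println("final result:", err)
}

With the values from this PR (30-second increments, 5-minute cap, 20 attempts) the waits would be 30s, 60s, 90s, and so on, capping at 5 minutes from the 10th attempt onward.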

How to test this PR locally

There are unit tests for the new code.
To run a manual integration test: run an ES instance locally and wire the fleet server up to it. After things have settled, kill the ES instance; log lines describing the retry attempts should appear in the fleet server logs.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works

Related issues

Closes https://github.com/elastic/obs-dc-team/issues/627

@lykkin added labels bug, v8.0.0, v7.16.0, v7.15.0, backport-v7.16.0, backport-v8.0.0, backport-v7.15.0, Team:Elastic-Agent-Control-Plane on Nov 16, 2021
@lykkin self-assigned this Nov 16, 2021
@elasticmachine (Contributor) commented Nov 16, 2021

💚 Build Succeeded


Build stats

  • Start Time: 2021-11-30T08:03:59.610+0000

  • Duration: 9 min 51 sec

  • Commit: 97524dc

Test stats 🧪

Test Results: Passed 229, Failed 0, Skipped 0, Total 229

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@ruflin requested a review from scunningham on November 16, 2021 20:04
@scunningham

@lykkin and I discussed this in Slack. Going to implement the application-level retries here and defer the Bulk engine retry logic until we have a better use case.


type retryActionT func() respT

func RetryWithBackoff(action retryActionT, requestRetryLimit int, shouldRetry RequestRetryPredicateT, requestRetryDelayIncrement time.Duration) respT {

Do we need a requestRetryLimit? What happens after the limit is reached?

@scunningham commented Nov 17, 2021

As mentioned above, we are going to solve this problem higher up for now. In the above code, when the retry limit is hit the function will error out, and you would still have to handle it upstream.
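As an illustration of handling this "upstream", here is a hedged sketch of a polling loop that logs and keeps going when the retried call ultimately fails; runMonitor, the tick interval, and the error text are assumptions rather than fleet-server's actual monitor code:

package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// runMonitor keeps polling even when check returns an error (for example after
// a retry limit was exhausted): it logs the failure and waits for the next tick
// instead of exiting, so a temporarily unavailable Elasticsearch never kills
// the loop.
func runMonitor(ctx context.Context, interval time.Duration, check func(context.Context) error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := check(ctx); err != nil {
				log.Printf("check failed, will retry on next tick: %v", err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 350*time.Millisecond)
	defer cancel()
	err := runMonitor(ctx, 100*time.Millisecond, func(context.Context) error {
		return errors.New("elasticsearch unavailable")
	})
	log.Printf("monitor stopped: %v", err)
}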

@lykkin requested a review from ruflin on November 22, 2021 17:42
Two review threads on internal/pkg/coordinator/monitor.go (outdated, resolved)
@ruflin (Collaborator) commented Nov 24, 2021

I want to add here some more general thoughts around error handling in fleet-server. This PR might not solve all of it but we should make sure it is heading in this direction.

fleet-server is a service. In case of errors it should be able to handle them and eventually recover. There are 2 types of errors: the known ones and the unknown ones. For the known ones we can have special handling in place to improve the situation, but this will not always be possible. For the unknown ones there are 2 options: keep waiting and retrying, or stop the service. Stopping the service is radical and assumes elastic-agent will start a fresh instance of fleet-server. Likely the better option is to just keep waiting and log what the error is so it can be fixed.

The state we should never end up in is that fleet-server is in a deadlock and the only way out is the user restarting it.

@ruflin removed the backport-v7.15.0 label on Nov 25, 2021
@ruflin (Collaborator) left a comment

Overall the change LGTM, but I left some comments related to logging. In the two error cases, we now continue and keep retrying, which I think is good.

As a follow up (separate PR), there is one case I stumbled over on line 114:

	// Ensure leadership on startup
	err = m.ensureLeadership(ctx)
	if err != nil {
		return err
	}

It seems we could also end up in a limbo state here? Should we fully bail out here? Let's first get in the changes you made and then discuss the above as a follow-up.

@@ -157,6 +158,7 @@ func (m *monitorT) handlePolicies(ctx context.Context, hits []es.HitT) error {
var policy model.Policy
err := hit.Unmarshal(&policy)
if err != nil {
m.log.Debug().Err(err).Msg("Failed to deserialize policy json")

What is the reason you put this on Debug level? Do you expect this to happen often?

if err != nil {
m.log.Debug().Err(err).Str("eshost", m.hostMetadata.Name).Msg("Failed to ")
return err

If this error happens, an error on the info level will be logged anyway. Why not directly return the error with the information from the debug log message inside it, so that we also get the Elasticsearch host information at the info level? With this you can remove the Debug log call.

@lykkin (Author) replied:

I figured additional information could be placed at specific error points with more detail; returning a formatted error would achieve the same thing though. I'll update the PR.
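As a minimal sketch of that suggestion (function, field, and message names are illustrative, not the exact code in monitor.go): wrap the error with the Elasticsearch host and return it, so the caller's info-level log carries the context without a separate debug entry. Using %w also keeps sentinel checks such as the es.ErrIndexNotFound comparison working on the wrapped error.

package main

import (
	"errors"
	"fmt"
)

// errIndexNotFound stands in for es.ErrIndexNotFound from the fleet-server code.
var errIndexNotFound = errors.New("index not found")

// wrapESError enriches the error with the Elasticsearch host name; %w keeps
// errors.Is checks working on the wrapped error, so existing sentinel checks
// are unaffected.
func wrapESError(err error, esHost string) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("failed to query policies (eshost %s): %w", esHost, err)
}

func main() {
	err := wrapESError(errIndexNotFound, "es-host-1")
	fmt.Println(err)                              // failed to query policies (eshost es-host-1): index not found
	fmt.Println(errors.Is(err, errIndexNotFound)) // true
}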

@@ -204,6 +209,7 @@ func (m *monitorT) ensureLeadership(ctx context.Context) error {
m.log.Debug().Str("index", m.policiesIndex).Msg(es.ErrIndexNotFound.Error())
return nil
}
m.log.Debug().Err(err).Msg("Encountered error while querying policies")

Same here as above: it will be logged anyway, so you can skip the debug log message and enhance the error itself.

@@ -214,6 +220,7 @@ func (m *monitorT) ensureLeadership(ctx context.Context) error {
leaders, err = dl.SearchPolicyLeaders(ctx, m.bulker, ids, dl.WithIndexName(m.leadersIndex))
if err != nil {
if !errors.Is(err, es.ErrIndexNotFound) {
m.log.Debug().Err(err).Msg("Encountered error while fetching policy leaders")

See above.

ruflin added a commit to ruflin/beats that referenced this pull request Nov 25, 2021
We don't have a good place yet to document expectations around services run by Elastic Agent. I'm starting this document to have a place to add more content to it but I expect long term we need to figure out a better place.

This guideline comes out of recent issues we had in Cloud and local setups of fleet-server (elastic/fleet-server#883). We never set clear guidelines on what the expectation is of a service run by Elastic Agent and of Elastic Agent itself.

This PR is to kick off the discussion.
@lykkin (Author) commented Nov 28, 2021

Overall the change LGTM, but I left some comments related to logging. In the two error cases, we now continue and keep retrying, which I think is good.

As a follow up (separate PR), there is one case I stumbled over on line 114:

	// Ensure leadership on startup
	err = m.ensureLeadership(ctx)
	if err != nil {
		return err
	}

It seems we could also end up in a limbo state here? Should we fully bail out here? Let's first get in the changes you made and then discuss the above as a follow-up.

The block at line 114 is done outside the polling loop (i.e. on startup). It seems like bailing out on startup would be reasonable, since a failure there is more likely to be a misconfiguration.

Two review threads on internal/pkg/coordinator/monitor.go (outdated, resolved)
@ruflin (Collaborator) commented Nov 29, 2021

@lykkin Let's take the discussion around m.ensureLeadership(ctx) to a separate thread.

@ruflin (Collaborator) left a comment

Change LGTM. Let's get this into master and 8.0, do some manual testing with the full build, and then backport to 7.16.

@EricDavisX As soon as this gets in, it would be great if the team could do some tests with it. The way to confirm this works as expected is to set up an Elastic Agent with fleet-server, temporarily cut the connection to Elasticsearch, and then bring it back up again. Everything should go back to normal afterwards.

ruflin added a commit to elastic/beats that referenced this pull request Nov 30, 2021
We don't have a good place yet to document expectations around services run by Elastic Agent. I'm starting this document to have a place to add more content to it but I expect long term we need to figure out a better place.

This guideline comes out of recent issues we had in Cloud and local setups of fleet-server (elastic/fleet-server#883). We never set clear guidelines on what the expectation is of a service run by Elastic Agent and of Elastic Agent itself.

This PR is to kick off the discussion.


Co-authored-by: Gil Raphaelli <g@raphaelli.com>
@EricDavisX (Contributor) commented:

Agree on testing, and merging backports after we've seen it working in 8.0 (ideally).

@dikshachauhan-qasource @amolnater-qasource can you track this and hit it with the next snapshot after merge?

To expand on Nicolas' test info: I'd say we should do it on cloud as well as on-prem. And in both cases, validate both the less-than-5-minute outage and reconnection and the greater-than-5-minute offline reconnect.

@lykkin lykkin merged commit 1fb5d6b into elastic:master Nov 30, 2021
mergify bot added a commit that referenced this pull request Nov 30, 2021
* keep trucking on ES availability errors; more tests to come

(cherry picked from commit 7fb0138)

* don't attempt to distinguish between errors, just keep retrying

(cherry picked from commit 2c75552)

* move error blackholing up the stack so the monitor will never crash, added additional logging

(cherry picked from commit f5fead9)

* pr feedback

(cherry picked from commit 1886dc5)

* upped logging level, properly wrapped errors

(cherry picked from commit 97524dc)

Co-authored-by: bryan <bclement01@gmail.com>
mergify bot added a commit that referenced this pull request Nov 30, 2021
* keep trucking on ES availability errors; more tests to come

(cherry picked from commit 7fb0138)

* don't attempt to distinguish between errors, just keep retrying

(cherry picked from commit 2c75552)

* move error blackholing up the stack so the monitor will never crash, added additional logging

(cherry picked from commit f5fead9)

* pr feedback

(cherry picked from commit 1886dc5)

* upped logging level, properly wrapped errors

(cherry picked from commit 97524dc)

Co-authored-by: bryan <bclement01@gmail.com>
@amolnater-qasource (Collaborator) commented:

Hi @EricDavisX
Sure, we will run this test whenever a new 8.0 Snapshot build is available.

Thanks

@EricDavisX (Contributor) commented Dec 1, 2021 via email

@amolnater-qasource (Collaborator) commented:

Hi @ruflin @EricDavisX
We have re-validated this issue on the latest self-managed 7.16.0 Snapshot.
Steps performed and Observations:

  1. Setup 7.16.0 Snapshot self-managed environment.
  2. Installed Fleet Server and observed it running Healthy.
  3. Terminated elasticsearch.bat and kept it off for 5-7 minutes for the first test (25-30 minutes for the second test).
  4. Restarted elasticsearch.bat and observed below error log:
05:40:17.965
elastic_agent
[elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: status code: 400, fleet-server returned an error: BadRequest

For detailed logs, please find below attached debug level logs for self-managed test:
logs.zip

We have done this test for cloud kibana too:

  • We restarted the deployment and observed error logs under Logs tab.
  • No error logs were observed on restarting kibana deployment.

Further, we restarted the agent and restarted the machine; however, we believe we didn't get the expected error logs.

Query: Is there any way other than restarting the deployment to stop Elasticsearch on a cloud build?

Logs for cloud build elastic-agent are shared below:
logs.zip

Build details 7.16.0 self-managed and cloud Kibana:
Build: 46184
Commit: dfeddea20881915a2654795def35043f6fdc3fc7
Artifact Link: https://snapshots.elastic.co/7.16.0-80ab1dd0/summary-7.16.0-SNAPSHOT.html

Please let us know if we are missing anything.
Thanks

@ruflin (Collaborator) commented Dec 2, 2021

@amolnater-qasource I tested the above on-prem today myself. I'm wondering how long you waited for it to become available again. I had the interesting case that the Elastic Agents without fleet-server took about 5 min to show up again, and the one with fleet-server took around 10 min.

@jlind23 (Contributor) commented Dec 2, 2021

@lykkin could you chime in please?
It does seem to be linked to the "exponential" backoff delay, so nothing weird to me, but I would rather have your acknowledgement.

@amolnater-qasource (Collaborator) commented:

Hi @ruflin
We have tested this on a Fleet Server to which we have added integrations like System and Endpoint Security (including Fleet Server).
We are blocked from testing multiple agents enrolled in a fleet-server on self-managed Kibana because of the reported issue: elastic/beats#28152

Further details for fleet-server(above described) shared below:

S.No.  ES off duration   Fleet-Server back Healthy
1      5-7 minutes       4-5 minutes
2      25-30 minutes     5-7 minutes

Please let us know if anything else is required from our end.
Thanks

@ruflin (Collaborator) commented Dec 3, 2021

@amolnater-qasource Not sure if there is a misunderstanding. When you state above "revalidate issue" do you mean it still exists or it is resolved? My initial understanding was that it still exists even though the fix made it in. But now in the table you show after which time fleet-server is back healthy, so it means it recovers as expected?

@amolnater-qasource (Collaborator) commented:

Hi @ruflin

When you state above "revalidate issue" do you mean it still exists or it is resolved.

We meant to re-test this ticket, and we found that the Fleet Server comes back Healthy after stopping and starting Elasticsearch for the durations shared above.

My initial understanding was that it still exists even though the fix made it in.

We were not sure what kind of retry error logs to expect from fleet-server when Elasticsearch is stopped.
For this reason we shared the logs at #883 (comment)

the table you show after which time fleet-server is back healthy, so it means it recovers as expected?

Yes, fleet-server came back Healthy and is working as expected.

Thanks

@jlind23 (Contributor) commented Dec 7, 2021

@lykkin could you also enhance this PR by adding a more appropriate error message as stated by @ruflin?
