Retry reads when ES unavailable #883

Merged — 5 commits merged into elastic:master on Nov 30, 2021

Conversation

@lykkin (Contributor) commented Nov 16, 2021

What is the problem this PR solves?

Fleet server will now gracefully handle situations where it can't issue reads to Elasticsearch. Issue detailed in https://github.com/elastic/obs-dc-team/issues/627.

How does this PR solve the problem?

Wraps the calls responsible for reading from ES in logic that detects expected availability errors (specifically, by parsing the error message with a regex) and retries up to a maximum number of times with a linearly growing backoff.

Currently the backoff grows in 30 second increments, up to 5 minutes, and it will attempt 20 times before passing the error back to the caller.
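For illustration, a minimal sketch (in Go, not the actual fleet-server code) of a retry wrapper with a linearly growing, capped backoff and a regex-based availability check; the names retryRead and esUnavailableRe, the regex pattern, and the demo values in main are assumptions made for this sketch:

package main

import (
	"errors"
	"fmt"
	"regexp"
	"time"
)

// esUnavailableRe is a hypothetical pattern for availability-style errors; the
// real change matches the specific messages returned when ES cannot be reached.
var esUnavailableRe = regexp.MustCompile(`(?i)connection refused|no such host|i/o timeout`)

func isRetryable(err error) bool {
	return err != nil && esUnavailableRe.MatchString(err.Error())
}

// retryRead retries action with a linearly growing backoff: the delay grows by
// increment on every attempt and is capped at maxDelay. Once maxAttempts is
// reached, the last error is passed back to the caller.
func retryRead(action func() error, maxAttempts int, increment, maxDelay time.Duration) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = action(); err == nil || !isRetryable(err) {
			return err
		}
		delay := time.Duration(attempt) * increment
		if delay > maxDelay {
			delay = maxDelay
		}
		fmt.Printf("ES read failed (attempt %d/%d), retrying in %s: %v\n", attempt, maxAttempts, delay, err)
		time.Sleep(delay)
	}
	return err
}

func main() {
	// The PR uses 30-second increments capped at 5 minutes with 20 attempts;
	// the tiny durations here just keep the demo fast.
	calls := 0
	err := retryRead(func() error {
		calls++
		if calls < 3 {
			return errors.New("dial tcp 127.0.0.1:9200: connect: connection refused")
		}
		return nil
	}, 20, 10*time.Millisecond, 50*time.Millisecond)
	fmt.Println("final result:", err)
}

With the values from this PR (30-second increments, 5-minute cap, 20 attempts) the waits would be 30s, 60s, 90s, and so on, capping at 5 minutes from the 10th attempt onward.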

How to test this PR locally

There are unit tests for the new code.
To run a manual integration test: run an ES instance locally and wire the fleet server up to it. After things have settled, kill the ES instance; log lines describing the retry attempts should appear in the fleet server logs.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works

Related issues

Closes https://github.com/elastic/obs-dc-team/issues/627

@lykkin added labels bug, v8.0.0, v7.16.0, v7.15.0, backport-v7.16.0, backport-v8.0.0, backport-v7.15.0, Team:Elastic-Agent-Control-Plane on Nov 16, 2021
@lykkin self-assigned this Nov 16, 2021
@elasticmachine (Contributor) commented Nov 16, 2021

💚 Build Succeeded


Build stats

  • Start Time: 2021-11-30T08:03:59.610+0000

  • Duration: 9 min 51 sec

  • Commit: 97524dc

Test stats 🧪

Test Results: Passed 229, Failed 0, Skipped 0, Total 229

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@ruflin requested a review from scunningham on November 16, 2021 20:04
@scunningham

@lykkin and I discussed this in Slack. Going to implement the application-level retries here and defer the Bulk engine retry logic until we have a better use case.


type retryActionT func() respT

func RetryWithBackoff(action retryActionT, requestRetryLimit int, shouldRetry RequestRetryPredicateT, requestRetryDelayIncrement time.Duration) respT {

Do we need a requestRetryLimit? What happens after the limit is reached?

@scunningham commented Nov 17, 2021

As mentioned above, we are going to solve this problem higher up for now. In the above code, when the retry limit is hit the function will error out, and you would still have to handle it upstream.
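As an illustration of handling this "upstream", here is a hedged sketch of a polling loop that logs and keeps going when the retried call ultimately fails; runMonitor, the tick interval, and the error text are assumptions rather than fleet-server's actual monitor code:

package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// runMonitor keeps polling even when check returns an error (for example after
// a retry limit was exhausted): it logs the failure and waits for the next tick
// instead of exiting, so a temporarily unavailable Elasticsearch never kills
// the loop.
func runMonitor(ctx context.Context, interval time.Duration, check func(context.Context) error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := check(ctx); err != nil {
				log.Printf("check failed, will retry on next tick: %v", err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 350*time.Millisecond)
	defer cancel()
	err := runMonitor(ctx, 100*time.Millisecond, func(context.Context) error {
		return errors.New("elasticsearch unavailable")
	})
	log.Printf("monitor stopped: %v", err)
}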

@lykkin requested a review from ruflin on November 22, 2021 17:42
Two review threads on internal/pkg/coordinator/monitor.go (outdated, resolved)
@ruflin (Collaborator) commented Nov 24, 2021

I want to add here some more general thoughts around error handling in fleet-server. This PR might not solve all of it but we should make sure it is heading in this direction.

fleet-server is a service. In case of errors it should be able to handle them and eventually recover. There are 2 types of errors: the known ones and the unknown ones. For the known ones we can have special handling in place to improve the situation, but this will not always be possible. For the unknown ones there are 2 options: keep waiting and retrying, or stop the service. Stopping the service is radical and assumes elastic-agent will start a fresh instance of fleet-server. Likely the better option is to just keep waiting and log what the error is so it can be fixed.

The state we should never end up in is that fleet-server is in a deadlock and the only way out is the user restarting it.

@ruflin removed the backport-v7.15.0 label on Nov 25, 2021
@ruflin (Collaborator) left a comment

Overall the change LGTM, but I left some comments related to logging. In the two error cases, we now continue and keep retrying, which I think is good.

As a follow up (separate PR), there is one case I stumbled over on line 114:

	// Ensure leadership on startup
	err = m.ensureLeadership(ctx)
	if err != nil {
		return err
	}

It seems we could also end up in a limbo state here? Should we fully bail out here? Let's first get in the changes you made and then discuss the above as a follow-up.

@@ -157,6 +158,7 @@ func (m *monitorT) handlePolicies(ctx context.Context, hits []es.HitT) error {
var policy model.Policy
err := hit.Unmarshal(&policy)
if err != nil {
m.log.Debug().Err(err).Msg("Failed to deserialize policy json")

What is the reason you put this on Debug level? Do you expect this to happen often?

if err != nil {
m.log.Debug().Err(err).Str("eshost", m.hostMetadata.Name).Msg("Failed to ")
return err

If this error happens, an error on the info level will be logged anyway. Why not directly return the error with the information from the debug log message inside it, so that we also get the Elasticsearch host information at the info level? With this you can remove the Debug log call.

@lykkin (Author) replied:

I figured additional information could be placed at specific error points with more detail; returning a formatted error would achieve the same thing though. I'll update the PR.
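As a minimal sketch of that suggestion (function, field, and message names are illustrative, not the exact code in monitor.go): wrap the error with the Elasticsearch host and return it, so the caller's info-level log carries the context without a separate debug entry. Using %w also keeps sentinel checks such as the es.ErrIndexNotFound comparison working on the wrapped error.

package main

import (
	"errors"
	"fmt"
)

// errIndexNotFound stands in for es.ErrIndexNotFound from the fleet-server code.
var errIndexNotFound = errors.New("index not found")

// wrapESError enriches the error with the Elasticsearch host name; %w keeps
// errors.Is checks working on the wrapped error, so existing sentinel checks
// are unaffected.
func wrapESError(err error, esHost string) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("failed to query policies (eshost %s): %w", esHost, err)
}

func main() {
	err := wrapESError(errIndexNotFound, "es-host-1")
	fmt.Println(err)                              // failed to query policies (eshost es-host-1): index not found
	fmt.Println(errors.Is(err, errIndexNotFound)) // true
}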

@@ -204,6 +209,7 @@ func (m *monitorT) ensureLeadership(ctx context.Context) error {
m.log.Debug().Str("index", m.policiesIndex).Msg(es.ErrIndexNotFound.Error())
return nil
}
m.log.Debug().Err(err).Msg("Encountered error while querying policies")

Same here as above: it will be logged anyway, so you can skip the debug log message and enhance the error itself.

@@ -214,6 +220,7 @@ func (m *monitorT) ensureLeadership(ctx context.Context) error {
leaders, err = dl.SearchPolicyLeaders(ctx, m.bulker, ids, dl.WithIndexName(m.leadersIndex))
if err != nil {
if !errors.Is(err, es.ErrIndexNotFound) {
m.log.Debug().Err(err).Msg("Encountered error while fetching policy leaders")

See above.

ruflin added a commit to ruflin/beats that referenced this pull request Nov 25, 2021
We don't have a good place yet to document expectations around services run by Elastic Agent. I'm starting this document to have a place to add more content to it but I expect long term we need to figure out a better place.

This guideline comes out of recent issues we had in Cloud and local setups of fleet-server (elastic/fleet-server#883). We never set clear guidelines on what the expectation is of a service run by Elastic Agent and of Elastic Agent itself.

This PR is to kick off the discussion.
@lykkin (Author) commented Nov 28, 2021

Overall the change LGTM, but I left some comments related to logging. In the two error cases, we now continue and keep retrying, which I think is good.

As a follow up (separate PR), there is one case I stumbled over on line 114:

	// Ensure leadership on startup
	err = m.ensureLeadership(ctx)
	if err != nil {
		return err
	}

It seems we could also end up in a limbo state here? Should we fully bail out here? Let's first get in the changes you made and then discuss the above as a follow-up.

The block at line 114 is done outside the polling loop (i.e. on startup). It seems like bailing out on startup would be reasonable, since a failure there is more likely to be a misconfiguration.

Two review threads on internal/pkg/coordinator/monitor.go (outdated, resolved)
@ruflin (Collaborator) commented Nov 29, 2021

@lykkin Let's take the discussion around m.ensureLeadership(ctx) to a separate thread.

@ruflin (Collaborator) left a comment

Change LGTM. Let's get this into master and 8.0, do some manual testing with the full build, and then backport to 7.16.

@EricDavisX As soon as this gets in, it would be great if the team could do some tests with it. The way to confirm this works as expected is to set up an Elastic Agent with fleet-server, temporarily cut the connection to Elasticsearch, and then bring it back up again. Everything should go back to normal afterwards.

ruflin added a commit to elastic/beats that referenced this pull request Nov 30, 2021
We don't have a good place yet to document expectations around services run by Elastic Agent. I'm starting this document to have a place to add more content to it but I expect long term we need to figure out a better place.

This guideline comes out of recent issues we had in Cloud and local setups of fleet-server (elastic/fleet-server#883). We never set clear guidelines on what the expectation is of a service run by Elastic Agent and of Elastic Agent itself.

This PR is to kick off the discussion.


Co-authored-by: Gil Raphaelli <g@raphaelli.com>
@EricDavisX (Contributor) commented:

Agree on testing, and merging backports after we've seen it working in 8.0 (ideally).

@dikshachauhan-qasource @amolnater-qasource can you track this and hit it with the next snapshot after merge?

To expand on Nicolas' test info: I'd say we should do it on cloud as well as on-prem. And in both cases, validate both the less-than-5-minute outage and reconnection and the greater-than-5-minute offline reconnect.

@lykkin lykkin merged commit 1fb5d6b into elastic:master Nov 30, 2021
mergify bot added a commit that referenced this pull request Nov 30, 2021
* keep trucking on ES availability errors; more tests to come

(cherry picked from commit 7fb0138)

* don't attempt to distinguish between errors, just keep retrying

(cherry picked from commit 2c75552)

* move error blackholing up the stack so the monitor will never crash, added additional logging

(cherry picked from commit f5fead9)

* pr feedback

(cherry picked from commit 1886dc5)

* upped logging level, properly wrapped errors

(cherry picked from commit 97524dc)

Co-authored-by: bryan <bclement01@gmail.com>
mergify bot added a commit that referenced this pull request Nov 30, 2021
* keep trucking on ES availability errors; more tests to come

(cherry picked from commit 7fb0138)

* don't attempt to distinguish between errors, just keep retrying

(cherry picked from commit 2c75552)

* move error blackholing up the stack so the monitor will never crash, added additional logging

(cherry picked from commit f5fead9)

* pr feedback

(cherry picked from commit 1886dc5)

* upped logging level, properly wrapped errors

(cherry picked from commit 97524dc)

Co-authored-by: bryan <bclement01@gmail.com>
@amolnater-qasource (Collaborator) commented:

Hi @EricDavisX
Sure, we will run this test whenever a new 8.0 Snapshot build is available.

Thanks

@EricDavisX (Contributor) commented Dec 1, 2021 via email

@amolnater-qasource (Collaborator) commented:

Hi @ruflin @EricDavisX
We have re-validated this issue on the latest self-managed 7.16.0 Snapshot.
Steps performed and Observations:

  1. Setup 7.16.0 Snapshot self-managed environment.
  2. Installed Fleet Server and observed it running Healthy.
  3. Terminated elasticsearch.bat and kept it off for 5-7 minutes for the first test (25-30 minutes for the second test).
  4. Restarted elasticsearch.bat and observed below error log:
05:40:17.965
elastic_agent
[elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: status code: 400, fleet-server returned an error: BadRequest

For detailed logs, please find below attached debug level logs for self-managed test:
logs.zip

We have done this test for cloud kibana too:

  • We restarted the deployment and observed error logs under Logs tab.
  • No error logs were observed on restarting kibana deployment.

Further, we restarted the agent and restarted the machine; however, we believe we didn't get the expected error logs.

Query: Is there any way other than restarting the deployment to stop Elasticsearch on a cloud build?

Logs for cloud build elastic-agent are shared below:
logs.zip

Build details 7.16.0 self-managed and cloud Kibana:
Build: 46184
Commit: dfeddea20881915a2654795def35043f6fdc3fc7
Artifact Link: https://snapshots.elastic.co/7.16.0-80ab1dd0/summary-7.16.0-SNAPSHOT.html

Please let us know if we are missing anything.
Thanks

@ruflin (Collaborator) commented Dec 2, 2021

@amolnater-qasource I tested the above on-prem today myself. I'm wondering how long you waited for it to become available again. I had the interesting case that the Elastic Agents without fleet-server took about 5 min to show up again, and the one with fleet-server took around 10 min.

@jlind23 (Contributor) commented Dec 2, 2021

@lykkin could you chime in please?
It does seem to be linked to the "exponential" backoff delay, so nothing weird to me, but I would rather have your acknowledgement.

@amolnater-qasource (Collaborator) commented:

Hi @ruflin
We have tested this on a Fleet Server to which we have added integrations like System and Endpoint Security (including Fleet Server).
We are blocked from testing multiple agents enrolled in a fleet-server on self-managed Kibana because of the reported issue: elastic/beats#28152

Further details for fleet-server(above described) shared below:

S.No.  ES off duration   Fleet-Server back Healthy
1      5-7 minutes       4-5 minutes
2      25-30 minutes     5-7 minutes

Please let us know if anything else is required from our end.
Thanks

@ruflin (Collaborator) commented Dec 3, 2021

@amolnater-qasource Not sure if there is a misunderstanding. When you state above "revalidate issue" do you mean it still exists or it is resolved? My initial understanding was that it still exists even though the fix made it in. But now in the table you show after which time fleet-server is back healthy, so it means it recovers as expected?

@amolnater-qasource (Collaborator) commented:

Hi @ruflin

When you state above "revalidate issue" do you mean it still exists or it is resolved.

We meant to re-test this ticket, and we found that the Fleet Server comes back Healthy after stopping and starting Elasticsearch for the durations shared above.

My initial understanding was that it still exists even though the fix made it in.

We were not sure what kind of retry error logs to expect from fleet-server when Elasticsearch is stopped.
For this reason we shared the logs at #883 (comment)

the table you show after which time fleet-server is back healthy, so it means it recovers as expected?

Yes, fleet-server came back Healthy and is working as expected.

Thanks

@jlind23 (Contributor) commented Dec 7, 2021

@lykkin could you also enhance this PR by adding a more appropriate error message as stated by @ruflin?
