
Implement consistent retry policy when calling Elasticsearch #2063

Merged
merged 32 commits into main from artem/actually-retry-es-calls on Jan 29, 2024

Conversation

artem-shelkovnikov
Member

@artem-shelkovnikov artem-shelkovnikov commented Jan 15, 2024

Closes https://github.com/elastic/enterprise-search-team/issues/6412

This PR makes our retry policy consistent across all requests to Elasticsearch.

It's possible to specify the following settings in a config file:

elasticsearch.max_retries: 5
elasticsearch.retry_interval: 10

The above setup states that we'll retry every call (wrapped with the retryable abstraction) 5 times, starting with a 10-second wait before the first retry. Internally the retry policy is linear, so the first retry waits 10 seconds, the second waits 20 seconds, and so on.
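As a quick illustration of that schedule (a sketch of the behaviour described above, not the actual implementation):

    retry_interval = 10  # elasticsearch.retry_interval
    max_retries = 5      # elasticsearch.max_retries

    # Linear policy: attempt N waits N * retry_interval seconds before retrying
    delays = [retry_interval * attempt for attempt in range(1, max_retries + 1)]
    print(delays)  # [10, 20, 30, 40, 50]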

For now, absolutely every call to Elasticsearch will be retried like this (I wrapped all calls with the retrier).

Errors that are retried are:

  • API errors with response codes 429, 500, 502, 503, or 504
  • Connection timeouts

To do this, I've introduced a TransientElasticsearchRetrier class that handles the retry logic.

The retry method signature is:

async def execute_with_retry(self, func):

To retry something, you need to do the following:

license_response = await self._retrier.execute_with_retry(
            self.client.license.get
        )

If the ES call requires arguments, functools.partial should be used:

return await self._retrier.execute_with_retry(
            partial(
                self.client.indices.create,
                index=search_index_name,
                mappings=mappings,
                settings=settings,
            )
        )

A separate class was created because:

  1. I need to make this retry behaviour configurable, so I need to pass configuration values into it. That's not possible unless we expose the config globally, which is not something we do, hence the class.
  2. I'm able to cancel the retries this way - currently retries in the @retryable wrapper are not cancellable, so stopping a running job that's inside a retry loop will generate an error (a minimal sketch of such a retrier follows below)
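For reference, here is a minimal sketch of what such a retrier could look like. This is my illustration only, assuming elasticsearch-py 8.x / elastic_transport exception classes; names like RETRYABLE_STATUSES and cancel() are assumptions, and the merged class has more logic (e.g. cancellable sleeps and logging around each attempt):

    import asyncio

    from elastic_transport import ConnectionTimeout
    from elasticsearch import ApiError


    class TransientElasticsearchRetrier:  # sketch, not the merged implementation
        RETRYABLE_STATUSES = (429, 500, 502, 503, 504)  # assumed constant name

        def __init__(self, logger, max_retries, retry_interval):
            self._logger = logger
            self._max_retries = max_retries
            self._retry_interval = retry_interval
            self._cancelled = False

        def cancel(self):
            # Lets a stopping job break out of the retry loop between attempts
            self._cancelled = True

        async def execute_with_retry(self, func):
            for attempt in range(1, self._max_retries + 1):
                try:
                    return await func()
                except (ApiError, ConnectionTimeout) as e:
                    retryable = isinstance(e, ConnectionTimeout) or (
                        getattr(e, "status_code", None) in self.RETRYABLE_STATUSES
                    )
                    last_attempt = attempt == self._max_retries
                    if not retryable or last_attempt or self._cancelled:
                        raise
                    self._logger.warning(
                        f"Retrying Elasticsearch call (attempt {attempt}): {e}"
                    )
                    # Linear backoff: interval, 2 * interval, 3 * interval, ...
                    await asyncio.sleep(self._retry_interval * attempt)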

Checklists

Pre-Review Checklist

  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • if there is no GH issue, please create it. Each PR should have a link to an issue
  • this PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v7.13.2, v7.14.0, v8.0.0)
  • Considered corresponding documentation changes
  • Contributed any configuration settings changes to the configuration reference

Release Note

Improve connection resiliency when accessing Elasticsearch - retry all calls that fail because Elasticsearch is unhealthy before terminating with an error. This logic is configurable with the newly added elasticsearch.max_retries and elasticsearch.retry_interval configuration fields.

elasticsearch.bulk.max_retries was increased from 3 to 5 to match the newly added elasticsearch.max_retries default value. Additionally, elasticsearch.bulk.retry_interval was added with a default value of 10 seconds (previously hardcoded to 1 second) to match the behaviour of the newly added elasticsearch.retry_interval. As a result, bulk inserts will now wait up to 10 * (1 + 2 + 3 + 4 + 5) = 150 seconds across retries; previously this was 1 * (1 + 2 + 3) = 6 seconds.
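A quick check of that arithmetic, assuming the same linear schedule:

    old_total = sum(1 * n for n in range(1, 4))   # 1 + 2 + 3 = 6 seconds
    new_total = sum(10 * n for n in range(1, 6))  # 10 + 20 + 30 + 40 + 50 = 150 seconds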

@artem-shelkovnikov artem-shelkovnikov changed the title Artem/actually retry es calls WIP: actually retry 429s for all ES calls Jan 15, 2024
@artem-shelkovnikov artem-shelkovnikov marked this pull request as ready for review January 22, 2024 10:37
@artem-shelkovnikov artem-shelkovnikov requested a review from a team January 22, 2024 10:37
@artem-shelkovnikov artem-shelkovnikov changed the title WIP: actually retry 429s for all ES calls Implement consistent retry policy when calling Elasticsearch Jan 22, 2024
@artem-shelkovnikov artem-shelkovnikov requested review from a team and removed request for a team January 22, 2024 15:02
@@ -41,6 +46,10 @@ def __init__(self, config):
use_default_ports_for_scheme=True,
)
self._sleeps = CancellableSleeps()
self._retrier = TransientElasticsearchRetrier(
logger, config.get("max_retries", 5), config.get("retry_timeout", 10)
Contributor

I prefer to use constants like DEFAULT_MAX_RETRIES instead of the raw number. It's just easier to find and change them.

Member Author

Added constants in a4d4f8b
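For context, the suggested constants-based version would look roughly like this (constant names are illustrative, not necessarily what a4d4f8b uses):

    DEFAULT_MAX_RETRIES = 5
    DEFAULT_RETRY_TIMEOUT = 10

    self._retrier = TransientElasticsearchRetrier(
        logger,
        config.get("max_retries", DEFAULT_MAX_RETRIES),
        config.get("retry_timeout", DEFAULT_RETRY_TIMEOUT),
    )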

@vidok
Contributor

vidok commented Jan 24, 2024

@elastic/ingestion-team

Thinking about config options some more - we have elasticsearch.initial_backoff, elasticsearch.backoff_multiplier and elasticsearch.max_wait_duration, which are specific to the Elasticsearch client wait action that happens during startup.

I'm adding elasticsearch.max_retries and elasticsearch.retry_interval that are related to all elasticsearch calls.

Does it make things confusing that these options sit at the root of the elasticsearch config section? Should I put the newly added options under some nested key, like elasticsearch.connection or similar?

Reposting my comment for the record (repeating your ideas :) ):

The new keys `elasticsearch.max_retries` and `elasticsearch.retry_interval` should be in the root because the other root keys already indicate they are used as Elasticsearch connection configuration. 

The old keys can be moved under another prefix because they impact a specific use case. 
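For illustration, the nested layout under discussion might have looked like this (hypothetical only - not what was merged; the new keys stayed at the root):

    elasticsearch.connection.max_retries: 5
    elasticsearch.connection.retry_interval: 10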

@@ -162,6 +165,81 @@ async def wait(self):
await self.close()
return False

async def ping(self):
try:
await self.client.info()
Member

Why is ping not retried?

Member Author

It's not retried because of the wait method - wait calls ping and does its own retrying based on its own config options. See the method:

    async def wait(self):
        backoff = self.initial_backoff_duration
        # ...
        while time.time() - start < self.max_wait_duration:
            if not self._keep_waiting:
                await self.close()
                return False

            # ...
            if await self.ping():
                return True
            await self._sleeps.sleep(backoff)
            backoff *= self.backoff_multiplier

        await self.close()
        return False

Member

As discussed in Slack, I think we should combine wait and ping, but it's totally fine to address that in a separate PR.
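For the record, a rough sketch of what combining them might look like (my speculation of the follow-up, not code from this PR):

    async def wait(self):
        # Sketch: ping() inlined into wait(), so there is a single retry loop
        # instead of wait-level backoff wrapping an unretried ping
        backoff = self.initial_backoff_duration
        start = time.time()
        while time.time() - start < self.max_wait_duration:
            if not self._keep_waiting:
                break
            try:
                await self.client.info()
                return True
            except Exception:
                # Elasticsearch not reachable yet, back off and try again
                await self._sleeps.sleep(backoff)
                backoff *= self.backoff_multiplier
        await self.close()
        return False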

@artem-shelkovnikov
Member Author

@seanstory @vidok @wangch079 I've addressed all the points I've seen in your comments - are there any other things that should be addressed before this PR can be approved?

vidok
vidok previously approved these changes Jan 25, 2024
Contributor

@vidok vidok left a comment

Fantastic work 🚀 !

wangch079
wangch079 previously approved these changes Jan 25, 2024
Member

@wangch079 wangch079 left a comment

Left some comments, but it's totally fine to merge it now.

@artem-shelkovnikov artem-shelkovnikov dismissed stale reviews from wangch079 and vidok via d0b1fdd January 26, 2024 10:41
seanstory
seanstory previously approved these changes Jan 29, 2024
@artem-shelkovnikov artem-shelkovnikov enabled auto-merge (squash) January 29, 2024 16:11
@artem-shelkovnikov artem-shelkovnikov merged commit 17a76e9 into main Jan 29, 2024
2 checks passed
@artem-shelkovnikov artem-shelkovnikov deleted the artem/actually-retry-es-calls branch January 29, 2024 16:17

💔 Failed to create backport PR(s)

The backport operation could not be completed due to the following error:
There are no branches to backport to. Aborting.

The backport PRs will be merged automatically after passing CI.

To backport manually run:
backport --pr 2063 --autoMerge --autoMergeMethod squash
