
Implement consistent retry policy when calling Elasticsearch #2063

Merged
merged 32 commits into main from artem/actually-retry-es-calls on Jan 29, 2024

Conversation

artem-shelkovnikov
Member

@artem-shelkovnikov artem-shelkovnikov commented Jan 15, 2024

Closes https://github.com/elastic/enterprise-search-team/issues/6412

This PR makes our retry policy consistent across all requests to Elasticsearch.

It's possible to specify the following settings in a config file:

elasticsearch.max_retries: 5
elasticsearch.retry_interval: 10

The above setup states that we'll retry every call (wrapped with the retryable abstraction) 5 times, starting with a 10-second wait before the first retry. Internally the retry policy is linear, so the first retry waits 10 seconds, the second waits 20 seconds, and so on.
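As a quick illustration of that schedule (a sketch of the behaviour described above, not the actual implementation):

    retry_interval = 10  # elasticsearch.retry_interval
    max_retries = 5      # elasticsearch.max_retries

    # Linear policy: attempt N waits N * retry_interval seconds before retrying
    delays = [retry_interval * attempt for attempt in range(1, max_retries + 1)]
    print(delays)  # [10, 20, 30, 40, 50]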

For now, absolutely every call to Elasticsearch will be retried like this (I wrapped all calls with the retrier).

Errors that are retried are:

  • API errors with response codes 429, 500, 502, 503, or 504
  • Connection timeouts

To do this, I've introduced a TransientElasticsearchRetrier class that handles the retry logic.

The retry method signature is:

async def execute_with_retry(self, func):

To retry something, you need to do the following:

license_response = await self._retrier.execute_with_retry(
            self.client.license.get
        )

If the ES call requires arguments, functools.partial should be used:

return await self._retrier.execute_with_retry(
            partial(
                self.client.indices.create,
                index=search_index_name,
                mappings=mappings,
                settings=settings,
            )
        )

A separate class was created because:

  1. I need to make this retry behaviour configurable, so I need to pass configuration values into it. That's not possible unless we expose the config globally, which is not something we do, hence the class.
  2. I'm able to cancel the retries this way - currently retries in the @retryable wrapper are not cancellable, so stopping a running job that's inside a retry loop will generate an error (a minimal sketch of such a retrier follows below)
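For reference, here is a minimal sketch of what such a retrier could look like. This is my illustration only, assuming elasticsearch-py 8.x / elastic_transport exception classes; names like RETRYABLE_STATUSES and cancel() are assumptions, and the merged class has more logic (e.g. cancellable sleeps and logging around each attempt):

    import asyncio

    from elastic_transport import ConnectionTimeout
    from elasticsearch import ApiError


    class TransientElasticsearchRetrier:  # sketch, not the merged implementation
        RETRYABLE_STATUSES = (429, 500, 502, 503, 504)  # assumed constant name

        def __init__(self, logger, max_retries, retry_interval):
            self._logger = logger
            self._max_retries = max_retries
            self._retry_interval = retry_interval
            self._cancelled = False

        def cancel(self):
            # Lets a stopping job break out of the retry loop between attempts
            self._cancelled = True

        async def execute_with_retry(self, func):
            for attempt in range(1, self._max_retries + 1):
                try:
                    return await func()
                except (ApiError, ConnectionTimeout) as e:
                    retryable = isinstance(e, ConnectionTimeout) or (
                        getattr(e, "status_code", None) in self.RETRYABLE_STATUSES
                    )
                    last_attempt = attempt == self._max_retries
                    if not retryable or last_attempt or self._cancelled:
                        raise
                    self._logger.warning(
                        f"Retrying Elasticsearch call (attempt {attempt}): {e}"
                    )
                    # Linear backoff: interval, 2 * interval, 3 * interval, ...
                    await asyncio.sleep(self._retry_interval * attempt)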

Checklists

Pre-Review Checklist

  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • if there is no GH issue, please create it. Each PR should have a link to an issue
  • this PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v7.13.2, v7.14.0, v8.0.0)
  • Considered corresponding documentation changes
  • Contributed any configuration settings changes to the configuration reference

Release Note

Improve connection resiliency when accessing Elasticsearch - retry all calls that fail because Elasticsearch is unhealthy before terminating with an error. This logic is configurable with the newly added elasticsearch.max_retries and elasticsearch.retry_interval configuration fields.

elasticsearch.bulk.max_retries was increased from 3 to 5 to match the newly added elasticsearch.max_retries default value. Additionally, elasticsearch.bulk.retry_interval was added with a default value of 10 seconds (previously hardcoded to 1 second) to match the behaviour of the newly added elasticsearch.retry_interval. As a result, bulk inserts will now wait up to 10 * (1 + 2 + 3 + 4 + 5) = 150 seconds across retries; previously this was 1 * (1 + 2 + 3) = 6 seconds.
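A quick check of that arithmetic, assuming the same linear schedule:

    old_total = sum(1 * n for n in range(1, 4))   # 1 + 2 + 3 = 6 seconds
    new_total = sum(10 * n for n in range(1, 6))  # 10 + 20 + 30 + 40 + 50 = 150 seconds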

@artem-shelkovnikov artem-shelkovnikov changed the title Artem/actually retry es calls WIP: actually retry 429s for all ES calls Jan 15, 2024
@artem-shelkovnikov artem-shelkovnikov marked this pull request as ready for review January 22, 2024 10:37
@artem-shelkovnikov artem-shelkovnikov requested a review from a team January 22, 2024 10:37
@artem-shelkovnikov artem-shelkovnikov changed the title WIP: actually retry 429s for all ES calls Implement consistent retry policy when calling Elasticsearch Jan 22, 2024
@artem-shelkovnikov artem-shelkovnikov requested review from a team and removed request for a team January 22, 2024 15:02
@@ -41,6 +46,10 @@ def __init__(self, config):
use_default_ports_for_scheme=True,
)
self._sleeps = CancellableSleeps()
self._retrier = TransientElasticsearchRetrier(
logger, config.get("max_retries", 5), config.get("retry_timeout", 10)
Contributor

I prefer to use constants like DEFAULT_MAX_RETRIES instead of the raw number. It's just easier to find and change them.

Member Author

Added constants in a4d4f8b
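For context, the suggested constants-based version would look roughly like this (constant names are illustrative, not necessarily what a4d4f8b uses):

    DEFAULT_MAX_RETRIES = 5
    DEFAULT_RETRY_TIMEOUT = 10

    self._retrier = TransientElasticsearchRetrier(
        logger,
        config.get("max_retries", DEFAULT_MAX_RETRIES),
        config.get("retry_timeout", DEFAULT_RETRY_TIMEOUT),
    )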

@vidok
Contributor

vidok commented Jan 24, 2024

@elastic/ingestion-team

Thinking about config options some more - we have elasticsearch.initial_backoff, elasticsearch.backoff_multiplier and elasticsearch.max_wait_duration, which are specific to the Elasticsearch client wait action that happens during startup.

I'm adding elasticsearch.max_retries and elasticsearch.retry_interval that are related to all elasticsearch calls.

Does it make things confusing that these options sit at the root of the elasticsearch config section? Should I put the newly added options under some nested key, like elasticsearch.connection or similar?

Reposting my comment for the record (repeating your ideas :) ):

The new keys `elasticsearch.max_retries` and `elasticsearch.retry_interval` should be in the root because the other root keys already indicate they are used as Elasticsearch connection configuration. 

The old keys can be moved under another prefix because they impact a specific use case. 
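For illustration, the nested layout under discussion might have looked like this (hypothetical only - not what was merged; the new keys stayed at the root):

    elasticsearch.connection.max_retries: 5
    elasticsearch.connection.retry_interval: 10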

@@ -162,6 +165,81 @@ async def wait(self):
await self.close()
return False

async def ping(self):
try:
await self.client.info()
Member

Why is ping not retried?

Member Author

It's not retried because of the wait method - wait calls ping and does its own retrying based on its own config options. See the method:

    async def wait(self):
        backoff = self.initial_backoff_duration
        # ...
        while time.time() - start < self.max_wait_duration:
            if not self._keep_waiting:
                await self.close()
                return False

            # ...
            if await self.ping():
                return True
            await self._sleeps.sleep(backoff)
            backoff *= self.backoff_multiplier

        await self.close()
        return False

Member

As discussed in Slack, I think we should combine wait and ping, but it's totally fine to address that in a separate PR.
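For the record, a rough sketch of what combining them might look like (my speculation of the follow-up, not code from this PR):

    async def wait(self):
        # Sketch: ping() inlined into wait(), so there is a single retry loop
        # instead of wait-level backoff wrapping an unretried ping
        backoff = self.initial_backoff_duration
        start = time.time()
        while time.time() - start < self.max_wait_duration:
            if not self._keep_waiting:
                break
            try:
                await self.client.info()
                return True
            except Exception:
                # Elasticsearch not reachable yet, back off and try again
                await self._sleeps.sleep(backoff)
                backoff *= self.backoff_multiplier
        await self.close()
        return False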

@artem-shelkovnikov
Member Author

@seanstory @vidok @wangch079 I've addressed all the points I've seen in your comments - are there any other things that should be addressed before this PR can be approved?

vidok
vidok previously approved these changes Jan 25, 2024
Contributor

@vidok vidok left a comment

Fantastic work 🚀 !

wangch079
wangch079 previously approved these changes Jan 25, 2024
Member

@wangch079 wangch079 left a comment

Left some comments, but it's totally fine to merge it now.

@artem-shelkovnikov artem-shelkovnikov dismissed stale reviews from wangch079 and vidok via d0b1fdd January 26, 2024 10:41
seanstory
seanstory previously approved these changes Jan 29, 2024
@artem-shelkovnikov artem-shelkovnikov enabled auto-merge (squash) January 29, 2024 16:11
@artem-shelkovnikov artem-shelkovnikov merged commit 17a76e9 into main Jan 29, 2024
2 checks passed
@artem-shelkovnikov artem-shelkovnikov deleted the artem/actually-retry-es-calls branch January 29, 2024 16:17

💔 Failed to create backport PR(s)

The backport operation could not be completed due to the following error:
There are no branches to backport to. Aborting.

The backport PRs will be merged automatically after passing CI.

To backport manually run:
backport --pr 2063 --autoMerge --autoMergeMethod squash
