Availability: Adds cross-region retry mechanism on transient connectivity issues #1715

Merged
merged 3 commits into from
Jul 23, 2020

@ealsur ealsur commented Jul 22, 2020

Description

Currently, when a client instance has trouble connecting to a particular regional endpoint (connectivity timeouts, Azure SNAT issues, transient network blips), the connectivity stack retries locally (in the same region). Once those retries are exhausted, the failure bubbles up as the known 503 - Service Unavailable (which normally has a TransportException inside). This PR adds a retry policy that retries such requests in another region.

This retry policy is scoped to a single request; it does not affect other in-flight requests.

Requirements

For the retry mechanism to kick in, the Cosmos DB account must be geo-replicated to more than one region, and the CosmosClient must be initialized with either:

  • CosmosClientOptions.ApplicationRegion, which defines the region where the application is running. This builds a preference list of regions with the ApplicationRegion at the top.
  • CosmosClientOptions.ApplicationPreferredRegions, which defines a preference list of 2 or more regions.

With either of these two initialization options, the client builds a list of preferred regions to connect to (a subset of the account regions).
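
As a rough illustration of the requirement above, the following Python sketch models how a preference list could be derived from the two options. It is not the SDK's code: the function names are hypothetical, the ordering of the remaining regions after ApplicationRegion is an assumption (the description only says the ApplicationRegion ends up on top), and rejecting both options together is a choice made for this sketch.

```python
from typing import List, Optional, Sequence


def build_preference_list(
    account_regions: Sequence[str],
    application_region: Optional[str] = None,
    application_preferred_regions: Optional[Sequence[str]] = None,
) -> List[str]:
    """Hypothetical model of the preferred-region list (not SDK code)."""
    if application_region and application_preferred_regions:
        # Sketch choice: treat the two options as mutually exclusive.
        raise ValueError("set only one of the two options")

    if application_preferred_regions:
        candidates = list(application_preferred_regions)
    elif application_region:
        # Assumption: the remaining regions follow in account order.
        candidates = [application_region] + [
            r for r in account_regions if r != application_region
        ]
    else:
        return []  # neither option set: no preference list

    # Keep only regions the account really has, without duplicates,
    # so the result is a subset of the account regions.
    result: List[str] = []
    for region in candidates:
        if region in account_regions and region not in result:
            result.append(region)
    return result


def cross_region_retry_possible(preference_list: Sequence[str]) -> bool:
    """A second region must exist for a cross-region retry to happen."""
    return len(preference_list) >= 2
```

In this model, a preferred list that overlaps the account in only one region yields a single-entry list, so no cross-region retry would be possible.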

Which operations get retried

Only the following subset of operations will be affected by this retry mechanism (if requirements are met):

  • Single master accounts - Read requests.
  • Multi master accounts - Read requests.
  • Multi master accounts - Write requests.
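
The list above amounts to a simple rule: reads are always candidates, writes only on multi master accounts. A minimal Python sketch of that rule follows; the function and parameter names are hypothetical and are not taken from the SDK.

```python
def is_eligible_for_cross_region_retry(multi_master: bool, is_read: bool) -> bool:
    """Model of the operation filter described in the list above.

    Reads qualify on any account type. Writes qualify only on multi
    master accounts; single master writes are not in the list.
    """
    return is_read or multi_master


# The four combinations, spelled out:
ELIGIBILITY = {
    ("single master", "read"): is_eligible_for_cross_region_retry(False, True),
    ("single master", "write"): is_eligible_for_cross_region_retry(False, False),
    ("multi master", "read"): is_eligible_for_cross_region_retry(True, True),
    ("multi master", "write"): is_eligible_for_cross_region_retry(True, False),
}
```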

Retry flow

  1. One of the retriable operations gets a 503 result from the transport stack.
  2. The retry policy checks the client configuration to obtain the list of preferred regions and to confirm that the requirements above are met.
  3. The retry policy then picks the second region in the preference list.
  4. A single retry is done, repeating the request against the second region.
  5. If the retry also fails with a 503, the 503 is bubbled up to the user.
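
The steps above can be sketched in Python as follows. This is a simplified model, not the SDK's implementation: `send` is a hypothetical callable standing in for the transport stack (including its same-region retries) that returns an HTTP status code, and `eligible` stands for the operation check described earlier.

```python
from typing import Callable, Sequence

SERVICE_UNAVAILABLE = 503


def execute_with_cross_region_retry(
    send: Callable[[str], int],
    preferred_regions: Sequence[str],
    eligible: bool,
) -> int:
    """Send to the first preferred region; on 503, retry once in the second."""
    if not preferred_regions:
        raise ValueError("at least one region is required")

    # Step 1: the request goes to the first preferred region.
    status = send(preferred_regions[0])
    if status != SERVICE_UNAVAILABLE:
        return status

    # Step 2: verify the operation and configuration allow a retry.
    if not eligible or len(preferred_regions) < 2:
        return status

    # Steps 3-4: exactly one retry, against the second region.
    # Step 5: whatever that retry returns (including another 503)
    # is handed back to the caller.
    return send(preferred_regions[1])
```

Because the function holds no shared state, each call only affects the request it was invoked for, which mirrors the per-request scope of the policy.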

Type of change

  • New feature (non-breaking change which adds functionality)

@ealsur ealsur added the improvement Change to existing functional behavior (perf, logging, etc.) label Jul 22, 2020
@ealsur ealsur self-assigned this Jul 22, 2020
@ealsur ealsur merged commit c66dd67 into master Jul 23, 2020
@ealsur ealsur deleted the users/ealsur/503retry branch July 23, 2020 16:18
ghost commented Dec 15, 2021

Closing due to inactivity, please feel free to re-open.
