
Cosmos DB: Add ClientRetryPolicy for cross-region failover #18985

Closed
ealsur opened this issue Aug 30, 2022 · 8 comments · Fixed by #22394
Labels
Client (This issue points to a problem in the data-plane of the library.) · Cosmos · feature-request (This issue requires a new behavior in the product in order to be resolved.)

Comments

ealsur (Member) commented Aug 30, 2022

Add a ClientRetryPolicy that leverages the GlobalEndpointManager to do cross-region failover.

Reference: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs or https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ClientRetryPolicy.java (the same policy exists in other languages; Python or NodeJS can also be used as references)

Requires #18983

Requirements:

  • Create a Pipeline Policy that executes PerRetry (see newPipeline in cosmos_client.go); a rough sketch follows this list.
  • The policy should cover the ClientRetryPolicy scenarios:
    • Before the request is sent, use the request information (is it a write or a read request? Were there any previous retries? See the RetryContext in other languages) to decide which Location to use from the GlobalEndpointManager (the gem should be available on the client and can be passed to the new policy)
    • Verify existing retry count limits on existing ClientRetryPolicy implementations
    • Follow ClientRetryPolicy cases from the references, which include:
      • On a 403.3 response, mark the current region as unavailable for writes, refresh the account information, and retry
      • On a 403.1008 response, mark the current region as unavailable for reads, refresh the account information, and retry
      • On an HTTP 503 response, if it's a read request and preferredRegions > 1, retry on the next preferredRegion. If it's a write request, the account is multi-master, and preferredRegions > 1, retry on the next preferredRegion.
      • On 404/1002, retry once on the primary region (session consistency guarantee)
      • On HTTP timeouts (verify how HTTP timeouts surface in Go), retry read requests on the next preferredRegion if preferredRegions > 1
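
A minimal sketch of such a per-retry policy is below. It uses the azcore policy.Policy interface (Do on a *policy.Request) that the Go SDK's pipeline already relies on; the endpointManager interface, its method names, and the use of the x-ms-substatus header are illustrative assumptions, not the SDK's actual GlobalEndpointManager API.

```go
// Illustrative sketch only: endpointManager and its methods are assumed
// stand-ins for whatever the Go SDK's GlobalEndpointManager ends up exposing.
package cosmos

import (
	"net/http"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
)

// endpointManager models the subset of the GlobalEndpointManager the policy
// would need (hypothetical method names).
type endpointManager interface {
	MarkEndpointUnavailableForRead(endpoint string)
	MarkEndpointUnavailableForWrite(endpoint string)
	Update() error // refresh account information
}

type clientRetryPolicy struct {
	gem endpointManager
}

// Do runs once per retry attempt. It forwards the request, then reacts to the
// failover-related responses listed above by marking the current region
// unavailable and refreshing account information so that the retry loop can
// pick the next preferred region.
func (p *clientRetryPolicy) Do(req *policy.Request) (*http.Response, error) {
	resp, err := req.Next()
	if err != nil || resp == nil {
		return resp, err
	}

	subStatus := resp.Header.Get("x-ms-substatus")
	endpoint := req.Raw().URL.Host

	switch {
	case resp.StatusCode == http.StatusForbidden && subStatus == "3":
		// 403.3: this region no longer accepts writes.
		p.gem.MarkEndpointUnavailableForWrite(endpoint)
		_ = p.gem.Update()
	case resp.StatusCode == http.StatusForbidden && subStatus == "1008":
		// 403.1008: this region no longer serves reads.
		p.gem.MarkEndpointUnavailableForRead(endpoint)
		_ = p.gem.Update()
	case resp.StatusCode == http.StatusServiceUnavailable:
		// 503: leave the region list alone; the retry loop decides whether to
		// move to the next preferred region (reads always, writes only when
		// the account is multi-master).
	}
	return resp, err
}
```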
ealsur added the Cosmos and feature-request labels Aug 30, 2022
ealsur moved this to Triage in Azure Cosmos SDKs Aug 30, 2022
RickWinter added the Client label Aug 30, 2022
ealsur moved this from Triage to Approved in Azure Cosmos SDKs May 23, 2023
jay-most (Contributor) commented
Notes: Time: 2 weeks | T-Shirt: S | Covers the failure cases for the gem, e.g. 403 errors, and marking the region unavailable.

fabistb commented Dec 7, 2023

Any update on this?
With the increasing popularity of Dapr, the number of indirect SDK users will increase.
We recently had an outage in West Europe, and since the failover didn't work we ended up in an error state for multiple hours.

ealsur (Member, Author) commented Dec 7, 2023

@fabistb The team is still working on it.

The current SDK uses the Global Domain DNS to handle operations. Outages and their failovers should still be handled by the Global Domain DNS if the account has Service-Managed Failover enabled, because the service takes care of updating the Global Domain DNS to point to another region (assuming the account has more than one).

sding3 commented Feb 1, 2024

(sorry about the wall of text)

We currently use a home-grown Go Cosmos DB SDK, and we are looking at improving it to handle failovers better, so I wanted to get your feedback on our plan, @ealsur.

In our current setup, we always use a Cosmos DB account with a single write region and potentially multiple read regions, and we want the read region to be the same as the write region. Those are our current assumptions.

Recently, we have done some manual failovers and saw our services continuously get 403.3 errors until the services were manually bounced. The 403.3 errors were present even after the "{account}.documents.azure.com" DNS CNAME had been updated from Azure's side. This is because the affected services had existing connections to the IP address of the old region, and those connections were kept alive by continuous API interaction with Cosmos DB, even though the API calls were all receiving 403.3 errors.

Our home-grown Go Cosmos DB SDK currently has no handling of regional routing and no detection of these 403.3 errors. It always makes requests targeting the "{account}.documents.azure.com" DNS name (I believe this is what you refer to as the "Global Domain DNS").

We are looking to add detection for 403.3 response errors. Upon detecting such an error, our client will make a GET request to "https://{account}.documents.azure.com/" to get the current WritableRegions array. If this array contains 1 or more members, our client will begin using the first member's Endpoint field for its subsequent requests to Cosmos DB (an in-memory atomic update). We will treat these 403.3 errors as retryable in our retry handling, so that any remaining retries may succeed as they get steered toward the new region.
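
A rough sketch of that detect-and-switch step is below; the type names and JSON field names are illustrative assumptions about the account metadata response, not code from our actual SDK, and auth headers are omitted.

```go
// Rough sketch of the 403.3 handling described above; names are illustrative.
package cosmosclient

import (
	"encoding/json"
	"net/http"
	"sync/atomic"
)

// accountProperties captures only the piece of the account metadata response
// we care about; the JSON field names here are assumptions.
type accountProperties struct {
	WritableRegions []struct {
		Name     string `json:"name"`
		Endpoint string `json:"databaseAccountEndpoint"`
	} `json:"writableLocations"`
}

type client struct {
	accountEndpoint string       // https://{account}.documents.azure.com/
	currentEndpoint atomic.Value // endpoint used for data-plane requests
	httpClient      *http.Client
}

// onWriteForbidden is invoked when a response comes back as 403 with
// substatus 3. It re-reads the account metadata and atomically switches to
// the first writable region, so remaining retries target the new region.
// (Auth headers on the metadata GET are omitted for brevity.)
func (c *client) onWriteForbidden() error {
	resp, err := c.httpClient.Get(c.accountEndpoint)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var props accountProperties
	if err := json.NewDecoder(resp.Body).Decode(&props); err != nil {
		return err
	}
	if len(props.WritableRegions) > 0 {
		c.currentEndpoint.Store(props.WritableRegions[0].Endpoint)
	}
	return nil
}
```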

What do you think of this approach?

I'm also wondering whether MS has any plans to graduate the Cosmos DB Go SDK to GA status?

ealsur (Member, Author) commented Feb 1, 2024

@sding3 This work item is tracked for GA and the team is actively working on it. The pieces required for it to work have already been completed, so this is one of the next items the team will take on.

If you want to learn what we plan to do, you can look at how other SDKs (like .NET) handle it: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/docs/SdkDesign.md#cross-region-retries

There is more than just 403.3.

sding3 commented Feb 1, 2024

Thanks for your reply. I believe our plan is consistent with the following from the design of the .NET SDK:

HTTP 403 with Substatus 3 - The current region is no longer a Write region (write region failover), the account information is refreshed, and the request is retried on the new Write region.

I'll take a look at the other conditions as well.

topshot99 (Member) commented
@ealsur These Retry policies are present within the Node/JavaScript SDK. You can find more information about them at this link:
https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/cosmosdb/cosmos/src/retry/retryUtility.ts#L117-L136

Is there anything specific that we're overlooking or that you would like us to include?
cc: @sajeetharan

ealsur (Member, Author) commented Feb 26, 2024

@topshot99 Not sure I understand your comment. This issue is for the Go SDK.

ealsur closed this as completed Mar 15, 2024
github-project-automation bot moved this from Approved to Done in Azure Cosmos SDKs Mar 15, 2024
github-actions bot locked and limited conversation to collaborators Jun 13, 2024