Cosmos DB: Add ClientRetryPolicy for cross-region failover #18985
Comments
Notes: Time 2 weeks | T-Shirt S | -> failure cases, for the gem. 403 errors for example and mark unavailable. |
Any update on this? |
@fabistb The team is still working on it. The current SDK uses the Global Domain DNS for its operations. Outages and their failovers should still be handled by the Global Domain DNS if the account has Service Managed Failover enabled, because the service takes care of updating the Global Domain DNS to point to another region (assuming the account has more than one region). |
(sorry about the wall of text) We currently use a home-grown Go Cosmos DB SDK, and we are looking at improving it to handle failovers better. So I wanted to get your feedback on our plan, @ealsur .

In our current setup, we always use a Cosmos DB account with a single write region and potentially multiple read regions, and we want the read region to be the same as the write region. Those are our current assumptions.

Recently, we did some manual failovers and saw our services continuously get 403.3 errors until the services were manually bounced. The 403.3 errors were present even after the "{account}.documents.azure.com" DNS CNAME had been updated on Azure's side. This is because the affected services had existing connections to the prior IP address of the old region, and these connections were kept alive by continuous API interaction with Cosmos DB, even though every API interaction was receiving a 403.3 error.

Our home-grown Go Cosmos DB SDK currently has no handling of regional routing and no detection of these 403.3 errors. It always makes requests targeting the "{account}.documents.azure.com" DNS name (I believe this is what you refer to as the "Global Domain DNS").

We are looking to add detection for 403.3 response errors. Upon detecting such an error, our client will make a GET request to "https://{account}.documents.azure.com/" to get the current

What do you think of this approach? I'm also wondering whether MS has any plans to graduate the Cosmos DB Go SDK to GA status? |
@sding3 This work item is tracked for the GA and the team is actively working on it. The pieces required for it to work were already completed, so this is one of the next ones the team will take on. If you want to learn what we plan to do, you can look at how other SDKs (like .NET) handle it: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/docs/SdkDesign.md#cross-region-retries There is more than just 403.3. |
Thanks for your reply. I believe our plan is consistent with the following from the design of the .NET SDK:
I'll take a look at the other conditions as well. |
@ealsur These Retry policies are present within the Node/JavaScript SDK. You can find more information about them at this link: Is there anything specific that we're overlooking or that you would like us to include? |
@topshot99 Not sure I understand your comment. This Issue is for the Go SDK |
Add ClientRetryPolicy that leverages GlobalEndpointManager to do cross region failover
Reference: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs or https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ClientRetryPolicy.java (the same policy exists in other languages, can use Python or NodeJS also)
Requires #18983
Requirements:
PerRetry (see newPipeline on cosmos_client.go).