Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Cosmos] Implements Client Retry policy #22394

Merged
merged 31 commits into from
Mar 15, 2024

Conversation

simorenoh
Copy link
Member

@simorenoh simorenoh commented Feb 15, 2024

Closes #18985.

This PR in it's entirety will include:

Client Retry policy implementation covering:

  • Session retry scenarios (404-1002)
  • Endpoint unavailable retry scenarios (403-1002/8)
  • Service unavailable retry scenarios (503)
  • DNS error scenarios
  • Tests covering all of the above scenarios
  • Tests covering external scenarios like connections dropping (DNS errors)

  • New function added to location_cache to resolve service endpoint to be used on cross-regional retries
  • Tests for the above new function

  • New public surface area client option, DisableCrossRegionalRetries (known as enable_endpoint_discovery in most other SDKs - have to use a negative here for boolean default in Go). -> PR here.

  • Upgrade in go.mod file from go 1.18 to go 1.22 and update ci.yml - 1.21 is the earliest version containing a Go-native min() method and the ci pipelines test current version, N-1 and N-2

@ealsur
Copy link
Member

ealsur commented Feb 16, 2024

I just realized we are missing something critical. We are never "initializing" the SDK. The first operation that comes needs to "initialize" the SDK by performing a request to GEM to obtain the account information, that will initialize the location cache.

Otherwise, requests cannot be routed.

We can certainly leave that part for another PR, but it needs to happen only for the first request (like other SDKs)

@simorenoh
Copy link
Member Author

simorenoh commented Feb 22, 2024

@ealsur On the initialization bit, I thought that was taken care of by the gemPolicy? when we initialize the gem that gets passed to the gem policy we set it's lastUpdateTime to time.Time{} (beginning of time), which ensures the first request in the pipeline will run the Update() method in the gemPolicy, populating the account information in the gem + location cache

@ealsur
Copy link
Member

ealsur commented Mar 1, 2024

Once we have everything in place, these are the manual validation steps we should perform:

  1. Create a real account with regions A and B. Assuming A is write, B is read.
  2. Write a loop, performing 1 write and 1 read (the read do it of an existing doc, not the one created on the write), sleeping for a second inbetween
  3. Configure the client with PreferredRegions B.
  4. Enable logging

Scenario 1 - Removing a region

  1. Verify the Reads are going to Region B (the url endpoint of the request should be the regional of B)
  2. Go to the Azure Portal, remove Region B from the account.
  3. Client should not fail any request, read requests should now show to go to region A

Scenario 2 - Adding a region

  1. With the same loop running, add back region B.
  2. The client should eventually start sending read requests back to region B. This might take some minutes.

Scenario 3 - Failover

  1. With the same loop running, go to the portal, and trigger a Manual Failover, converting Region B into the Write region.
  2. Client should not fail any request, writes should eventually go to region B.

@simorenoh simorenoh merged commit 745ca8f into Azure:main Mar 15, 2024
11 checks passed
@simorenoh simorenoh deleted the cosmos_client_retry_policy branch March 15, 2024 00:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cosmos DB: Add ClientRetryPolicy for cross-region failover
3 participants