Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ClientRetryPolicy: Adds Cross Regional Retry Logic on 429/3092 and 410/1022 #4677

Conversation

kundadebdatta
Copy link
Member

@kundadebdatta kundadebdatta commented Sep 11, 2024

Pull Request Template

Description

This PR is covering two CRI scenarios in general:

  • Gone Exception (Status Code: 410) with Lease Not Found (Sub Status Code: 1022).
  • Request Throttled (Status Code: 429) with System Resource Unavailable (Sub Status Code: 3092).

The behavior changes are described below in detail:

Background: In general 410/1022 is used as a signal to indicate that a region is not part of the dynamic quorum - meaning in global strong with dynamic quorum all reads from the secondary region kicked out of the quorum will return 410/1022.

  • Today when a read region loses the read lease it returns 410/LeaseNotFound - the client will then retry locally for up-to 60 seconds (for global strong) and up-to 30 seconds (for others) (catch-all 410 retry like for 410/0 or connectivity related 410) for any.
    Instead the 410/LeaseNotFound(1022) should be mapped immediately to a 503 to allow the cross-regional retry to happen as quickly as possible. Retrying in the next preferred region like usual for reads is acceptable. This has been taken care in the Cosmos.Direct release 3.36.0 and this PR is upgrading the Cosmos.Direct version to 3.36.0 to reflect the change.

  • For 429/3092 (Request Throttled/ SystemResourceUnavailable), we would only want to short-circuit this to retry cross-region quickly when we can actually retry cross region (not a write in single master) - and for reads only when it is returned from all replica in a region. The latter is extremely unlikely - so, I would probably compromise on a pragmatic approach and simply short-circuit by mapping 429/3092 to a 503 for write operations when multi-master is enabled (and there is >1 write regions).
    That way we get the cross-region retry in the scenario where this is most critical.

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Closing issues

To automatically close an issue: closes #4656, #4390

@kundadebdatta kundadebdatta added the Do Not Review Marks a PR in "work in progress" state. label Sep 11, 2024
@kundadebdatta kundadebdatta changed the title ClientRetryPolicy: Adds Cross Regional Retry Logic on 429/ 3092 and 410/ 1022 ClientRetryPolicy: Adds Cross Regional Retry Logic on 429/3092 and 410/1022 Sep 12, 2024
@kundadebdatta kundadebdatta marked this pull request as ready for review September 12, 2024 22:55
@kundadebdatta kundadebdatta added improvement Change to existing functional behavior (perf, logging, etc.) and removed Do Not Review Marks a PR in "work in progress" state. labels Sep 12, 2024
@kundadebdatta kundadebdatta self-assigned this Sep 12, 2024
@NaluTripician
Copy link
Contributor

Since you are updating the direct package in this PR it might be good to include a summary of the changes in the PR description as well.

Copy link
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Thanks!

@@ -3,7 +3,7 @@
<ClientOfficialVersion>3.43.0</ClientOfficialVersion>
<ClientPreviewVersion>3.44.0</ClientPreviewVersion>
<ClientPreviewSuffixVersion>preview.0</ClientPreviewSuffixVersion>
<DirectVersion>3.35.0</DirectVersion>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

410/1022 is directly coming from direct package right? No changes are needed.
How about a separate PR just with that scope?

In-general that's the pattern we are following where direct revision with features it brings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. However, if we are splitting this PR into two (410/1022 and 429/3092), then the Direct PR needed to go first since this work (329/3092) has dependency on the sub status codes, which is originated from the Direct package.

// Today, the only scenario where we would receive a ServiceUnavailableException from the Throttling Retry Policy
// is when we get 410 (Gone) with sub status code 3092 (System Resource Not Available). Note that this is applicable
// for write requests targeted to a multiple master account. In such case, the 410/3092 will get converted into 503.
if (throttleRetryResult.ExceptionToThrow is ServiceUnavailableException)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious: Alternative is to do 429/3092 check before throttlingPolicy here explicitly, that seems direct and simpler right?
Also exception creation is expensive and throttle do last longer and overall impact will be higher.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error combination we are trying to handle here is 429/3092 which itself is a scenario of a throttled response. Therefore, IMO ideal place to handle such cases would be the ResourceThrottleRetryPolicy. This will avoid any unnecessary status combination check that should not belong to the ClientRetryPolocy.

On creating and throwing the exception part, I will double check if this can be optimized, but from the scenario handling perspective, I feel ResourceThrottleRetryPolicy is the ideal place.

@FabianMeiswinkel : Any thoughts on this ?

@@ -529,11 +529,12 @@ public bool ShouldRefreshEndpoints(out bool canRefreshInBackground)

public bool CanUseMultipleWriteLocations(DocumentServiceRequest request)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its used in other scenario like evaluating for sessionToken sending or not.
SDK region count might be stale and it should not be used to make decision of not passing sesisonToken.

These are two different usecases and should be distinguished.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally the background account refresh (InitializeAccountPropertiesAndStartBackgroundRefresh()) should take care of the staleness isn't it ? Any way, will refactor the code to create another method just to safeguard the existing code path.

auto-merge was automatically disabled September 13, 2024 17:33

Pull request was closed

@kundadebdatta kundadebdatta deleted the users/kundadebdatta/4390_cross_regional_retry_on_3092_and_1022 branch September 13, 2024 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Enables automation to merge PRs improvement Change to existing functional behavior (perf, logging, etc.)
Projects
None yet
4 participants