Background:
The Cosmos .NET V3 SDK should retry in another region when fetching the collection information (Read Collection call) or the partition key range information (Get PkRanges call) if the master partition of the primary region is in complete quorum loss. This does not happen today because the request to the routing gateway takes more than 65 seconds to respond, which times out the SDK request. The SDK makes 3 retries, each of which times out within 65 seconds. Today, the .NET V3 SDK doesn't retry on gateway timeouts (on TaskCanceledException), so if the metadata information cannot be retrieved, the SDK is stuck in initialization.
Account setup for 3 regions: Create a Cosmos account with 3 regions: P1 (Write), P2 (Read) and P3 (Read). The PPAF (per partition automatic failover) configuration from the backend is to fail over to P2 in case P1 is unavailable.
Scenario: While creating the Cosmos client, provide P1, P2 and P3 as the application preferred regions.
CosmosClientOptions clientOptions = new CosmosClientOptions()
{
    // Regions.P1, P2 and P3 are placeholders for the account's actual regions.
    ApplicationPreferredRegions = new List<string>()
    {
        Regions.P1,
        Regions.P2,
        Regions.P3
    },
    EnablePartitionLevelFailover = true,
};
Before initializing the Cosmos client, use the Service Fabric commands to trigger a "full quorum loss" on the master partition.
Current Behavior:
The SDK keeps retrying on region P1 to read the collection information and eventually times out. The diagnostics below illustrate this:
Diagnostics Snippet - Scenario: Master Partitions are in Complete Quorum Loss.
Expected Behavior/ Acceptance Criteria:
Ideally, the above setup should work, and the SDK should retry on region P2 to get the collection information from the gateway. Note that this behavior is expected regardless of whether per partition automatic failover is enabled.
Scope:
The changes discussed are applicable to a CosmosClient set in Direct mode.
These changes apply to the cold start of a CosmosClient, when all of the metadata information needs to be fetched and cached.
High Level Changes:
Update the client retry policy to retry on gateway timeouts. Timeouts often throw a TaskCanceledException, which the CosmosHttpClient eventually wraps in a CosmosException. The idea is to extend the retry policy to retry on such CosmosExceptions.
Create a new retry policy called MetadataRequestThrottleRetryPolicy, a wrapper around the ResourceThrottleRetryPolicy that specifically handles all metadata requests. Its purpose is to mark an endpoint unavailable for reads when a gateway timeout occurs, so that the next retry can happen in another region.
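To make the first change concrete, here is a minimal sketch of how a gateway timeout can surface as a cancellation and be rewrapped as a 503 that a retry policy can act on. It uses only the standard HttpClient. GatewayTimeoutException, NeverRespondingHandler and GatewayCall are hypothetical names for illustration and are not SDK types; the SDK's own wrapping inside CosmosHttpClient may differ.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for CosmosException; only the status code matters here.
public sealed class GatewayTimeoutException : Exception
{
    public GatewayTimeoutException(HttpStatusCode statusCode, Exception inner)
        : base($"Gateway request failed with {(int)statusCode}", inner)
    {
        this.StatusCode = statusCode;
    }

    public HttpStatusCode StatusCode { get; }
}

// Hypothetical handler that never answers, simulating a gateway that
// cannot respond while the master partition is in quorum loss.
public sealed class NeverRespondingHandler : HttpMessageHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Completes only by cancellation, which HttpClient triggers on timeout.
        await Task.Delay(Timeout.InfiniteTimeSpan, cancellationToken);
        return new HttpResponseMessage(HttpStatusCode.OK);
    }
}

public static class GatewayCall
{
    // Converts a timeout (TaskCanceledException derives from
    // OperationCanceledException) into an exception carrying 503.
    public static async Task<HttpResponseMessage> SendAsync(HttpClient client, Uri uri)
    {
        try
        {
            return await client.GetAsync(uri);
        }
        catch (OperationCanceledException ex)
        {
            throw new GatewayTimeoutException(HttpStatusCode.ServiceUnavailable, ex);
        }
    }
}
```

With the timeout expressed as a status code, the retry policy can treat it like any other service-unavailable response instead of letting the cancellation escape unhandled.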
Design Approach:
sequenceDiagram
participant A as ContainerCore.Item <br> [v3 Code]
participant B as ClientContextCore <br> [v3 Code]
participant C as RequestInvokerHandler <br> [v3 Code]
participant D as RetryHandler <br> [v3 Code]
participant E as TransportHandler <br> [v3 Code]
participant F as ServerStoreModel <br> [Direct Code]
participant G as StoreClient <br> [Direct Code]
participant H as ReplicatedResourceClient <br> [Direct Code]
participant I as ConsistencyWriter <br> [Direct Code]
participant J as AddressSelector <br> [Direct Code]
participant K as GlobalAddressResolver <br> [v3 Code]
participant L as AddressResolver <br> [v3 Code]
participant M as PartitionKeyRangeCache <br> [v3 Code]
participant N as GatewayStoreModel <br> [v3 Code]
participant O as GatewayAddressCache <br> [v3 Code]
participant R as GatewayStoreClient <br> [v3 Code]
participant P as CosmosHttpClient <br> [v3 Code]
A->>B: 1. ProcessResourceOperationStreamAsync()
B->>C: 2. SendAsync()
C->>D: 3. SendAsync()
critical ClientRetryPolicy OperationType: Create, ResourceType: Document
D->>E: 4. ProcessMessageAsync()
E->>F: 5. ProcessMessageAsync()
F->>G: 6. ProcessMessageAsync()
G->>H: 7. InvokeAsync()
H->>I: 8. WriteAsync()
I->>J: 9. ResolveAddressesAsync()
J->>K: 10. ResolveAsync()
K->>L: 11. ResolveAsync()
L->>L: 12. ResolveAddressesAndIdentityAsync()
L->>M: 13. TryLookupAsync()
M->>M: 14. GetRoutingMapFor<br>CollectionAsync()
critical Get Pk Ranges using ResourceThrottleRetryPolicy OperationType: ReadFeed, ResourceType: PartitionKeyRange
M->>M: 15. ExecutePartitionKey<br>RangeReadChangeFeedAsync()
M->>N: 16. ProcessMessageAsync()
N->>R: 17. InvokeAsync()
R->>P: 18. SendHttpAsync()
end
critical Get Server Addresses
L->>O: 19. TryGetAddressesAsync()
O->>P: 20. GetAsync()
end
end
Current Flow:
sequenceDiagram
participant J as ClientRetryPolicy <br> [v3 Code]
participant A as ConsistencyWriter <br> [Direct Code]
participant B as AddressSelector <br> [Direct Code]
participant C as GlobalAddressResolver <br> [v3 Code]
participant D as AddressResolver <br> [v3 Code]
participant E as PartitionKeyRangeCache <br> [v3 Code]
participant K as ResourceThrottleRetryPolicy <br> [v3 Code]
participant F as GatewayStoreModel <br> [v3 Code]
participant G as GatewayAddressCache <br> [v3 Code]
participant H as GatewayStoreClient <br> [v3 Code]
participant I as CosmosHttpClient <br> [v3 Code]
J-->>A: 1. DocumentService<br>Request
loop Retry Iterations with force-refresh flag = false/ true
A->>B: 2. ResolveAddressesAsync()
B->>C: 3. ResolveAsync()
C->>D: 4. ResolveAsync()
D->>D: 5. ResolveAddressesAndIdentityAsync()
D->>E: 6. TryLookupAsync()
E->>E: 7. GetRoutingMapFor<br>CollectionAsync()
E->>E: 8. ExecutePartitionKey<br>RangeReadChangeFeedAsync()
E->>F: 9. ProcessMessageAsync()
F->>H: 10. InvokeAsync()
H->>I: 11. SendHttpAsync()
Note over K: PartitionKeyRangeCache uses the <br> ResourceThrottleRetryPolicy to <br> retry the GET PkRanges call
I-->>K: 12. CosmosException with 503 on timeout
K->>K: 13. ShouldRetryAsync() <br> = RetryResult.NoRetry()
end
K-->>J: 14. CosmosException with 503 on timeout
J->>J: 15. RetryAfter <br> (TimeSpan.Zero)
D->>G: 16. TryGetAddressesAsync()
G->>I: 17. GetAsync()
Sample Diagnostics with Current Flow:
Scenario: Master Partitions are in Complete Quorum Loss.
Proposed Flow:
sequenceDiagram
participant J as ClientRetryPolicy <br> [v3 Code]
participant A as ConsistencyWriter <br> [Direct Code]
participant B as AddressSelector <br> [Direct Code]
participant C as GlobalAddressResolver <br> [v3 Code]
participant D as AddressResolver <br> [v3 Code]
participant E as PartitionKeyRangeCache <br> [v3 Code]
participant K as MetadataRequestThrottleRetryPolicy <br> [v3 Code]
participant F as GatewayStoreModel <br> [v3 Code]
participant G as GatewayAddressCache <br> [v3 Code]
participant H as GatewayStoreClient <br> [v3 Code]
participant I as CosmosHttpClient <br> [v3 Code]
participant L as GlobalEndpointManager <br> [v3 Code]
participant M as TransportClient <br> [Direct Code]
J-->>A: 1. DocumentService<br>Request
loop Retry Iterations with force-refresh flag = false/ true
A->>B: 2. ResolveAddressesAsync()
B->>C: 3. ResolveAsync()
C->>D: 4. ResolveAsync()
D->>D: 5. ResolveAddressesAndIdentityAsync()
D->>E: 6. TryLookupAsync()
E->>E: 7. GetRoutingMapFor<br>CollectionAsync()
E->>E: 8. ExecutePartitionKey<br>RangeReadChangeFeedAsync()
E->>K: 9. OnBeforeSendRequest()
K->>K: 10. Sets requestContext.RouteToLocation() <br> using current location index
K->>L: 11. ResolveServiceEndpoint()
L-->>K: 12. Resolve and save current endpoint
E->>F: 13. ProcessMessageAsync(requestContext)
F->>H: 14. InvokeAsync()
H->>I: 15. SendHttpAsync()
Note over K: PartitionKeyRangeCache uses the <br> MetadataRequestThrottleRetryPolicy <br> to retry the GET PkRanges call. <br> OperationType = ReadFeed. <br> ResourceType = PartitionKeyRange
I-->>K: 16. CosmosException with 503
I-->>D: 22. Successful Response in second attempt.
K->>K: 17. ShouldRetryAsync() kicks in.
K->>K: 18. IncrementRetryIndexOnService<br>UnavailableForMetadataRead().<br> Increments the next location index.
K->>K: 19. Retry the request <br> on the next location index.
end
K-->>J: 20. CosmosException with 503
J->>J: 21. RetryAfter <br> (TimeSpan.Zero)
D->>G: 23. TryGetAddressesAsync()
G->>I: 24. GetAsync()
I-->>G: 25. Gets Address Information
G-->>D: 26. Gets Address Information
D-->>A: 27. Gets PerProtocolPartitionAddressInformation
A->>M: 28. InvokeResourceOperationAsync(primaryUri)
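Steps 9 to 19 of the flow above can be sketched as a small state machine: pick the endpoint for the current location index before sending, and on a 503 move the index to the next preferred region. This is a simplified model under stated assumptions, not the SDK's implementation. MetadataRetrySketch and its members are hypothetical, and the cap of one attempt per region is an assumption made here to keep the loop bounded.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical, simplified model of the proposed metadata retry policy.
// The real SDK types and signatures differ.
public sealed class MetadataRetrySketch
{
    private readonly IReadOnlyList<Uri> preferredEndpoints;
    private readonly int maxRegionRetries;
    private int locationIndex;
    private int retryCount;

    public MetadataRetrySketch(IReadOnlyList<Uri> preferredEndpoints)
    {
        if (preferredEndpoints == null || preferredEndpoints.Count == 0)
        {
            throw new ArgumentException("At least one endpoint is required.", nameof(preferredEndpoints));
        }

        this.preferredEndpoints = preferredEndpoints;
        // Assumption: each preferred region is tried at most once.
        this.maxRegionRetries = preferredEndpoints.Count - 1;
    }

    // Steps 9 to 12: resolve the endpoint for the current location index.
    public Uri OnBeforeSendRequest()
    {
        return this.preferredEndpoints[this.locationIndex % this.preferredEndpoints.Count];
    }

    // Steps 17 to 19: on a 503, advance to the next location and retry.
    public bool ShouldRetry(HttpStatusCode statusCode)
    {
        if (statusCode != HttpStatusCode.ServiceUnavailable
            || this.retryCount >= this.maxRegionRetries)
        {
            return false;
        }

        this.retryCount++;
        this.locationIndex++;
        return true;
    }
}
```

A caller would invoke OnBeforeSendRequest() before each attempt and ShouldRetry() after each failure, so a timeout in P1 leads to the next attempt in P2 and then P3.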
Complete Diagnostics After the Above Design Changes:
Scenario: Master Partitions are in Complete Quorum Loss.
Known Issues:
Currently, we have observed that the GET Document calls to fetch the address intermittently fail when the master partition is in complete quorum loss. This is a known issue, and the Routing Gateway team is currently investigating it.
kundadebdatta changed the title from "[Per Partition Automatic Failover] Retry on Next Preferred Region For Gateway Reads If SDK is Timing Out While Connecting to Primary Region" to "[Per Partition Automatic Failover] Retry on Next Preferred Region For Metadata Reads Gateway Timeouts" on Nov 9, 2023.