
[Internal] Researching capabilities to support PPAF for .NET, Java SDK and Compute Gateway #3499

Closed
5 tasks done
philipthomas-MSFT opened this issue Oct 12, 2022 · 7 comments
Assignees
Labels
automatic failover documentation Engineering engineering improvements (CI, tests, etc.) gallium-semester Gallium Semester Deliverable improvement Change to existing functional behavior (perf, logging, etc.) PerPartitionAutomaticFailover

Comments


philipthomas-MSFT commented Oct 12, 2022

Purpose statement

This document describes how to enhance the Cosmos DB experience by achieving even higher availability.

Description: The Microsoft Azure Cosmos DB .NET SDK Version 3 plans to support per-partition automatic failover for data-plane and control-plane operations issued to server and master partitions, respectively, under strong consistency. There will be a separate document to address the Java SDK. This scope also extends to Compute Gateway and Cassandra over Compute.

Tasks

  • Identifying stakeholders
  • Understanding source current architecture
  • Detailing the target future architecture and gaps
  • Describing the testing strategy
  • Identifying performance and security concerns
  • Defining criteria for per-partition automatic failovers
  • Defining out of scope and future states
  • Architecture katas
  • Outlining supportability, manageability, and configuration
  • Providing visual aids (diagrams, flow charts, C4, etc.)

Stakeholders

  • Routing Gateway
    • Dinesh Billa
  • Backend / HA
    • Mikael Horal
    • Abhishek Kumar
    • Andres Araya
    • Josh Rowe
  • Compute Gateway
    • Vinod Sridharan
  • Azure Cosmos DB .NET SDK for NoSQL
    • Philip Thomas
    • Fabian Meiswinkel
  • ARM Management Workflow
    • Naveen Verma
  • Other collaborations
    • Matias Quaranta - General knowledge
    • Kiran Kumar Kolli - General knowledge
    • Sourabh Jain - Client telemetry
    • Arooshi Avasthy - Distributed tracing

Resources

Out of scope

Scope of work

The Microsoft Azure Cosmos DB .NET SDK Version 3 for NoSQL API needs to achieve higher availability for strong consistency by implementing per-partition automatic failover for single-region write accounts. The premise is that if communication with a partition, either server or master, meets the criteria for per-partition automatic failover, then the SDK will automatically try to promote the next available read region to a write region. This must also work for the Cassandra API over Compute. Although other Cosmos DB APIs are not supported initially, the shared code base between the SDK client and Compute should make this work for them automatically.

Strong consistency is supported in this iteration of development because it guarantees reads from the most recent committed version of an item. Although other consistency levels are pretermitted for now, there are plans to support them in the future.

[insert visual aid here]

Before we continue, for clarity and level-setting: a server partition is a partition that is used for data-plane document operations. A master partition is a partition used for control-plane (database, collection, etc.) and metadata (account) operations for building and accessing the location, collection, and partition key range caches. Next, we will talk about the criteria for per-partition automatic failover.

Criteria for per-partition automatic failover

It is important to note that other sub status codes exist for these statuses. Offline conversations are happening with collaborating teams to determine whether we need to expand our criteria for per-partition automatic failover or continue to pretermit them, but as of this moment, this is the complete list. The pretermitted HTTP sub statuses are listed below.

  • HTTP statuses

  • Modes, operations, and http statuses

    • Direct mode, Data-plane operations that yield a Service Unavailable/Unknown (503.0) HTTP status
    • Direct mode, Data-plane operations that yield a Forbidden/WriteForbidden (403.3)
    • Direct mode, Data-plane operations that yield a Request Timeout (408)
    • Gateway mode, Data-plane operations that yield a Service Unavailable/Unknown (503.0) HTTP status
    • Gateway mode, Data-plane operations that yield a Forbidden/WriteForbidden (403.3)
    • Gateway mode, Data-plane operations that yield a Request Timeout (408)
    • Gateway mode, Control-plane operations that yield a Service Unavailable/Unknown (503.0) HTTP status
    • Gateway mode, Control-plane operations that yield a Forbidden/WriteForbidden (403.3)
    • Gateway mode, Control-plane operations that yield a Request Timeout (408)
    • Gateway mode, Control-plane meta data operations that yield a Service Unavailable/Unknown (503.0) HTTP status
    • Gateway mode, Control-plane meta data operations that yield a Forbidden/WriteForbidden (403.3)
    • Gateway mode, Control-plane meta data operations that yield a Request Timeout (408)
  • Pretermitted HTTP sub statuses

    • Service Unavailable (503)

      • InsufficientBindablePartitions (1007)
      • ComputeFederationNotFound (1012)
      • OperationPaused (9001)
      • ServiceIsOffline (9002)
      • InsufficientCapacity (9003)
      • ServerGenerated503 (21008)
    • Forbidden (403)

      • ProvisionLimitReached (1005)
      • DatabaseAccountNotFound (1008)
      • RedundantCollectionPut (1009)
      • SharedThroughputDatabaseQuotaExceeded (1010)
      • SharedThroughputOfferGrowNotNeeded (1011)
      • PartitionKeyQuotaOverLimit (1014)
      • SharedThroughputDatabaseCollectionCountExceeded (1019)
      • SharedThroughputDatabaseCountExceeded (1020)
      • ComputeInternalError (1021)
      • ThroughputCapQuotaExceeded (1028)
      • InvalidThroughputCapValue (1029)
      • RbacOperationNotSupported (5300)
      • RbacUnauthorizedMetadataRequest (5301)
      • RbacUnauthorizedNameBasedDataRequest (5302)
      • RbacUnauthorizedRidBasedDataRequest (5303)
      • RbacRidCannotBeResolved (5304)
      • RbacMissingUserId (5305)
      • RbacMissingAction (5306)
      • RbacRequestWasNotAuthorized (5400)
      • NspInboundDenied (5307)
      • NspAuthorizationFailed (5308)
      • NspNoResult (5309)
      • NspInvalidParam (5310)
      • NspInvalidEvalResult (5311)
      • NspNotInitiated (5312)
      • NspOperationNotSupported (5313)
    • Request Timeout (408)

It should also be noted that, since certain Gone (410) HTTP statuses and sub statuses are converted to Service Unavailable (503), they are eligible for per-partition automatic failover while others are not. Please refer to SdkDesign for more information.
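The criteria and pretermitted sub statuses above can be summarized as an eligibility predicate. The following Python sketch is illustrative only (the actual SDK is .NET, and the function and constant names are hypothetical); the status/sub-status pairs come directly from the lists above.

```python
# (status, sub_status) pairs that trigger PPAF, per the criteria above.
PPAF_ELIGIBLE = {
    (503, 0),  # Service Unavailable / Unknown
    (403, 3),  # Forbidden / WriteForbidden
}

# Sub statuses explicitly pretermitted (excluded) from PPAF, per the lists above.
PRETERMITTED_503 = {1007, 1012, 9001, 9002, 9003, 21008}
PRETERMITTED_403 = {1005, 1008, 1009, 1010, 1011, 1014, 1019, 1020, 1021,
                    1028, 1029, 5300, 5301, 5302, 5303, 5304, 5305, 5306,
                    5400, 5307, 5308, 5309, 5310, 5311, 5312, 5313}

def is_ppaf_eligible(status: int, sub_status: int) -> bool:
    """Return True when a response meets the PPAF failover criteria."""
    if status == 408:
        # Request Timeout qualifies; no pretermitted sub statuses are listed for it.
        return True
    if status == 503 and sub_status in PRETERMITTED_503:
        return False
    if status == 403 and sub_status in PRETERMITTED_403:
        return False
    return (status, sub_status) in PPAF_ELIGIBLE
```

A response is therefore eligible when it is a 503/Unknown, a 403.3, or a 408, and its sub status is not on the pretermitted list.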

Current base architecture

Currently, we support per-partition automatic failover to other regions in a couple of ways that give us anywhere from limited to optimal support for a successful per-partition automatic failover. Please refer to #2395.

The first is the control-plane metadata (account) information that is requested over HTTP via the global account endpoint. If the SDK client is cold, meaning it is being initialized for the first time, the SDK client has access to the regions/locations that were managed and configured at the account level. If the SDK client is hot, meaning it has already been initialized by a previous request, the SDK client has access to the regions/locations that are cached in the LocationCache, which avoids making further HTTP requests to the gateway endpoint. There are some triggers that invoke a refresh.

The second is the ApplicationPreferredRegions property on CosmosClientOptions, which is set at design time by the customer within the SDK client.

When both of these are available to leverage, the SDK client gives you the most optimal form of per-partition automatic failover when the failure criteria are met. It is also important to note that the EnablePartitionLevelFailover boolean flag must be set to true on CosmosClientOptions in order for the per-partition automatic failover logic to be executed. Having just one or the other gives the SDK client limited per-partition automatic failover support. Having neither gives the SDK client no per-partition automatic failover support, and we will talk about that next.

Here is a more detailed breakdown and analysis of the current baseline architecture.

  • LIMITED: All cold-start master partition failures that have application preferred regions set in CosmosClientOptions, irrespective of connectivity mode (Direct/Gateway) or operation (Data/Control), have limited failover to just the application preferred regions.
    • Reason: Account-level regions are not set because the master partition could not be reached to accept the control-plane metadata requests responsible for returning a list of available regions.
    • Reason: Application preferred regions were set in the CosmosClientOptions.
  • LIMITED: All warm-start master partition failures that do not have application preferred regions set in CosmosClientOptions have limited failover to the account-level regions, irrespective of connectivity mode or operation.
    • Reason: Account-level regions are set because the master partition could be reached to accept the control-plane metadata requests responsible for returning a list of available regions.
    • Reason: Application preferred regions were not set in the CosmosClientOptions.
  • LIMITED: All cold- and warm-start server partition failures that do not have application preferred regions set in CosmosClientOptions have limited failover to the account-level regions, irrespective of connectivity mode or operation.
    • Reason: Account-level regions are set because the master partition could be reached to accept the control-plane metadata requests responsible for returning a list of available regions.
    • Reason: Application preferred regions were not set in the CosmosClientOptions.
  • OPTIMAL: All warm-start master and server partition failures that have application preferred regions set in CosmosClientOptions have a better failover strategy using both the application preferred regions and the account-level regions, irrespective of connectivity mode or operation.
    • Reason: Account-level regions are set because the master partition could be reached to accept the control-plane metadata requests responsible for returning a list of available regions.
    • Reason: Application preferred regions were set in the CosmosClientOptions.
  • OPTIMAL: All cold-start server partition failures that have application preferred regions set in CosmosClientOptions have a better failover strategy using both the application preferred regions and the account-level regions, irrespective of connectivity mode or operation.
    • Reason: Account-level regions are set because the master partition could be reached to accept the control-plane metadata requests responsible for returning a list of available regions.
    • Reason: Application preferred regions were set in the CosmosClientOptions.
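The limited and optimal cases above boil down to which region sources are available. A minimal Python sketch for illustration (the function name and the merge order of preferred over account-level regions are assumptions, not the SDK's actual algorithm):

```python
def failover_regions(account_regions, preferred_regions):
    """Return the ordered list of candidate failover regions.

    Returns [] when neither source is available, which is the
    no-per-partition-automatic-failover case described below.
    """
    if account_regions and preferred_regions:
        # Optimal: preferred order first, then any remaining account-level regions.
        remaining = [r for r in account_regions if r not in preferred_regions]
        return list(preferred_regions) + remaining
    # Limited: whichever single source is available.
    return list(preferred_regions or account_regions or [])
```

For example, with account-level regions `["West US", "East US"]` and preferred regions `["East US"]`, the candidate order would be `["East US", "West US"]`.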

No per-partition automatic failover cases

For those cases where the SDK client is cold, the criteria for per-partition automatic failover are met while attempting to request control-plane metadata (account) information to access regions/locations, and the customer has not set ApplicationPreferredRegions on CosmosClientOptions, there is no per-partition automatic failover support, which will usually result in an online support call or a manual failover. For clarity and level-setting, a manual failover is when a read region is intentionally and manually promoted to a write region via the Azure Portal, and the defaulted/preferred write region that is offline is demoted to a read region when it comes back online. To learn more, please refer to High Availability.

[insert visual aid here]

Proposed solution

It would be advantageous to enhance per-partition automatic failover within the SDK client by introducing DNS TXT records that are both configured and managed by the current ARM management workflow. More on this below. The routing gateway team has already adopted this as a solution and is currently responsible for creating the DNS TXT records. It is up to the SDK team to enhance the SDK client to leverage these DNS TXT records in the event that there is no way to access account information from the gateway endpoints. The DNS TXT records will include other regional account names that the SDK client can iterate through and cache once a successful request has been achieved. Next, we will talk about the two most reasonable solutions for querying DNS TXT records within the SDK client.

For clarity and level-setting, DNS TXT records are a type of Domain Name System (DNS) record in text format, which contain information about your domain.

[insert visual aid here]

Branch

https://github.com/Azure/azure-cosmos-dotnet-v3/tree/users/philipthomas-MSFT/per-partition-failover-dns-query-txt-records

DNS TXT record

Key (Global database account endpoint)

testaccount.srd.documents.azure.com

Value

{
	"domainName": "documents.azure.com",
	"globalDatabaseAccountName": "testaccount",
	"orderedRegionalAccountNames": [
		"testaccount-wus",
		"testaccount-eus",
		"testaccount-scus"
	]
}
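The TXT record value above can be parsed into an ordered list of regional endpoints. A Python sketch for illustration; the `https://<regional-account-name>.<domainName>` endpoint format is an assumption for the example, not a confirmed contract:

```python
import json

def regional_endpoints(txt_value: str) -> list:
    """Parse a PPAF DNS TXT record value into ordered regional endpoints.

    The endpoint URL shape is a hypothetical assumption for illustration.
    """
    record = json.loads(txt_value)
    domain = record["domainName"]
    return [f"https://{name}.{domain}"
            for name in record["orderedRegionalAccountNames"]]
```

Applied to the sample value above, this yields the west, east, and south-central US endpoints in priority order.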

Configuration and management

  • Owned by the Routing Gateway and ARM Management Workflow Teams
    • On msdata repository
      • ProvisionDatabaseAccountWorkflow2.cs
      • DatabaseAccountTXTRecord.cs
      • PartitionFailoverRetryPolicy.cs
      • PartitionFailoverRetryPolicyTests.cs
      • GatewayDNSRecordProvider.cs
    • Once the account has been updated and PPAF is enabled on the account, DNS TXT records should be available within a maximum of 60 seconds.

Shading DNS client inside of SDK client

  • Pros
  • Cons

Shading DNS client inside hosted federated server

  • Pros
  • Cons
    • Set up a meeting to discuss why hosting DNS lookups is against best practices, in contrast to resolving them in the local client. Include Josh Rowe.

Further below is a larger, "exhaustive" table of DNS solutions, each of which has, in one way or another, more pros than cons.

Open-source software

Performance

  • Latency
    • An increase in latency is expected because attempts to communicate with potential endpoints are necessary.
  • Caching
    • If the criteria for per-partition automatic failover are met, the new write region endpoint that the SDK client uses to successfully make a request will be cached, to prevent the SDK client from needing to perform another DNS TXT record query and iterate through the list of regional account names.
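As a sketch of the caching idea (illustrative Python; the class name and keying by partition key range are assumptions about how the SDK might track the promoted write endpoint):

```python
class PartitionWriteEndpointCache:
    """Caches the promoted write endpoint per partition key range, so a
    repeat failure does not require another DNS TXT record query."""

    def __init__(self):
        self._overrides = {}  # pk_range_id -> endpoint

    def get(self, pk_range_id):
        return self._overrides.get(pk_range_id)

    def promote(self, pk_range_id, endpoint):
        # Record the newly promoted write endpoint for this partition.
        self._overrides[pk_range_id] = endpoint

    def invalidate(self, pk_range_id):
        # Drop the override, e.g. once the original write region is back online.
        self._overrides.pop(pk_range_id, None)
```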

Security

  • Validating
    • If the criteria for per-partition automatic failover are met and the DNS TXT record is queried, all regional account names must undergo some form of validation to avoid any opportunity for DNS spoofing or man-in-the-middle attacks.
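One minimal form such validation could take, sketched in Python (the exact validation policy is an assumption; shown here as a domain check plus a prefix check of each regional account name against the global account name):

```python
def validate_regional_names(global_account, domain, names,
                            expected_domain="documents.azure.com"):
    """Reject TXT record contents whose domain or regional account names
    do not match the expected shape, reducing the blast radius of a
    spoofed DNS record. Hypothetical policy for illustration only."""
    if domain != expected_domain:
        return False
    # Each regional account name must extend the global account name.
    return all(n.startswith(global_account + "-") for n in names)
```

Real validation would likely also include TLS certificate checks on the resolved endpoints rather than relying on name shape alone.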

Areas of impact

Supportability

Client telemetry
Distributed tracing

TBD

Diagnostic logging
  • Clarification
    • Is the per-partition automatic failover logic, including FailedReplicas, correct?
    • Should RegionsContacted have a list of the regions that were contacted?
    • Create an issue for ContactedReplicas.
    • StoreResponseStatistics should show failed endpoints.
Sample Diagnostics
  
    {
	    "Summary": {
		    "DirectCalls": {
			    "(201, 0)": 1
		    }
	    },
	    "name": "CreateItemAsync",
	    "start datetime": "2023-06-08T14:58:45.537Z",
	    "duration in milliseconds": 0.3899,
	    "data": {
		    "Client Configuration": {
			    "Client Created Time Utc": "2023-06-08T14:58:45.1839346Z",
			    "MachineId": "hashedMachineName:25bbdc53-3a51-4190-8877-5eafa4f5e7ac",
			    "NumberOfClientsCreated": 1,
			    "NumberOfActiveClients": 1,
			    "ConnectionMode": "Direct",
			    "User Agent": "cosmos-netstandard-sdk/3.34.0|1|X64|Microsoft Windows 10.0.22621|.NET 6.0.16|L|F 00000010|",
			    "ConnectionConfig": {
				    "gw": "(cps:50, urto:10, p:False, httpf: True)",
				    "rntbd": "(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)",
				    "other": "(ed:False, be:False)"
			    },
			    "ConsistencyConfig": "(consistency: Strong, prgns:[East US, West US], apprgn: )",
			    "ProcessorCount": 12
		    }
	    },
	    "children": [
		    {
			    "name": "ItemSerialize",
			    "duration in milliseconds": 0.0313
		    },
		    {
			    "name": "Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler",
			    "duration in milliseconds": 0.3051,
			    "children": [
				    {
					    "name": "Get Collection Cache",
					    "duration in milliseconds": 0.0004
				    },
				    {
					    "name": "Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler",
					    "duration in milliseconds": 0.248,
					    "data": {
						    "System Info": {
							    "systemHistory": [
								    {
									    "dateUtc": "2023-06-08T14:58:45.1271989Z",
									    "cpu": 3.442,
									    "memory": 32641284.0,
									    "threadInfo": {
										    "isThreadStarving": "no info",
										    "availableThreads": 32764,
										    "minThreads": 12,
										    "maxThreads": 32767
									    },
									    "numberOfOpenTcpConnection": 0
								    }
							    ]
						    }
					    },
					    "children": [
						    {
							    "name": "Microsoft.Azure.Cosmos.Handlers.RetryHandler",
							    "duration in milliseconds": 0.2427,
							    "children": [
								    {
									    "name": "Microsoft.Azure.Cosmos.Handlers.RouterHandler",
									    "duration in milliseconds": 0.2367,
									    "children": [
										    {
											    "name": "Microsoft.Azure.Cosmos.Handlers.TransportHandler",
											    "duration in milliseconds": 0.2353,
											    "children": [
												    {
													    "name": "Microsoft.Azure.Documents.ServerStoreModel Transport Request",
													    "duration in milliseconds": 0.1958,
													    "data": {
														    "Client Side Request Stats": {
															    "Id": "AggregatedClientSideRequestStatistics",
															    "ContactedReplicas": [
																    {
																	    "Count": 1,
																	    "Uri": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859499990p/"
																    },
																    {
																	    "Count": 1,
																	    "Uri": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859470000s/"
																    },
																    {
																	    "Count": 1,
																	    "Uri": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859471110s/"
																    },
																    {
																	    "Count": 1,
																	    "Uri": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859472220s/"
																    }
															    ],
															    "RegionsContacted": [],
															    "FailedReplicas": [],
															    "AddressResolutionStatistics": [],
															    "StoreResponseStatistics": [
																    {
																	    "ResponseTimeUTC": "2023-06-08T14:58:45.5379391Z",
																	    "ResourceType": "Document",
																	    "OperationType": "Create",
																	    "LocationEndpoint": "https://testserviceunavailableexceptionscenarioasync-westus.documents.azure.com/",
																	    "StoreResult": {
																		    "ActivityId": "5abbc286-b504-431f-85f5-00f595a9644c",
																		    "StatusCode": "Created",
																		    "SubStatusCode": "Unknown",
																		    "LSN": 58593,
																		    "PartitionKeyRangeId": "1",
																		    "GlobalCommittedLSN": 58593,
																		    "ItemLSN": -1,
																		    "UsingLocalLSN": false,
																		    "QuorumAckedLSN": -1,
																		    "SessionToken": null,
																		    "CurrentWriteQuorum": -1,
																		    "CurrentReplicaSetSize": -1,
																		    "NumberOfReadRegions": -1,
																		    "IsValid": true,
																		    "StorePhysicalAddress": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859499990p/",
																		    "RequestCharge": 0,
																		    "RetryAfterInMs": null,
																		    "BELatencyInMs": null,
																		    "transportRequestTimeline": null,
																		    "TransportException": null
																	    }
																    }
															    ]
														    }
													    }
												    }
											    ]
										    }
									    ]
								    }
							    ]
						    }
					    ]
				    }
			    ]
		    },
		    {
			    "name": "Response Serialization",
			    "duration in milliseconds": 0.0283
		    }
	    ]
    }
  
 

Testing

Use cases/scenarios

Please use Gherkin syntax (Given, When and Then)

Critical paths where there is no per-partition automatic failover support in the current baseline architecture

Cold SDK client, explicit data-plane operation, Implicit control-plane meta data (account) point of failure

Given a customer wants to create a new item (CreateItemAsync) using a cold Microsoft Azure Cosmos DB .NET SDK client,
    And the CosmosClientOptions has EnablePartitionLevelFailover set to true,
    And the ConnectivityMode is set to Direct or Gateway mode,
    And the CosmosClientOptions does not have ApplicationPreferredRegions set,
When the SDK client attempts to request control-plane metadata (account) information from the global database account gateway endpoint and the response is a ServiceUnavailable/Unknown, Forbidden/WriteForbidden, or RequestTimeout HTTP status,
Then the SDK client will query DNS to get the regional account names from the DNS TXT record based on the global database account endpoint,
    And the SDK client will iterate through, validate, and attempt to communicate with all regional account names in the DNS TXT record until a read region is online,
    And the SDK client will promote that read region to the primary write region,
    And the SDK client will cache the primary write region,
    And the original write region will be demoted to a read region once it is back online.

Cold SDK client, explicit control-plane operation, implicit control-plane meta data (account) point of failure

Given a customer wants to create a new collection (CreateCollectionAsync) using a cold Microsoft Azure Cosmos DB .NET SDK client,
    And the CosmosClientOptions has EnablePartitionLevelFailover set to true,
    And the ConnectivityMode is set to Direct or Gateway mode,
    And the CosmosClientOptions does not have ApplicationPreferredRegions set,
When the SDK client attempts to request control-plane metadata (account) information from the global database account gateway endpoint and the response is a ServiceUnavailable/Unknown, Forbidden/WriteForbidden, or RequestTimeout HTTP status,
Then the SDK client will query DNS to get the regional account names from the DNS TXT record based on the global database account endpoint,
    And the SDK client will iterate through, validate, and attempt to communicate with all regional account names in the DNS TXT record until a read region is online,
    And the SDK client will promote that read region to the primary write region,
    And the SDK client will cache the primary write region,
    And the original write region will be demoted to a read region once it is back online.
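The Then-steps shared by both scenarios above can be sketched as a simple loop (illustrative Python; the probe function and cache are hypothetical stand-ins for SDK internals):

```python
def failover_to_first_online(regional_endpoints, is_online, cache):
    """Iterate the regional endpoints from the DNS TXT record until one is
    online, then promote and cache it as the write endpoint."""
    for endpoint in regional_endpoints:
        if is_online(endpoint):                  # attempt to communicate
            cache["write_endpoint"] = endpoint   # promote and cache
            return endpoint
    return None  # no region reachable; surface the failure to the caller
```

The demotion of the original write region once it comes back online would be handled separately (e.g., by invalidating the cached override).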

Unit (Gated pipeline)

  • DNS TXT record querying
  • Criterion for per-partition automatic failover
  • Caching, accessing and invalidating promoted write region
  • Feature flags

Emulator (Gated pipeline)

  • E2E

Performance/Benchmarking (Gated pipeline)

  • E2E
  • DNS TXT record querying

Security/Penetration (Gated pipeline)

  • DNS TXT record querying

DNS solution comparison matrix

Option A: Reference DnsClient.NET OSS libraries in the SDK
  • Pros
    • Community support to address any issues and/or bugs
    • SDK team does not need to be a DNS expert
    • DNS querying is hidden from CX
  • Cons
    • Dependency requires Azure Central approval
  • Notes
    • DnsClient GitHub OSS

Option B: Use native API in the SDK
  • Pros
    • Code is owned by the SDK team and would require a certain level of expertise
    • Can write code using iphlpapi.dll interop, specifically GetAdaptersAddresses and GetNetworkParams
    • DNS querying is hidden from CX
  • Cons
    • SDK team would have to become SMEs for DNS
    • This would be treated as a new project initiative
    • Must meet multiple cross-functional requirements and standards (performance, security, operating system supportability, etc.)
    • We would need to duplicate this across all Azure Cosmos DB SDKs

Option C: Shade DnsClient.NET OSS in the SDK
  • Pros
    • Code can exist within the Azure Cosmos DB .NET SDK
    • Code is owned by the SDK team and would require a certain level of expertise
    • No need to write our own code to support DNS querying
    • DNS querying is hidden from CX
  • Cons
    • Any enhancements and bug fixes applied to the originating DnsClient.NET repo would need to be applied to the Azure Cosmos DB .NET SDK manually, so staying fresh and up to date would be problematic
    • Java and other Azure Cosmos DB SDKs would need separate implementations and could not leverage DnsClient.NET for DNS querying for SRV records
  • Notes
    • DnsClient GitHub OSS, LookupClient

Option D: Ask DNS team to add query capabilities
  • Pros
    • DNS querying is supported by the DNS team's existing library, not the SDK team, and can be leveraged for other initiatives that require this type of capability
    • No need to write our own code to support DNS querying
    • DNS querying is hidden from CX
  • Cons
    • If the DNS team agreed to this, we would be bound to their development availability and delivery timeline, which would drastically affect the timeline and milestones for this project initiative

Option E: Add endpoint to ToolsFederation
  • Pros
    • DNS querying is supported by the DNS team's existing library, not the SDK team, and can be leveraged for other initiatives that require this type of capability
    • The implementation can be shared across all Azure Cosmos DB SDKs (.NET, Java, etc.)
  • Cons
    • It is considered a bad practice to perform DNS queries that are not resolved locally because of the options that system administrators could configure for DNS
      • Looking for documentation to support this claim
  • Notes
    • Hosted on ToolsFederation (same as the ClientTelemetry endpoint), owned by Pedro Balaguer. Negotiate who will do the work. Short term. Regional endpoints.

Option F: Add endpoint to Compute Gateway
  • Pros
    • The implementation can be shared across all Azure Cosmos DB SDKs (.NET, Java, etc.)
    • The signature and output would be consistent
  • Cons
    • It is considered a bad practice to perform DNS queries that are not resolved locally because of the options that system administrators could configure for DNS
      • Looking for documentation to support this claim
  • Notes
    • Hosted on Compute Gateway, owned by Dinesh. Long term. Regional endpoints.

Option G: Callback function
  • Pros
    • The implementation can be shared across all Azure Cosmos DB SDKs (.NET, Java, etc.)
    • The signature and output would be consistent
  • Cons
    • Each CX is responsible for implementing "how" the DNS querying would work based on code the CX has to write
    • Strict guidelines would be difficult to enforce
    • The work required to actually retrieve the SRV record is totally up to the CX
    • Supportability is less than desirable; many support incidents could come to the SDK team
    • We would need to duplicate this across all Azure Cosmos DB SDKs
  • Notes
    • We would have to give some direction to the customer on how to write code to query DNS using a callback function.

Option H: Plugin architecture
  • Pros
    • Can use a plugin via AssemblyDependencyResolver
  • Cons
    • Each CX is responsible for implementing "how" the DNS querying would work based on code the CX has to write
    • Strict guidelines would be difficult to enforce
    • The work required to actually retrieve the SRV record is totally up to the CX
    • Supportability is less than desirable; many support incidents could come to the SDK team
    • We would need to duplicate this across all Azure Cosmos DB SDKs
  • Notes
    • We would have to give some direction to the customer on how to write code to query DNS using an IoC plugin architecture.

Tasks

  1. PerPartitionAutomaticFailover
    kundadebdatta
  2. PerPartitionAutomaticFailover
    kundadebdatta
  3. PerPartitionAutomaticFailover
    kundadebdatta
  4. PerPartitionAutomaticFailover
    kundadebdatta
  5. PerPartitionAutomaticFailover
    kundadebdatta
@philipthomas-MSFT philipthomas-MSFT added documentation Engineering engineering improvements (CI, tests, etc.) improvement Change to existing functional behavior (perf, logging, etc.) automatic failover labels Oct 12, 2022
@philipthomas-MSFT philipthomas-MSFT added this to the Azure Cosmos SDKs milestone Oct 12, 2022
@philipthomas-MSFT philipthomas-MSFT self-assigned this Oct 12, 2022
@philipthomas-MSFT philipthomas-MSFT changed the title [Internal] Designing and adding capabilities to support using SRV for PPAF [Internal] Researching capabilities to support PPAF by reading DNS records Apr 18, 2023
@philipthomas-MSFT philipthomas-MSFT changed the title [Internal] Researching capabilities to support PPAF by reading DNS records [Internal] Researching capabilities to support PPAF for .NET, Java SDK and Compute Gateway May 5, 2023
@philipthomas-MSFT philipthomas-MSFT added the gallium-semester Gallium Semester Deliverable label Jun 14, 2023

mikaelhoral-microsoft commented Jun 16, 2023

Scope of work section

Suggestion:

The Microsoft Azure Cosmos DB .NET SDK Version 3 for NoSQL API needs to achieve higher availability for strong consistency by implementing per-partition automatic failover for both single-region write accounts. The premise is that if communications to a partition, either server or master, meets the criterion for per-partition automatic failover, then the SDK will automatically try to promote the next available read region to a write region.

...

Strong consistency is supported in this iteration of development due to its guaranteed reads from the most recent committed version of an item. Although other consistency levels are pretermitted, there will be plans to support them in the future.

I would reword this. For example:

The Microsoft Azure Cosmos DB .NET SDK Version 3 for NoSQL API needs to be able to correctly respond to a backend partition failing over for single-region write accounts in order to achieve higher availability. Upon a backend partition - either server or master - failover, the SDK needs to automatically detect this condition and redirect subsequent write requests to the new write region for the partition. Per-partition automatic failover is initially rolled out only for select Strong Consistency accounts, but will later be available for all consistency levels.

@mikaelhoral-microsoft

Criteria for per-partition automatic failover

Q: What do you mean by "Pretermitted HTTP sub statuses"? Are these HTTP status codes for which we explicitly do not want to failover?

@mikaelhoral-microsoft

More generally, the Cosmos DB Core team proposes a solution where we should always attempt to retry any error in a different region (based on a region priority list), possibly after first retrying on a different in-region replica, UNLESS the error is a very specific one which clearly indicates that we shouldn't retry in a different region (e.g., a split, where the backend returns 410.1002). One rationale for this approach is that the Backend can't possibly know about every possible error condition, and there are many errors the Backend is not in control of (the network stack and Service Fabric, just to name a few). I also spoke to Fabian about this and he agrees; a couple of points from my discussion with him:

  • There are error codes such as 403.3 that are surefire indications that we should retry in a different region - we should do that.
  • There are other error codes (e.g., 410.1002, or say a user error code saying "document too large") in which case we should not
  • A lot of error codes may indicate that we should try a different replica in same region and/or different region - the logic for PPAF ought to more closely resemble what we already do in the multi-writer case here!

As for refreshing of partition state, we propose that this is not done for PPAF. Once we establish that a partition has failed over (e.g., from region A to B), the SDK should add an override for that partition (pk range); this override remains in place until we see failures in the failover region (e.g., 403.3 in the case of a "clean" failover), in which case we will retry regions again in priority order. Once we successfully establish a new region for the pk range, we either clear the override or add a new region as the override. What this means is that there is no need to talk to the PPAF "Fault Tolerant Store" (CASPaxos store); this is preferred given that the CASPaxos store is not scaled to handle high traffic loads, and furthermore, retries on a different replica and/or region are cheap. I spoke to Fabian about this as well and we are largely aligned here; one concern he raised is that it will add more "speculative" attempts, which may have an impact on customer RU consumption for example; however, the fact that the customer needs to explicitly opt in to PPAF largely alleviates this.

Best practices such as trying with exponential back-off should be employed. As for retrying writes, we should only do this on very specific error codes where we can be assured that the backend rejected the writes (403.3 for example).

@philipthomas-MSFT

Based on our discussion @mikaelhoral-microsoft: does this apply to just the SDK, or also to the Routing Gateway? Because the Routing Gateway is following the same criteria as we are in looking for specific status/sub status codes.

@mikaelhoral-microsoft

Applies equally to Routing Gateway

@mikaelhoral-microsoft

As discussed, in "No per-partition automatic failover cases" let's point to https://msdata.visualstudio.com/CosmosDB/_workitems/edit/2475521

@mikaelhoral-microsoft

We are tracking SDK backlog items here: https://msdata.visualstudio.com/CosmosDB/_workitems/edit/2484362. These are from PPAF testing conducted by Backend team.
