[FEATURE] Cosmos | Connection endpoint rediscovery (connection state listener) #14697

David-Noble-at-work · 2020-09-01T20:49:21Z

TODO

Feature description
Read latency benchmarks
Collected and analyzed numbers from our prod-03 environment
See read latency benchmark results
Socket utilization numbers
Collected numbers from our prod-03 environment (As part of read latency benchmark run)
See read latency benchmark results
Memory utilization numbers
Collected numbers from our prod-03 environment (As part of read latency benchmark run)
See read latency benchmark results
CPU utilization numbers
Collected numbers from our prod-03 environment (As part of read latency benchmark run)
See read latency benchmark results
Thought experiment that builds confidence that this feature corrects other read latency issues we've been seeing
Reviewed with @xinlian12, @kushagraThapar, @moderakh
Upgrade test results with evidence that this feature functions as expected without performance or reliability issues.
Collecting data from our test-33 environment
Next up: Analyze results

This PR will be ready for merge when we're convinced that the feature is functionally correct and does not introduce performance or reliability issues.

Purpose

The connection endpoint rediscovery feature is designed to reduce and spread out the high-latency spikes that are likely to occur:

During rolling upgrades of a Cosmos instance or
When a backend node is being decommissioned or restarted (e.g., to restart or remove an unhealthy replica)

Our expectation is that this PR will improve reliability without affecting performance under normal circumstances, when there is no rolling upgrade taking place and no servers are being decommissioned or restarted.

Implementation

The GlobalAddressResolver now implements AddressResolverExtension. This new interface extends IAddressResolver and adds methods to support the RntbdConnectionStateListener. The RntbdTransportClient now uses the RntbdConnectionStateListener to build a reverse lookup table that maps physical endpoint addresses to sets of PartitionKeyRangeIdentity instances. New endpoint addresses are added each time that RntbdTransportClient::invokeStoreAsync is called.
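
To make the reverse lookup table concrete, here is a minimal sketch of the idea. It is not the code in this PR: the class name, the method names, and the use of plain strings in place of PartitionKeyRangeIdentity are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: strings stand in for PartitionKeyRangeIdentity instances.
final class EndpointReverseLookupSketch {
    // Physical endpoint address -> identities of the partition key ranges it serves.
    private final Map<URI, Set<String>> rangesByEndpoint = new ConcurrentHashMap<>();

    // Called on every request: remember that this endpoint serves this range.
    void record(URI endpoint, String rangeIdentity) {
        rangesByEndpoint
            .computeIfAbsent(endpoint, key -> ConcurrentHashMap.newKeySet())
            .add(rangeIdentity);
    }

    // Called when an endpoint looks unhealthy: forget it and report which
    // ranges need their addresses resolved again on the next request.
    Set<String> evict(URI endpoint) {
        Set<String> affected = rangesByEndpoint.remove(endpoint);
        return affected == null ? Collections.emptySet() : affected;
    }

    public static void main(String[] args) {
        EndpointReverseLookupSketch lookup = new EndpointReverseLookupSketch();
        URI endpoint = URI.create("rntbd://10.0.0.1:14000"); // made-up address
        lookup.record(endpoint, "collA/0");
        lookup.record(endpoint, "collA/1");
        System.out.println(lookup.evict(endpoint).size());    // prints 2
        System.out.println(lookup.evict(endpoint).isEmpty()); // prints true
    }
}
```

Keeping a set per endpoint means repeated requests against the same range do not grow the table, and eviction returns exactly the ranges whose cached addresses should be dropped.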

When we detect that a server may be down or going down, we remove the affected PartitionKeyRangeIdentity instances from the physical address cache and close the affected RntbdEndpoint. The physical addresses that service an affected PartitionKeyRangeIdentity will then be updated on the next request as if it were the first request targeting the PartitionKeyRangeIdentity. Here are the conditions that RntbdTransportClient::invokeStoreAsync detects:

A graceful shutdown is occurring.

This is indicated by a GoneException with sub-status code zero (SubStatusCodes.UNKNOWN).

The service is down.

This is indicated by a GoneException with non-null cause. We expect an IOException as the cause. Common causes of type IOException are:
- ConnectTimeoutException
  
  Indicates that an attempt to connect to a remote host timed out.
- ConnectException
  
  Indicates that the server can't be reached.
- ClosedChannelException
  
  Indicates a connection dropped.
- IOException
  
  Indicates a connection was reset by the remote host.
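
The two conditions above can be sketched as a small classifier. This is an illustration only: GoneSketch is a hypothetical stand-in for the SDK's GoneException, and checking the cause before the sub-status code is an assumption, since the description does not say which wins when both are present.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.nio.channels.ClosedChannelException;

final class GoneClassifierSketch {
    enum Condition { GRACEFUL_SHUTDOWN, SERVICE_DOWN, OTHER }

    // Hypothetical stand-in for GoneException: a sub-status code and an optional cause.
    static final class GoneSketch extends Exception {
        final int subStatusCode;

        GoneSketch(int subStatusCode, Throwable cause) {
            super("gone", cause);
            this.subStatusCode = subStatusCode;
        }
    }

    static Condition classify(GoneSketch error) {
        // A non-null IOException cause is treated as "the service is down".
        // ConnectException and ClosedChannelException are both IOException subtypes.
        if (error.getCause() instanceof IOException) {
            return Condition.SERVICE_DOWN;
        }
        // Sub-status code zero without a cause is treated as a graceful shutdown.
        if (error.getCause() == null && error.subStatusCode == 0) {
            return Condition.GRACEFUL_SHUTDOWN;
        }
        return Condition.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify(new GoneSketch(0, null)));
        System.out.println(classify(new GoneSketch(0, new ConnectException("refused"))));
        System.out.println(classify(new GoneSketch(0, new ClosedChannelException())));
    }
}
```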

Endpoint closure is aggressive. We close an endpoint (all channels) at the time that a connection issue is detected on any channel. We will consider less aggressive strategies based on what we find in test. We are cognizant of the fact that unnecessary RntbdEndpoint evictions could cause a flood of retries. We will adapt to what we find in test.

This feature is hidden behind a new feature flag: DirectConnectionConfig::enableConnectionEndpointRediscovery (default: false). When we're satisfied with the implementation, we'll change the default to true.

Read Latency Benchmark results

All benchmark results were produced in the prod-03 test environment:

Cosmos DB instance:

cosmos-sdk-core-3
Virtual machine:

Linux (ubuntu 18.04)

Standard F16s_v2 (16 vcpus, 32 GiB memory)

Here is the raw data and a Jupyter notebook for further analysis.

Charts

P95, Document count: 1

P95, Document count: 100,000

P99, Document count: 1

P99, Document count: 100,000

P99.9, Document count: 1

P99.9, Document count: 100,000

Throughput, Document count: 1

Throughput, Document count: 100,000

Socket utilization, Document count: 1

This chart shows socket counts by concurrency level with documentCount 1. Socket counts are summed over the set of observations taken at the end of each of five one minute periods in the benchmark. Socket counts were collected using ss. Divide the socket count by five to get the average count at the end of each period.

Socket utilization, Document count: 100,000

This chart shows socket counts by concurrency level with documentCount 100,000. Socket counts are summed over the set of observations taken at the end of each of five one minute periods in the benchmark. Socket counts were collected using ss. Divide the socket count by five to get the average count at the end of each period.

Virtual memory utilization, Document count: 1

This chart shows virtual memory utilization in mebibytes by concurrency level with documentCount 1. Results are averaged over the set of observations taken at the end of each of five one minute periods in the benchmark. Virtual memory numbers were collected using top.

Virtual memory utilization, Document count: 100,000

This chart shows virtual memory utilization in mebibytes by concurrency level with documentCount 100,000. Results are averaged over the set of observations taken at the end of each of five one minute periods in the benchmark. Virtual memory numbers were collected using top.

CPU utilization, Document count: 1

This chart shows CPU utilization as a percentage by concurrency level with documentCount 1. Results are averaged over the set of observations taken at the end of each of five one minute periods in the benchmark. CPU utilization numbers were collected using top.

CPU utilization, Document count: 100,000

This chart shows CPU utilization as a percentage by concurrency level with documentCount 100,000. Results are averaged over the set of observations taken at the end of each of five one minute periods in the benchmark. CPU utilization numbers were collected using top.

Later: In a follow-on PR

Distinguish between requests that fail because of a connection issue:
- Writes (aka, sends) that do not reach the server
- Writes (aka, sends) that reach the server (i.e., confirmed by TCP ACK)
Rationale: enables retries for create/delete/update operations that are known not to reach the server. TransportException was developed with this in mind. This info can be extracted from the RequestTimeline and stored into the exception.
Ensure that Cosmos client never instantiates GoneException with sub-status code zero.

Consider whether the client should instantiate any CosmosException using a sub-status code that may be returned in a response from the server.
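
As a rough illustration of the first follow-on item, the retry decision could look something like the sketch below. The enum names and the safeToRetry method are hypothetical, and treating reads as always retriable is an assumption; the only point taken from the rationale above is that create/delete/update operations become retriable once we know the send never reached the server.

```java
final class SendOutcomeRetrySketch {
    // Hypothetical: whether the request bytes reached the server, e.g. as
    // derived from the RequestTimeline and stored in the exception.
    enum SendOutcome { NOT_SENT, SENT }

    enum Operation { READ, CREATE, UPDATE, DELETE }

    static boolean safeToRetry(Operation operation, SendOutcome outcome) {
        if (operation == Operation.READ) {
            return true; // assumption: reads are idempotent, so retrying is always safe
        }
        // Writes are retried only when they are known not to have reached the server.
        return outcome == SendOutcome.NOT_SENT;
    }

    public static void main(String[] args) {
        System.out.println(safeToRetry(Operation.CREATE, SendOutcome.NOT_SENT)); // prints true
        System.out.println(safeToRetry(Operation.CREATE, SendOutcome.SENT));     // prints false
    }
}
```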

moderakh

See the description which now includes the substance of this comment.

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/DirectConnectionConfig.java

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ConnectionPolicy.java

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/HttpConstants.java

moderakh · 2020-09-01T22:23:38Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RequestTimeline.java

@@ -156,13 +156,10 @@ public String toString() {
        @JsonIgnore
        private final Duration duration;

-        @JsonSerialize(using = ToStringSerializer.class)
-        private final long durationInMicroSec;


why are we removing this? I found this very helpful.

It is reported, just not stored as a value. Are you saying that it is useful for debugging?

As durationInMicroSec is removed, does the PR change the information available in rntbd request diagnostics?

If so, could you provide a sample of the new rntbd request timeline diagnostics format?

The value will still be logged:
"serializationDiagnosticsContext":{"serializationDiagnosticsList":[{"serializationType":"ITEM_SERIALIZATION","startTimeUTC":"22 Sep 2020 17:40:59.239","endTimeUTC":"22 Sep 2020 17:40:59.239","durationInMicroSec":0},{"serializationType":"PARTITION_KEY_FETCH_SERIALIZATION","startTimeUTC":"22 Sep 2020 17:40:59.317","endTimeUTC":"22 Sep 2020 17:40:59.323","durationInMicroSec":5998}]},"gatewayStatistics":null,"systemInformation":{"usedMemory":"60586 KB","availableMemory":"8309590 KB","systemCpuLoad":"(2020-09-22T17:40:58.729927700Z 20.1%)"}}

moderakh · 2020-09-01T22:44:30Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java

-                    if (collection == null) {
-                        throw new IllegalStateException("Collection cannot be null");
-                    }
+                   .flatMap(documentCollectionResourceResponse -> {


why do we need to change the readMany API?

please undo this.

this is not related to this PR. could you please undo reformatting unrelated code?

code style reformatting in unrelated code makes it hard to follow/discover/validate the logic change.

if there is issue with code style in existing unrelated code. Please don't include the fix for that here. code style change should go to a different PR.

...c/main/java/com/azure/cosmos/implementation/directconnectivity/AddressResolverExtension.java

.../src/main/java/com/azure/cosmos/implementation/directconnectivity/GlobalAddressResolver.java

...c/main/java/com/azure/cosmos/implementation/directconnectivity/AddressResolverExtension.java

kirankumarkolli · 2020-09-03T07:01:58Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/DirectConnectionConfig.java

@@ -79,6 +81,28 @@ public DirectConnectionConfig setConnectTimeout(Duration connectTimeout) {
        return this;
    }

+    /**
+     * Gets a value that indicates whether Direct TCP connection endpoint rediscovery should is enabled.


Expand on what it is from CX perspective.

QQ; Do we want this feature flag as prominent in ConnectionConfig?

It is a good question whether we want this flag as prominent as it is. This PR now enables connection endpoint rediscovery by default. I would prefer not to advertise the feature until we've got more experience with it. Enabling the feature by default is counter to that. One might argue that enabling the feature by default with a highly visible option to turn the feature off is preferred because it's easier to do that in code than by using a less obvious mechanism (such as azure.cosmos.directTcp.defaultOptions).

@kushagraThapar, @moderakh what do you think?

IMO, any feature may have bug on its first release. I prefer if we change the default to enabled later when we are more confident on the behaviour.

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ConnectionPolicy.java

kirankumarkolli · 2020-09-03T07:07:26Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/HttpConstants.java

+        // Sub-status code zero in a response from a service endpoint indicates that a replica is being discontinued or
+        // reconfigured. When endpoint rediscovery is enabled the RntbdTransportClient converts sub-status code zero to
+        // this sub-status code value.
+        public static final int DISCONTINUING_SERVICE = CLIENT_GENERATED + 2;


How do these sub-status codes impact GoneRetryPolicy?

Confirmed by way of the logs on test33 and also by code inspection. This does not affect retry policy.

when rntbd throws GoneException, Gone will reach GoneAndRetryWithRetryPolicy, and it will decide how it should get retried, based on status/code and substatus code.

Please check whether GoneAndRetryWithRetryPolicy shows the desired behaviour for the given status code/sub-status code.

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RequestTimeline.java

kirankumarkolli · 2020-09-03T07:55:34Z

...os/src/main/java/com/azure/cosmos/implementation/directconnectivity/GatewayAddressCache.java

@@ -135,6 +137,16 @@ public GatewayAddressCache(
             DefaultSuboptimalPartitionForceRefreshIntervalInSeconds);
    }

+
+    @Override
+    public void removeAddresses(final PartitionKeyRangeIdentity partitionKeyRangeIdentity) {


Is the remove unconditional or conditional?

I.e., in the case of high-throughput clients, a simple remove might end up removing existing, valid, freshly refreshed entries.
Does the removal need to be based on the stale state instead?

Otherwise it might load the Gateway with more AddressResolution calls, no?

This is very much worth discussion. I'll give this one some thought.
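
One way to picture the difference raised in this thread is the sketch below. It is not taken from GatewayAddressCache: the class, the Entry type, and the string keys are hypothetical. It only contrasts an unconditional remove with one that succeeds solely when the cached entry is still the one the failing request observed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ConditionalRemoveSketch {
    // Hypothetical cache value; equality is by identity because equals is not overridden.
    static final class Entry {
        final String addresses;

        Entry(String addresses) {
            this.addresses = addresses;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    void put(String rangeId, Entry entry) {
        cache.put(rangeId, entry);
    }

    Entry get(String rangeId) {
        return cache.get(rangeId);
    }

    // Unconditional: drops whatever is cached, even an entry refreshed a moment ago.
    void remove(String rangeId) {
        cache.remove(rangeId);
    }

    // Conditional: drops the entry only if it is still the one the failing request used.
    boolean removeIfSame(String rangeId, Entry observed) {
        return cache.remove(rangeId, observed);
    }

    public static void main(String[] args) {
        ConditionalRemoveSketch sketch = new ConditionalRemoveSketch();
        Entry stale = new Entry("old replicas");
        sketch.put("collA/0", stale);
        sketch.put("collA/0", new Entry("new replicas")); // a concurrent refresh wins
        System.out.println(sketch.removeIfSame("collA/0", stale)); // prints false
        System.out.println(sketch.get("collA/0").addresses);       // prints new replicas
    }
}
```

With the conditional form, a request that failed against stale addresses cannot evict an entry that another request has already refreshed, which is the extra Gateway load the question is concerned about.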

...mos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdRequest.java

.../main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdRequestManager.java

...in/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdClientChannelPool.java

David-Noble-at-work · 2020-09-06T22:47:10Z

/azp run java - cosmos - tests

azure-pipelines · 2020-09-06T22:47:21Z

Azure Pipelines successfully started running 1 pipeline(s).

moderakh

There are code style changes in this PR.

The direct stack is complex, and with code style changes mixed in it is not easy to review the logic change.

Could you please revert any code style changes? That should help in validating the code during review.

moderakh · 2020-09-10T19:45:15Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/HttpConstants.java

+        // Sub-status code zero in a response from a service endpoint indicates that a replica is being discontinued or
+        // reconfigured. When endpoint rediscovery is enabled the RntbdTransportClient converts sub-status code zero to
+        // this sub-status code value.
+        public static final int DISCONTINUING_SERVICE = CLIENT_GENERATED + 2;


when rntbd throws GoneException, Gone will reach GoneAndRetryWithRetryPolicy, and it will decide how it should get retried, based on status/code and substatus code.

Please check whether GoneAndRetryWithRetryPolicy shows the desired behaviour for the given status code/sub-status code.

moderakh · 2020-09-10T19:47:37Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RequestTimeline.java

@@ -156,13 +156,10 @@ public String toString() {
        @JsonIgnore
        private final Duration duration;

-        @JsonSerialize(using = ToStringSerializer.class)
-        private final long durationInMicroSec;


As durationInMicroSec is removed, does the PR change the information available in rntbd request diagnostics?

If so, could you provide a sample of the new rntbd request timeline diagnostics format?

moderakh · 2020-09-10T19:49:11Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java

+        this.reactorHttpClient = httpClient();
+        this.globalEndpointManager = new GlobalEndpointManager(asDatabaseAccountManagerInternal(), this.connectionPolicy, /**/configs);
+        this.retryPolicy = new RetryPolicy(this.globalEndpointManager, this.connectionPolicy);
+        this.resetSessionTokenRetryPolicy = retryPolicy;


this should not change. if you have a stale branch please merge master to your branch and undo unrelated change here.

moderakh · 2020-09-10T19:50:57Z

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java

-                    if (collection == null) {
-                        throw new IllegalStateException("Collection cannot be null");
-                    }
+                   .flatMap(documentCollectionResourceResponse -> {


this is not related to this PR. could you please undo reformatting unrelated code?

code style reformatting in unrelated code makes it hard to follow/discover/validate the logic change.

if there is issue with code style in existing unrelated code. Please don't include the fix for that here. code style change should go to a different PR.

moderakh · 2020-09-10T19:59:03Z

...main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdServiceEndpoint.java

@@ -224,7 +247,9 @@ private void releaseToPool(final Channel channel) {
    }

    private void throwIfClosed() {
-        checkState(!this.closed.get(), "%s is closed", this);
+        if (this.closed.get()) {
+            throw new TransportException(lenientFormat("%s is closed", this), new IllegalArgumentException());


Suggested change

throw new TransportException(lenientFormat("%s is closed", this), new IllegalArgumentException());

throw new TransportException(lenientFormat("%s is closed", this), new IllegalStateException());

isn't IllegalStateException more appropriate than IllegalArgumentException in this case?
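
For context on the suggestion, here is a small sketch of the guard with the suggested cause. TransportSketchException is a hypothetical stand-in for the SDK's TransportException, and String.format replaces Guava's lenientFormat so the sketch needs no dependencies.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class ClosedGuardSketch {
    // Hypothetical stand-in for TransportException.
    static final class TransportSketchException extends RuntimeException {
        TransportSketchException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private final AtomicBoolean closed = new AtomicBoolean();

    void close() {
        closed.set(true);
    }

    void throwIfClosed() {
        if (closed.get()) {
            // The object is in the wrong state for the call; no argument is at fault,
            // which is why IllegalStateException fits better as the cause.
            throw new TransportSketchException(
                String.format("%s is closed", this), new IllegalStateException());
        }
    }

    public static void main(String[] args) {
        ClosedGuardSketch endpoint = new ClosedGuardSketch();
        endpoint.throwIfClosed(); // no exception while open
        endpoint.close();
        try {
            endpoint.throwIfClosed();
        } catch (TransportSketchException error) {
            System.out.println(error.getCause() instanceof IllegalStateException); // prints true
        }
    }
}
```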

xinlian12 · 2020-10-12T22:58:10Z

This PR has been split into three PRs:
#16204
#16197
#15991

David Noble added 30 commits November 6, 2019 20:36

Port from v4

84a37ab

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java

d218925

Merge branch 'master' of github.com:David-Noble-at-work/azure-sdk-for-java

11cbbf3

Corrected package misspelling in log4j.properties and removed System.exit from Main.java

05c7e05

Merge branch 'master' of github.com:David-Noble-at-work/azure-sdk-for-java

8dfa3db

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java

e6e71f5

Merge branch 'master' of github.com:David-Noble-at-work/azure-sdk-for-java

e082d81

Merge branch 'master' of github.com:David-Noble-at-work/azure-sdk-for-java

069c822

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java

3ead1cc

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java

669ba8b

Responded to code review comments

735f572

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java

6db330d

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java

48855b5

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java

2367f50

Updated sdk/cosmos/README.md with info on using system properties to modify default Direct TCP options

fb9cac0

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java

1dd708e

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java

203507a

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java

368b10f

Merge branch 'master' of github.com:David-Noble-at-work/azure-sdk-for-java

a0b44a7

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java

d810ba8

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java

64a12af

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java

29f2c73

Checkpoint for safe keeping

e4df526

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java into feature/cosmos-4.2/connection-state-listener

916f99e

Merge branch 'master' of https://github.com/Azure/azure-sdk-for-java into feature/cosmos-4.2/connection-state-listener

ae1b11b

Updated error handling in RntbdTransportClient.invokeStoreAsync, optimized rntbd imports, and disambiguated a couple of diagnostics messages in RntbdClientChannelPool.

019e90b

Merge branch 'master' of github.com:Azure/azure-sdk-for-java into feature/cosmos-4.2/connection-state-listener

ba4d4ed

Ported changes from David-Noble-at-work/issue/cosmos-4.X/#10401

2f0b886

Checkpoint for safe keeping.

ba13859

Checkpoint for safe keeping.

3293e11

ghost added the Cosmos label Sep 1, 2020

moderakh suggested changes Sep 2, 2020

Completed a number of TODOs for this PR

c280476