-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] Cosmos | Connection endpoint rediscovery (connection state listener) #14697
[FEATURE] Cosmos | Connection endpoint rediscovery (connection state listener) #14697
Conversation
…exit from Main.java
…modify default Direct TCP options
…into feature/cosmos-4.2/connection-state-listener
…into feature/cosmos-4.2/connection-state-listener
…mized rntbd imports, and disambiguated a couple of diagnostics messages in RntbdClientChannelPool.
…ture/cosmos-4.2/connection-state-listener
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the description which now includes the substance of this comment.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/DirectConnectionConfig.java
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/DirectConnectionConfig.java
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ConnectionPolicy.java
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/HttpConstants.java
Outdated
Show resolved
Hide resolved
@@ -156,13 +156,10 @@ public String toString() { | |||
@JsonIgnore | |||
private final Duration duration; | |||
|
|||
@JsonSerialize(using = ToStringSerializer.class) | |||
private final long durationInMicroSec; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we removing this? I found this very helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is reported, just not stored as a value. Are you saying that it is useful for debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as durationInMicroSec
is removed, does the PR changes the information available in rntbd request diagnostics?
if so could you provide a sample on new rntbd request timeline diagnostics format, info?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value will still be logged:
"serializationDiagnosticsContext":{"serializationDiagnosticsList":[{"serializationType":"ITEM_SERIALIZATION","startTimeUTC":"22 Sep 2020 17:40:59.239","endTimeUTC":"22 Sep 2020 17:40:59.239","durationInMicroSec":0},{"serializationType":"PARTITION_KEY_FETCH_SERIALIZATION","startTimeUTC":"22 Sep 2020 17:40:59.317","endTimeUTC":"22 Sep 2020 17:40:59.323","durationInMicroSec":5998}]},"gatewayStatistics":null,"systemInformation":{"usedMemory":"60586 KB","availableMemory":"8309590 KB","systemCpuLoad":"(2020-09-22T17:40:58.729927700Z 20.1%)"}}
if (collection == null) { | ||
throw new IllegalStateException("Collection cannot be null"); | ||
} | ||
.flatMap(documentCollectionResourceResponse -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to change the readMany API?
please undo this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not related to this PR. could you please undo reformatting unrelated code?
code style reformatting in unrelated code makes it hard to follow/discover/validate the logic change.
if there is issue with code style in existing unrelated code. Please don't include the fix for that here. code style change should go to a different PR.
...c/main/java/com/azure/cosmos/implementation/directconnectivity/AddressResolverExtension.java
Show resolved
Hide resolved
...c/main/java/com/azure/cosmos/implementation/directconnectivity/AddressResolverExtension.java
Outdated
Show resolved
Hide resolved
.../src/main/java/com/azure/cosmos/implementation/directconnectivity/GlobalAddressResolver.java
Outdated
Show resolved
Hide resolved
...c/main/java/com/azure/cosmos/implementation/directconnectivity/AddressResolverExtension.java
Show resolved
Hide resolved
@@ -79,6 +81,28 @@ public DirectConnectionConfig setConnectTimeout(Duration connectTimeout) { | |||
return this; | |||
} | |||
|
|||
/** | |||
* Gets a value that indicates whether Direct TCP connection endpoint rediscovery should is enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expand on what it is from CX perspective.
QQ; Do we want this feature flag as prominent in ConnectionConfig?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a good question whether we want this flag as prominent as it is. This PR now enables connection endpoint rediscovery by default. I would prefer not to advertise the feature until we've got more experience with it. Enabling the feature by default is counter to that. One might argue that enabling the feature by default with a highly visible option to turn the feature off is preferred because it's easier to do that in code than by using a less obvious mechanism (such as azure.cosmos.directTcp.defaultOptions).
@kushagraThapar, @moderakh what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, any feature may have bug on its first release. I prefer if we change the default to enabled later when we are more confident on the behaviour.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ConnectionPolicy.java
Outdated
Show resolved
Hide resolved
// Sub-status code zero in a response from a service endpoint indicates that a replica is being discontinued or | ||
// reconfigured. When endpoint rediscovery is enabled the RntbdTransportClient converts sub-status code zero to | ||
// this sub-status code value. | ||
public static final int DISCONTINUING_SERVICE = CLIENT_GENERATED + 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How doe these sub-status code impact GoneRetryPolicy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed by way of the logs on test33 and also by code inspection. This does not effect retry policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when rntbd throws GoneException, Gone will reach GoneAndRetryWithRetryPolicy, and it will decide how it should get retried, based on status/code and substatus code.
Please check the behaviour of GoneAndRetryWithRetryPolicy if for the given statuscode/substatus code it does the desired behabour or not.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RequestTimeline.java
Show resolved
Hide resolved
@@ -135,6 +137,16 @@ public GatewayAddressCache( | |||
DefaultSuboptimalPartitionForceRefreshIntervalInSeconds); | |||
} | |||
|
|||
|
|||
@Override | |||
public void removeAddresses(final PartitionKeyRangeIdentity partitionKeyRangeIdentity) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is remove un-conditional or conditional?
I.e. in-case of high throughut clients simple remove might result in removing existing valid refresh entries.
Do they need to be based on the stale state instead?
Otherwise it might load Gateway for more AddressResolution calls, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very much worth discussion. I'll give this one some thought.
...mos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdRequest.java
Outdated
Show resolved
Hide resolved
.../main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdRequestManager.java
Outdated
Show resolved
Hide resolved
...in/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdClientChannelPool.java
Show resolved
Hide resolved
...in/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdClientChannelPool.java
Show resolved
Hide resolved
…ture/cosmos-4.2/connection-state-listener
…ture/cosmos-4.2/connection-state-listener
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are code style change in this PR.
the direct stack is complex, with code style change it is not easy to review the change.
Could you revert any code style change please? that should help in validating the code in the code review
// Sub-status code zero in a response from a service endpoint indicates that a replica is being discontinued or | ||
// reconfigured. When endpoint rediscovery is enabled the RntbdTransportClient converts sub-status code zero to | ||
// this sub-status code value. | ||
public static final int DISCONTINUING_SERVICE = CLIENT_GENERATED + 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when rntbd throws GoneException, Gone will reach GoneAndRetryWithRetryPolicy, and it will decide how it should get retried, based on status/code and substatus code.
Please check the behaviour of GoneAndRetryWithRetryPolicy if for the given statuscode/substatus code it does the desired behabour or not.
@@ -156,13 +156,10 @@ public String toString() { | |||
@JsonIgnore | |||
private final Duration duration; | |||
|
|||
@JsonSerialize(using = ToStringSerializer.class) | |||
private final long durationInMicroSec; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as durationInMicroSec
is removed, does the PR changes the information available in rntbd request diagnostics?
if so could you provide a sample on new rntbd request timeline diagnostics format, info?
this.reactorHttpClient = httpClient(); | ||
this.globalEndpointManager = new GlobalEndpointManager(asDatabaseAccountManagerInternal(), this.connectionPolicy, /**/configs); | ||
this.retryPolicy = new RetryPolicy(this.globalEndpointManager, this.connectionPolicy); | ||
this.resetSessionTokenRetryPolicy = retryPolicy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should not change. if you have a stale branch please merge master to your branch and undo unrelated change here.
if (collection == null) { | ||
throw new IllegalStateException("Collection cannot be null"); | ||
} | ||
.flatMap(documentCollectionResourceResponse -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not related to this PR. could you please undo reformatting unrelated code?
code style reformatting in unrelated code makes it hard to follow/discover/validate the logic change.
if there is issue with code style in existing unrelated code. Please don't include the fix for that here. code style change should go to a different PR.
@@ -224,7 +247,9 @@ private void releaseToPool(final Channel channel) { | |||
} | |||
|
|||
private void throwIfClosed() { | |||
checkState(!this.closed.get(), "%s is closed", this); | |||
if (this.closed.get()) { | |||
throw new TransportException(lenientFormat("%s is closed", this), new IllegalArgumentException()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throw new TransportException(lenientFormat("%s is closed", this), new IllegalArgumentException()); | |
throw new TransportException(lenientFormat("%s is closed", this), new IllegalStateException()); |
isn't IllegalStateException
more appropriate than IllegalArgumentException
in this case?
TODO
Collected and analyzed numbers from our prod-03 environment
See read latency benchmark results
Collected numbers from our prod-03 environment (As part of read latency benchmark run)
See read latency benchmark results
Collected numbers from our prod-03 environment (As part of read latency benchmark run)
See read latency benchmark results
Collected numbers from our prod-03 environment (As part of read latency benchmark run)
See read latency benchmark results
Reviewed with @xinlian12, @kushagraThapar, @moderakh
Collecting data from our test-33 environment
Next up: Analyze results
This PR will be ready for merge when we're convinced that the feature is functionally correct and does not introduce performance or reliability issues.
Purpose
The connection endpoint rediscovery feature is designed to reduce and spread-out high latency spikes that are likely to occur:
Our expectation is that this PR will improve reliability without effecting performance under normal circumstances when there is no rolling upgrade taking place and no servers being decommissioned or restarted.
Implementation
The
GlobalAddressResolver
now implementsAddressResolverExtension
. This new interface extendsIAddressResolver
and adds methods to support theRntbdConnectionStateListener
. TheRntbdTransportClient
now uses theRntbdConnectionStateListener
tobuild a reverse lookup table that maps physical endpoint addresses to sets of
PartitionKeyRangeIdentity
instances. New endpoint addresses are added each time thatRntbdTransportClient::invokeStoreAsync
is called.When we detect that a server may be down or going down, we remove the effected
PartitionKeyRangeIdentity
instances from the physical address cache and close the effectedRntbdEndpoint
. The physical addresses that service an effectedPartitionKeyRangeIdentity
will then be updated on the next request as if it were the first request targeting thePartitionKeyRangeIdentity
. Here are the conditions thatRntbdTransportClient::invokeStoreAsync
detects:A graceful shutdown is occurring.
This is indicated by a
GoneException
with sub-status code zero (SubStatusCodes.UNKOWN
).The service is down.
This is indicated by a
GoneException
with non-null cause. We expect anIOException
as the cause. Common causes of typeIOException
are:ConnectTimeoutException
Indicates that an attempt to connect to a remote host timed out.
ConnectException
Indicates that the server can't be reached.
ClosedChannelException
Indicates a connection dropped.
IOException
Indicates a connection was reset by the remote host.
Endpoint closure is aggressive. We close an endpoint (all channels) at the time that a connection issue is detected on any channel. We will consider less aggressive strategies based on what we find in test. We are cognizant of the fact that unnecessary
RntbdEndpoint
evictions could cause a flood of retries. We will adapt to what we find in test.This feature is hidden behind a new feature flag:
DirectConnectionConfig::enableConnectionEndpointRediscovery
(default: false). When we're satisfied with the implementation, we'll change the default totrue
.Read Latency Benchmark results
All benchmark results were produced in the prod-03 test environment:
Cosmos DB instance:
cosmos-sdk-core-3
Virtual machine:
Linux (ubuntu 18.04)
Standard F16s_v2 (16 vcpus, 32 GiB memory)
Here is the raw data and a Jupyter notebook for further analysis.
Charts
This chart shows socket counts by concurrency level with
documentCount
1
. Socket counts are summed over the set of observations taken at the end of each of five one minute periods in the benchmark. Socket counts were collected using ss. Divide the socket count by five to get the average count at the end of each period.This chart shows socket counts by concurrency level with
documentCount
100,000
. Socket counts are summed over the set of observations taken at the end of each of five one minute periods in the benchmark. Socket counts were collected using ss. Divide the socket count by five to get the average count at the end of each period.This chart shows virtual memory utilization in mebibytes by concurrency level with
documentCount
1
. Results are averaged over the set of observations taken at the end of each of five one minute periods in the benchmark. Virtual memory numbers were collected using top.This chart shows virtual memory utilization in mebibytes by concurrency level with
documentCount
100,000
. Results are averaged over the set of observations taken at the end of each of five one minute periods in the benchmark. Virtual memory numbers were collected using top.This chart shows CPU utilization as a percentage by concurrency level with
documentCount
1
. Results are averaged over the set of observations taken at the end of each of five one minute periods in the benchmark. CPU utilization numbers were collected using top.This chart shows CPU utilization as a percentage by concurrency level with
documentCount
100,000
. Results are averaged over the set of observations taken at the end of each of five one minute periods in the benchmark. CPU utilization numbers were collected using top.Later: In a follow-on PR
Distinguish between requests that fail because of a connection issue:
Rationale: enables retries for create/delete/update operations that are known not to reach the server.
TransportException
was developed with this in mind. This info can be extracted from theRequestTimeline
and stored into the exception.Ensure that Cosmos client never instantiates
GoneException
with sub-status code zero.Consider whether the client should instantiate any
CosmosException
using a sub-status code that may be returned in a response from the server.