Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Cosmos | Connection endpoint rediscovery (connection state listener) #14697

Conversation

David-Noble-at-work
Copy link

@David-Noble-at-work David-Noble-at-work commented Sep 1, 2020

TODO

  • Feature description
  • Read latency benchmarks
    Collected and analyzed numbers from our prod-03 environment
    See read latency benchmark results
  • Socket utilization numbers
    Collected numbers from our prod-03 environment (As part of read latency benchmark run)
    See read latency benchmark results
  • Memory utilization numbers
    Collected numbers from our prod-03 environment (As part of read latency benchmark run)
    See read latency benchmark results
  • CPU utilization numbers
    Collected numbers from our prod-03 environment (As part of read latency benchmark run)
    See read latency benchmark results
  • Thought experiment that builds confidence that this feature corrects other read latency issues we've been seeing
    Reviewed with @xinlian12, @kushagraThapar, @moderakh
  • Upgrade test results with evidence that this feature functions as expected without performance or reliability issues.
    Collecting data from our test-33 environment
    Next up: Analyze results

This PR will be ready for merge when we're convinced that the feature is functionally correct and does not introduce performance or reliability issues.

Purpose

The connection endpoint rediscovery feature is designed to reduce and spread-out high latency spikes that are likely to occur:

  • During rolling upgrades of a Cosmos instance or
  • When a backend node is being decommissioned or restarted (e.g., to restart or remove an unhealthy replica)

Our expectation is that this PR will improve reliability without effecting performance under normal circumstances when there is no rolling upgrade taking place and no servers being decommissioned or restarted.

Implementation

The GlobalAddressResolver now implements AddressResolverExtension. This new interface extends IAddressResolver and adds methods to support the RntbdConnectionStateListener. The RntbdTransportClient now uses the RntbdConnectionStateListener to
build a reverse lookup table that maps physical endpoint addresses to sets of PartitionKeyRangeIdentity instances. New endpoint addresses are added each time that RntbdTransportClient::invokeStoreAsync is called.

When we detect that a server may be down or going down, we remove the effected PartitionKeyRangeIdentity instances from the physical address cache and close the effected RntbdEndpoint. The physical addresses that service an effected PartitionKeyRangeIdentity will then be updated on the next request as if it were the first request targeting the PartitionKeyRangeIdentity. Here are the conditions that RntbdTransportClient::invokeStoreAsync detects:

  • A graceful shutdown is occurring.

    This is indicated by a GoneException with sub-status code zero (SubStatusCodes.UNKOWN).

  • The service is down.

    This is indicated by a GoneException with non-null cause. We expect an IOException as the cause. Common causes of type IOException are:

    • ConnectTimeoutException

      Indicates that an attempt to connect to a remote host timed out.

    • ConnectException

      Indicates that the server can't be reached.

    • ClosedChannelException

      Indicates a connection dropped.

    • IOException

      Indicates a connection was reset by the remote host.

Endpoint closure is aggressive. We close an endpoint (all channels) at the time that a connection issue is detected on any channel. We will consider less aggressive strategies based on what we find in test. We are cognizant of the fact that unnecessary RntbdEndpoint evictions could cause a flood of retries. We will adapt to what we find in test.

This feature is hidden behind a new feature flag: DirectConnectionConfig::enableConnectionEndpointRediscovery (default: false). When we're satisfied with the implementation, we'll change the default to true.

Read Latency Benchmark results

All benchmark results were produced in the prod-03 test environment:

  • Cosmos DB instance:

    cosmos-sdk-core-3

  • Virtual machine:

    Linux (ubuntu 18.04)

    Standard F16s_v2 (16 vcpus, 32 GiB memory)

Here is the raw data and a Jupyter notebook for further analysis.

Charts
  1. P95, Document count: 1

image

  1. P95, Document count: 100,000

latency-p95 documentCount=100000

  1. P99, Document count: 1

latency-p99 documentCount=1

  1. P99, Document count: 100,000

latency-p99 documentCount=100000

  1. P99.9, Document count: 1

latency-p999 documentCount=1

  1. P99.9, Document count: 100,000

latency-p999 documentCount=100000

  1. Throughput, Document count: 1

throughput documentCount=1

  1. Throughput, Document count: 100,000

throughput documentCount=100000

  1. Socket utilization, Document count: 1

This chart shows socket counts by concurrency level with documentCount 1. Socket counts are summed over the set of observations taken at the end of each of five one minute periods in the benchmark. Socket counts were collected using ss. Divide the socket count by five to get the average count at the end of each period.

socket_utilization documentCount=000001

  1. Socket utilization, Document count: 100,000

This chart shows socket counts by concurrency level with documentCount 100,000. Socket counts are summed over the set of observations taken at the end of each of five one minute periods in the benchmark. Socket counts were collected using ss. Divide the socket count by five to get the average count at the end of each period.

socket_utilization documentCount=100000

  1. Virtual memory utilization, Document count: 1

This chart shows virtual memory utilization in mebibytes by concurrency level with documentCount 1. Results are averaged over the set of observations taken at the end of each of five one minute periods in the benchmark. Virtual memory numbers were collected using top.

virtual_memory documentCount=000001

  1. Virtual memory utilization, Document count: 100,000

This chart shows virtual memory utilization in mebibytes by concurrency level with documentCount 100,000. Results are averaged over the set of observations taken at the end of each of five one minute periods in the benchmark. Virtual memory numbers were collected using top.

virtual_memory documentCount=100000

  1. CPU utilization, Document count: 1

This chart shows CPU utilization as a percentage by concurrency level with documentCount 1. Results are averaged over the set of observations taken at the end of each of five one minute periods in the benchmark. CPU utilization numbers were collected using top.

cpu documentCount=000001

  1. CPU utilization, Document count: 100,000

This chart shows CPU utilization as a percentage by concurrency level with documentCount 100,000. Results are averaged over the set of observations taken at the end of each of five one minute periods in the benchmark. CPU utilization numbers were collected using top.

cpu documentCount=100000

Later: In a follow-on PR

  • Distinguish between requests that fail because of a connection issue:

    • Writes (aka, sends) that do not reach the server
    • Writes (aka, sends) that reach the server (i.e., confirmed by TCP ACK)

    Rationale: enables retries for create/delete/update operations that are known not to reach the server. TransportException was developed with this in mind. This info can be extracted from the RequestTimeline and stored into the exception.

  • Ensure that Cosmos client never instantiates GoneException with sub-status code zero.

    Consider whether the client should instantiate any CosmosException using a sub-status code that may be returned in a response from the server.

David Noble added 30 commits November 6, 2019 20:36
…into feature/cosmos-4.2/connection-state-listener
…into feature/cosmos-4.2/connection-state-listener
…mized rntbd imports, and disambiguated a couple of diagnostics messages in RntbdClientChannelPool.
@ghost ghost added the Cosmos label Sep 1, 2020
Copy link
Contributor

@moderakh moderakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the description which now includes the substance of this comment.

@@ -156,13 +156,10 @@ public String toString() {
@JsonIgnore
private final Duration duration;

@JsonSerialize(using = ToStringSerializer.class)
private final long durationInMicroSec;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we removing this? I found this very helpful.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is reported, just not stored as a value. Are you saying that it is useful for debugging?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as durationInMicroSec is removed, does the PR changes the information available in rntbd request diagnostics?

if so could you provide a sample on new rntbd request timeline diagnostics format, info?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value will still be logged:
"serializationDiagnosticsContext":{"serializationDiagnosticsList":[{"serializationType":"ITEM_SERIALIZATION","startTimeUTC":"22 Sep 2020 17:40:59.239","endTimeUTC":"22 Sep 2020 17:40:59.239","durationInMicroSec":0},{"serializationType":"PARTITION_KEY_FETCH_SERIALIZATION","startTimeUTC":"22 Sep 2020 17:40:59.317","endTimeUTC":"22 Sep 2020 17:40:59.323","durationInMicroSec":5998}]},"gatewayStatistics":null,"systemInformation":{"usedMemory":"60586 KB","availableMemory":"8309590 KB","systemCpuLoad":"(2020-09-22T17:40:58.729927700Z 20.1%)"}}

if (collection == null) {
throw new IllegalStateException("Collection cannot be null");
}
.flatMap(documentCollectionResourceResponse -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to change the readMany API?

please undo this.

Copy link
Contributor

@moderakh moderakh Sep 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not related to this PR. could you please undo reformatting unrelated code?

code style reformatting in unrelated code makes it hard to follow/discover/validate the logic change.

if there is issue with code style in existing unrelated code. Please don't include the fix for that here. code style change should go to a different PR.

@@ -79,6 +81,28 @@ public DirectConnectionConfig setConnectTimeout(Duration connectTimeout) {
return this;
}

/**
* Gets a value that indicates whether Direct TCP connection endpoint rediscovery should is enabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expand on what it is from CX perspective.

QQ; Do we want this feature flag as prominent in ConnectionConfig?

Copy link
Author

@David-Noble-at-work David-Noble-at-work Sep 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a good question whether we want this flag as prominent as it is. This PR now enables connection endpoint rediscovery by default. I would prefer not to advertise the feature until we've got more experience with it. Enabling the feature by default is counter to that. One might argue that enabling the feature by default with a highly visible option to turn the feature off is preferred because it's easier to do that in code than by using a less obvious mechanism (such as azure.cosmos.directTcp.defaultOptions).

@kushagraThapar, @moderakh what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, any feature may have bug on its first release. I prefer if we change the default to enabled later when we are more confident on the behaviour.

// Sub-status code zero in a response from a service endpoint indicates that a replica is being discontinued or
// reconfigured. When endpoint rediscovery is enabled the RntbdTransportClient converts sub-status code zero to
// this sub-status code value.
public static final int DISCONTINUING_SERVICE = CLIENT_GENERATED + 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How doe these sub-status code impact GoneRetryPolicy?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed by way of the logs on test33 and also by code inspection. This does not effect retry policy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when rntbd throws GoneException, Gone will reach GoneAndRetryWithRetryPolicy, and it will decide how it should get retried, based on status/code and substatus code.

Please check the behaviour of GoneAndRetryWithRetryPolicy if for the given statuscode/substatus code it does the desired behabour or not.

@@ -135,6 +137,16 @@ public GatewayAddressCache(
DefaultSuboptimalPartitionForceRefreshIntervalInSeconds);
}


@Override
public void removeAddresses(final PartitionKeyRangeIdentity partitionKeyRangeIdentity) {
Copy link
Member

@kirankumarkolli kirankumarkolli Sep 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is remove un-conditional or conditional?

I.e. in-case of high throughut clients simple remove might result in removing existing valid refresh entries.
Do they need to be based on the stale state instead?

Otherwise it might load Gateway for more AddressResolution calls, no?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very much worth discussion. I'll give this one some thought.

@David-Noble-at-work
Copy link
Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@moderakh moderakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are code style change in this PR.

the direct stack is complex, with code style change it is not easy to review the change.

Could you revert any code style change please? that should help in validating the code in the code review

// Sub-status code zero in a response from a service endpoint indicates that a replica is being discontinued or
// reconfigured. When endpoint rediscovery is enabled the RntbdTransportClient converts sub-status code zero to
// this sub-status code value.
public static final int DISCONTINUING_SERVICE = CLIENT_GENERATED + 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when rntbd throws GoneException, Gone will reach GoneAndRetryWithRetryPolicy, and it will decide how it should get retried, based on status/code and substatus code.

Please check the behaviour of GoneAndRetryWithRetryPolicy if for the given statuscode/substatus code it does the desired behabour or not.

@@ -156,13 +156,10 @@ public String toString() {
@JsonIgnore
private final Duration duration;

@JsonSerialize(using = ToStringSerializer.class)
private final long durationInMicroSec;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as durationInMicroSec is removed, does the PR changes the information available in rntbd request diagnostics?

if so could you provide a sample on new rntbd request timeline diagnostics format, info?

this.reactorHttpClient = httpClient();
this.globalEndpointManager = new GlobalEndpointManager(asDatabaseAccountManagerInternal(), this.connectionPolicy, /**/configs);
this.retryPolicy = new RetryPolicy(this.globalEndpointManager, this.connectionPolicy);
this.resetSessionTokenRetryPolicy = retryPolicy;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not change. if you have a stale branch please merge master to your branch and undo unrelated change here.

if (collection == null) {
throw new IllegalStateException("Collection cannot be null");
}
.flatMap(documentCollectionResourceResponse -> {
Copy link
Contributor

@moderakh moderakh Sep 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not related to this PR. could you please undo reformatting unrelated code?

code style reformatting in unrelated code makes it hard to follow/discover/validate the logic change.

if there is issue with code style in existing unrelated code. Please don't include the fix for that here. code style change should go to a different PR.

@@ -224,7 +247,9 @@ private void releaseToPool(final Channel channel) {
}

private void throwIfClosed() {
checkState(!this.closed.get(), "%s is closed", this);
if (this.closed.get()) {
throw new TransportException(lenientFormat("%s is closed", this), new IllegalArgumentException());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
throw new TransportException(lenientFormat("%s is closed", this), new IllegalArgumentException());
throw new TransportException(lenientFormat("%s is closed", this), new IllegalStateException());

isn't IllegalStateException more appropriate than IllegalArgumentException in this case?

@xinlian12
Copy link
Member

This PR has been split into three PRs:
#16204
#16197
#15991

@xinlian12 xinlian12 closed this Oct 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants