Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve CCS network faults detection #34405

Closed
3 of 5 tasks
javanna opened this issue Oct 11, 2018 · 13 comments
Closed
3 of 5 tasks

Improve CCS network faults detection #34405

javanna opened this issue Oct 11, 2018 · 13 comments
Labels
:Distributed Coordination/Network Http and internode communication implementations Meta Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. team-discuss

Comments

@javanna
Copy link
Member

javanna commented Oct 11, 2018

When using Cross Cluster Search and a remote cluster becomes unreachable due to network issues, it takes the CCS node a while to detect that. This seems particularly bad if a firewall in-between drops connections, as it makes CCS searches hang, despite TCP connections can be initiated from the CCS node to the remote cluster nodes on port 9300.

This has been reported on our forum and also on #30247 .

The following are changes that we could make to improve this:

@javanna javanna added >bug :Search/Search Search-related issues that do not fall into other categories labels Oct 11, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@Tim-Brooks
Copy link
Contributor

Tim-Brooks commented Oct 11, 2018

possibly adapt the transport ping to come back with a response and support a timeout

I want to leave a comment about this option. Currently a ping is 6 bytes ('E','S',-1). There is no response.

  1. We could move this to be an actual transport message (say TransportPing). This makes the message larger as there are a number of things required for a whole transport message.

  2. We could also expand the unique nature of a ping (say 'E', 'S', -1, and then a random or incrementing int or long as a ping identifier). This would allow the receiving node to echo the ping back to the sending node. We could make it so that the client connection sends pings and the server connection responds. This will still test writes in both directions. If it were to timeout, the client would close the connection.

  3. We could make some type of CCS specific ping (similar to how I think we have a ZenPing). But then we probably need one for CCR eventually.

@Tim-Brooks
Copy link
Contributor

Tim-Brooks commented Oct 11, 2018

I think that #3 is the worst option.

Option #2 is okay, but it is a little gross as -1 technically is supposed to mean that the message has no length (that is how we identify pings). But now there would be an additional int.

@DaveCTurner
Copy link
Contributor

I think #1 would be a fine solution here.

I've most often seen this where a firewall decides a connection is idle and drops it, black holing any future traffic on that connection. In this case, both TCP keepalives and transport pings are sufficient to prevent it, because the firewall will see packets in both directions (the message itself, and the corresponding ACK) and not drop the connection. The issue we often face with TCP keepalives is that security policy sometimes oddly dictates that keepalives may not be set below the default of 2h (on Linux) and the firewall drops connections after 1h, which is why we have to use transport pings too. We recommend properly configured TCP keepalives in the docs but do not spell out that this applies to cross-cluster connections too.

In any case if a keepalive or a ping doesn't go through then we will receive a notification a short while later, regardless of whether there's any application-level response, because we can rely on TCP retrying a few times until it receives an ACK and then eventually closing the connection, to which we react.

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Oct 12, 2018

I think #1 would be a fine solution here.

I meant @javanna's #1, i.e., using application-level pings.

@javanna
Copy link
Member Author

javanna commented Oct 12, 2018

I agree that the application-level pings will help keeping connections alive, but sadly we will still not detect network disconnections quickly enough, see https://discuss.elastic.co/t/elasticsearch-ccs-client-get-timeout-when-remote-cluster-is-isolated-by-firewall/152019/6 . Does that make sense to you as well @DaveCTurner ?

@DaveCTurner
Copy link
Contributor

I dug into the details a bit further and was surprised by Linux's default behaviour here.

On Linux the number of retransmissions for a TCP packet before the connection is dropped is /proc/sys/net/ipv4/tcp_retries2 which defaults to 15. Retries start quickly but back off exponentially so 15 retries take a little over 15 minutes. I'm surprised to learn it's this long, and 15 retries seems unreasonably many here. I think I must have tested things on systems on which this had been lowered from the default; I tried reducing it to 6 and observed connections being detected as dead and dropped after ~30sec. However this is a system-wide parameter, so it may not be desirable to change it, and we'd certainly need to think carefully before issuing a general recommendation about this. There are already places that suggest reducing this to 3 in a HA situation and they are not alone despite that RFC 1122 suggests it should be ≥8. To me 3 feels too low for connections that span any appreciable geographical distance, but might be fine within a single datacentre.

On Linux there's also the per-connection TCP_USER_TIMEOUT but this isn't portable (see https://bugs.openjdk.java.net/browse/JDK-8038145). I haven't looked into the alternatives available on Windows.

One advantage of dealing with this at the TCP layer is that it's really just looking at the connection, and is insensitive to things like a GC pause on the remote node. However if we want to avoid this kind of TCP tuning then adapting the application-level pings to be bidirectional does seem like a better approach. One possible alternative is to follow STOMP's model and negotiate bidirectional pings in the handshake instead of having a strict request/response model.

@jasontedor jasontedor added :Distributed Coordination/Network Http and internode communication implementations and removed :Search/Search Search-related issues that do not fall into other categories labels Oct 12, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@s1monw
Copy link
Contributor

s1monw commented Oct 13, 2018

One possible alternative is to follow STOMP's model and negotiate bidirectional pings in the handshake instead of having a strict request/response model.

I did think of this as well but it might be tricky for us since we don't necessarily have a bi-directional connection here so implementing this would be quite tricky. What we can do is drive the heart-beat from one side of the connection and don't necessarily wait for a response. In such a case we can just send back a ping every time we receive one. This way we can implement it on the top level in TcpTransport and don't have to break all our abstractions. if we then didn't receive a ping from a node for X ms we can still declare the connection dead.

javanna added a commit to javanna/elasticsearch that referenced this issue Oct 23, 2018
When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. With this commit we enable application-level pings by default every 5 seconds from CCS nodes to the selected remote nodes. We also add a setting called `cluster.remote.ping_schedule` that allows to change the interval and potentially disable application-level pings, similar to `transport.ping_schedule` but the new setting only affects connections made to remote clusters.

Relates to elastic#34405
javanna added a commit that referenced this issue Oct 31, 2018
When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. With this commit we allow to enable application-level pings specifically from CCS nodes to the selected remote nodes through the new setting `cluster.remote.${clusterAlias}.transport.ping_schedule`.  The new setting is similar `transport.ping_schedule` but it does not affect intra-cluster communication, pings are only sent to specific remote cluster when specifically enabled, as they are disabled by default.

Relates to #34405
javanna added a commit that referenced this issue Nov 1, 2018
When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. With this commit we allow to enable application-level pings specifically from CCS nodes to the selected remote nodes through the new setting `cluster.remote.${clusterAlias}.transport.ping_schedule`.  The new setting is similar `transport.ping_schedule` but it does not affect intra-cluster communication, pings are only sent to specific remote cluster when specifically enabled, as they are disabled by default.

Relates to #34405
@javanna javanna added Meta and removed >bug labels Nov 6, 2018
@s1monw
Copy link
Contributor

s1monw commented Nov 6, 2018

@javanna I see you have been working on adding support to enable pings on CC connections only. Good stuff. Regarding enabling it by default I wonder if we can be a bit smarter than we are today and enable it by default when the connections haven't been used for like a minute or two and then go and make sure they get at least a ping every 60 seconds. The defaults are quite low and we might be able to prevent most of the issues if we do it once in a while? @jasontedor WDYT

Tim-Brooks added a commit that referenced this issue Nov 29, 2018
This is related to #34405 and a follow-up to #34753. It makes a number
of changes to our current keepalive pings.

The ping interval configuration is moved to the ConnectionProfile.

The server channel now responds to pings. This makes the keepalive
pings bidirectional.

On the client-side, the pings can now be optimized away. What this
means is that if the channel has received a message or sent a message
since the last pinging round, the ping is not sent for this round.
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Nov 29, 2018
This is related to elastic#34405 and a follow-up to elastic#34753. It makes a number
of changes to our current keepalive pings.

The ping interval configuration is moved to the ConnectionProfile.

The server channel now responds to pings. This makes the keepalive
pings bidirectional.

On the client-side, the pings can now be optimized away. What this
means is that if the channel has received a message or sent a message
since the last pinging round, the ping is not sent for this round.
Tim-Brooks added a commit that referenced this issue Nov 29, 2018
This is related to #34405 and a follow-up to #34753. It makes a number
of changes to our current keepalive pings.

The ping interval configuration is moved to the ConnectionProfile.

The server channel now responds to pings. This makes the keepalive
pings bidirectional.

On the client-side, the pings can now be optimized away. What this
means is that if the channel has received a message or sent a message
since the last pinging round, the ping is not sent for this round.
@rjernst rjernst added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 4, 2020
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jul 8, 2020
Adds documentation suggesting reducing `tcp_retries2` on Linux to detect
network partitions more quickly.

Relates elastic#34405
DaveCTurner added a commit that referenced this issue Jul 27, 2020
Adds documentation suggesting reducing `tcp_retries2` on Linux to detect
network partitions more quickly.

Relates #34405
DaveCTurner added a commit that referenced this issue Jul 27, 2020
Adds documentation suggesting reducing `tcp_retries2` on Linux to detect
network partitions more quickly.

Relates #34405
DaveCTurner added a commit that referenced this issue Jul 27, 2020
Adds documentation suggesting reducing `tcp_retries2` on Linux to detect
network partitions more quickly.

Relates #34405
@DaveCTurner
Copy link
Contributor

I'm marking this for team discussion again because I think the two remaining items are effectively solved by other means and therefore we can close it. We enable TCP keepalives by default with a sensible interval and recommend users set tcp_retries2 to a sensible value as well. Together these TCP-level timeouts will have the effect of keeping connections open if possible, and detecting a dropped connection relatively quickly, obviating the need for application-level effort to do the same in most cases.

@DaveCTurner
Copy link
Contributor

I have one further question in this area regarding how RemoteClusterAwareClient#doExecute always calls RemoteClusterService#ensureConnected:

void doExecute(ActionType<Response> action, Request request, ActionListener<Response> listener) {
remoteClusterService.ensureConnected(clusterAlias, ActionListener.wrap(v -> {

If the remote cluster is in a network black hole then this will trigger a new connection attempt that then times out after 30s (I think). Furthermore we may end up waiting for a multiple of the timeout period if we're sending requests in sequence to the remote cluster, as I believe we do in CCS. We should consider whether/when to fail requests quickly if the remote is disconnected rather than waiting on another connection attempt. If the skip_unavailable flag is set that seems like the right thing to do at least.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 30, 2021
Today the docs on setting `tcp_retries2` only talk about intra-cluster
connections, but in fact this setting is equally important to the
resilience of remote cluster connections too. This commit rewords these
docs to cover both cases.

Relates elastic#34405
@DaveCTurner
Copy link
Contributor

We (the @elastic/es-distributed team) discussed this and agreed that this is good to close for the reasons I gave above. We were a little time-constrained so I'll leave it open for another couple of days in case anyone has further thoughts to share async.

I opened #74773 to ask the search team to consider the remaining question I raised in my previous message.

DaveCTurner added a commit that referenced this issue Jul 5, 2021
Today the docs on setting `tcp_retries2` only talk about intra-cluster
connections, but in fact this setting is equally important to the
resilience of remote cluster connections too. This commit rewords these
docs to cover both cases.

Relates #34405
DaveCTurner added a commit that referenced this issue Jul 5, 2021
Today the docs on setting `tcp_retries2` only talk about intra-cluster
connections, but in fact this setting is equally important to the
resilience of remote cluster connections too. This commit rewords these
docs to cover both cases.

Relates #34405
DaveCTurner added a commit that referenced this issue Jul 5, 2021
Today the docs on setting `tcp_retries2` only talk about intra-cluster
connections, but in fact this setting is equally important to the
resilience of remote cluster connections too. This commit rewords these
docs to cover both cases.

Relates #34405
DaveCTurner added a commit that referenced this issue Jul 5, 2021
Today the docs on setting `tcp_retries2` only talk about intra-cluster
connections, but in fact this setting is equally important to the
resilience of remote cluster connections too. This commit rewords these
docs to cover both cases.

Relates #34405
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Network Http and internode communication implementations Meta Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. team-discuss
Projects
None yet
Development

No branches or pull requests

7 participants