Improve CCS network faults detection #34405
Comments
Pinging @elastic/es-search-aggs |
I want to leave a comment about this option. Currently a ping is 6 bytes ('E','S',-1). There is no response.
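For illustration, here is a minimal sketch of what such a 6-byte ping could look like on the wire. It assumes the layout is the two marker bytes `'E'` and `'S'` followed by `-1` written as a 4-byte big-endian integer; the class and method names are hypothetical and not taken from the Elasticsearch code base.

```java
import java.nio.ByteBuffer;

public class PingFrame {
    // Assumption: a ping is 'E', 'S', then -1 as a 4-byte big-endian int (6 bytes total).
    static byte[] encode() {
        ByteBuffer buf = ByteBuffer.allocate(6); // ByteBuffer is big-endian by default
        buf.put((byte) 'E').put((byte) 'S').putInt(-1);
        return buf.array();
    }

    // Returns true only if the frame is exactly the 6-byte ping described above.
    static boolean isPing(byte[] frame) {
        if (frame.length != 6) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(frame);
        return buf.get() == 'E' && buf.get() == 'S' && buf.getInt() == -1;
    }

    public static void main(String[] args) {
        byte[] ping = encode();
        System.out.println(ping.length + " bytes, isPing=" + isPing(ping));
    }
}
```

Because nothing is sent back, the sender only learns about a dead connection when the TCP layer itself gives up.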
I think that Option |
I think I've most often seen this where a firewall decides a connection is idle and drops it, black holing any future traffic on that connection. In this case, both TCP keepalives and transport pings are sufficient to prevent it, because the firewall will see packets in both directions (the message itself, and the corresponding ACK) and not drop the connection. The issue we often face with TCP keepalives is that security policy sometimes oddly dictates that keepalives may not be set below the default of 2h (on Linux) and the firewall drops connections after 1h, which is why we have to use transport pings too. We recommend properly configured TCP keepalives in the docs but do not spell out that this applies to cross-cluster connections too. In any case if a keepalive or a ping doesn't go through then we will receive a notification a short while later, regardless of whether there's any application-level response, because we can rely on TCP retrying a few times until it receives an ACK and then eventually closing the connection, to which we react. |
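As a rough sketch of what per-connection keepalive tuning looks like from Java, the following enables `SO_KEEPALIVE` and, where the platform supports it, sets the idle time, probe interval and probe count through `jdk.net.ExtendedSocketOptions` (Java 11+). The class name and the values in `main` are illustrative assumptions, not recommendations from this thread.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketOption;
import java.net.StandardSocketOptions;
import java.util.Set;
import jdk.net.ExtendedSocketOptions;

public class KeepAliveConfig {
    // Enables TCP keepalives and applies the finer-grained options only if supported.
    static void configure(Socket socket, int idleSeconds, int intervalSeconds, int probeCount)
            throws IOException {
        socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        Set<SocketOption<?>> supported = socket.supportedOptions();
        if (supported.contains(ExtendedSocketOptions.TCP_KEEPIDLE)) {
            socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSeconds);
        }
        if (supported.contains(ExtendedSocketOptions.TCP_KEEPINTERVAL)) {
            socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSeconds);
        }
        if (supported.contains(ExtendedSocketOptions.TCP_KEEPCOUNT)) {
            socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, probeCount);
        }
    }

    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket()) {
            // Illustrative values: first probe after 5 minutes idle, well below a 1h firewall timeout.
            configure(socket, 300, 60, 5);
            System.out.println("keepalive enabled: " + socket.getKeepAlive());
        }
    }
}
```

The idle time has to be shorter than the firewall's idle timeout for the keepalives to be of any use, which is exactly what a policy forbidding values below 2h prevents.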
I meant @javanna's |
I agree that the application-level pings will help keep connections alive, but sadly we will still not detect network disconnections quickly enough, see https://discuss.elastic.co/t/elasticsearch-ccs-client-get-timeout-when-remote-cluster-is-isolated-by-firewall/152019/6 . Does that make sense to you as well, @DaveCTurner? |
I dug into the details a bit further and was surprised by Linux's default behaviour here. On Linux the number of retransmissions for a TCP packet before the connection is dropped is On Linux there's also the per-connection One advantage of dealing with this at the TCP layer is that it's really just looking at the connection, and is insensitive to things like a GC pause on the remote node. However if we want to avoid this kind of TCP tuning then adapting the application-level pings to be bidirectional does seem like a better approach. One possible alternative is to follow STOMP's model and negotiate bidirectional pings in the handshake instead of having a strict request/response model. |
Pinging @elastic/es-distributed |
I did think of this as well, but it might be tricky for us to implement since we don't necessarily have a bi-directional connection here. What we can do is drive the heartbeat from one side of the connection without necessarily waiting for a response. In that case we can just send back a ping every time we receive one. This way we can implement it at the top level in TcpTransport and don't have to break all our abstractions. If we then haven't received a ping from a node for X ms, we can still declare the connection dead. |
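The idea in this comment could be sketched roughly as follows. This is a hypothetical helper, not code from TcpTransport: it records when the last ping arrived, tells the caller to echo a ping back, and reports the connection as dead once no ping has been seen for longer than a configured timeout (the "X ms" above). Timestamps are passed in explicitly so the logic is easy to test.

```java
public class PingLiveness {
    private final long timeoutMillis;
    private long lastPingReceivedMillis;

    // createdAtMillis counts as the first sign of life so that a fresh connection is not dead.
    PingLiveness(long timeoutMillis, long createdAtMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastPingReceivedMillis = createdAtMillis;
    }

    // Called when a ping arrives; returning true means "send a ping back".
    boolean onPingReceived(long nowMillis) {
        lastPingReceivedMillis = nowMillis;
        return true;
    }

    // The connection is considered dead if no ping arrived within the timeout.
    boolean isDead(long nowMillis) {
        return nowMillis - lastPingReceivedMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        PingLiveness liveness = new PingLiveness(5_000, 0);
        liveness.onPingReceived(4_000);
        System.out.println("dead at 8s: " + liveness.isDead(8_000));
        System.out.println("dead at 10s: " + liveness.isDead(10_000));
    }
}
```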
When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. With this commit we enable application-level pings by default every 5 seconds from CCS nodes to the selected remote nodes. We also add a setting called `cluster.remote.ping_schedule` that allows changing the interval and potentially disabling application-level pings, similar to `transport.ping_schedule`, but the new setting only affects connections made to remote clusters. Relates to elastic#34405
When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. With this commit we allow enabling application-level pings specifically from CCS nodes to the selected remote nodes through the new setting `cluster.remote.${clusterAlias}.transport.ping_schedule`. The new setting is similar to `transport.ping_schedule` but it does not affect intra-cluster communication: pings are disabled by default and are only sent to a specific remote cluster when explicitly enabled for it. Relates to #34405
@javanna I see you have been working on adding support to enable pings on CC connections only. Good stuff. Regarding enabling it by default I wonder if we can be a bit smarter than we are today and enable it by default when the connections haven't been used for like a minute or two and then go and make sure they get at least a ping every 60 seconds. The defaults are quite low and we might be able to prevent most of the issues if we do it once in a while? @jasontedor WDYT |
This is related to #34405 and a follow-up to #34753. It makes a number of changes to our current keepalive pings. The ping interval configuration is moved to the ConnectionProfile. The server channel now responds to pings. This makes the keepalive pings bidirectional. On the client-side, the pings can now be optimized away. What this means is that if the channel has received a message or sent a message since the last pinging round, the ping is not sent for this round.
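The "optimized away" rule described here can be sketched as a small piece of bookkeeping. This is a hypothetical illustration, not the actual implementation: the channel records its last send or receive, and each pinging round only emits a ping if nothing was sent or received since the previous round.

```java
public class KeepAliveRound {
    private long lastActivityMillis = Long.MIN_VALUE; // no traffic seen yet
    private long lastRoundMillis;

    KeepAliveRound(long createdAtMillis) {
        this.lastRoundMillis = createdAtMillis;
    }

    void onMessageSent(long nowMillis) {
        lastActivityMillis = nowMillis;
    }

    void onMessageReceived(long nowMillis) {
        lastActivityMillis = nowMillis;
    }

    // Called once per pinging round; returns true if a ping should be sent this round.
    boolean shouldPing(long nowMillis) {
        boolean activeSinceLastRound = lastActivityMillis >= lastRoundMillis;
        lastRoundMillis = nowMillis;
        return !activeSinceLastRound;
    }

    public static void main(String[] args) {
        KeepAliveRound round = new KeepAliveRound(0);
        round.onMessageSent(10);
        System.out.println("ping in round 1: " + round.shouldPing(100)); // traffic seen, skip
        System.out.println("ping in round 2: " + round.shouldPing(200)); // idle, ping
    }
}
```

Regular traffic already proves the connection is alive, so pings are only needed on idle channels.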
Adds documentation suggesting reducing `tcp_retries2` on Linux to detect network partitions more quickly. Relates elastic#34405
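To see why the retransmission count matters so much, here is a back-of-the-envelope model of the worst-case time before TCP gives up on an unacknowledged packet. It assumes exponential backoff starting at a 200 ms retransmission timeout and capped at 120 s, which is the model behind the hypothetical figure quoted in the Linux kernel documentation; real timeouts depend on the measured round-trip time, so treat the result as an order of magnitude only. The class and method names are made up for this example.

```java
public class RetransmitTimeout {
    // Sums the waits for the initial transmission plus `retries` retransmissions,
    // doubling the timeout each time up to maxRtoMillis.
    static long worstCaseMillis(int retries, long initialRtoMillis, long maxRtoMillis) {
        long total = 0;
        long rto = initialRtoMillis;
        for (int i = 0; i <= retries; i++) {
            total += rto;
            rto = Math.min(rto * 2, maxRtoMillis);
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed constants: 200 ms initial RTO, 120 s cap.
        long millis = worstCaseMillis(15, 200, 120_000);
        System.out.println("tcp_retries2=15 -> " + millis / 1000.0 + " s");
    }
}
```

Under these assumptions a value of 15 gives roughly 924.6 seconds, i.e. more than 15 minutes before a black-holed connection is reported as failed, and lowering the value shortens that window considerably.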
I'm marking this for team discussion again because I think the two remaining items are effectively solved by other means and therefore we can close it. We enable TCP keepalives by default with a sensible interval and recommend users set |
I have one further question in this area regarding how elasticsearch/server/src/main/java/org/elasticsearch/transport/RemoteClusterAwareClient.java Lines 36 to 37 in a92a647
If the remote cluster is in a network black hole then this will trigger a new connection attempt that then times out after 30s (I think). Furthermore we may end up waiting for a multiple of the timeout period if we're sending requests in sequence to the remote cluster, as I believe we do in CCS. We should consider whether/when to fail requests quickly if the remote is disconnected rather than waiting on another connection attempt. If the |
Today the docs on setting `tcp_retries2` only talk about intra-cluster connections, but in fact this setting is equally important to the resilience of remote cluster connections too. This commit rewords these docs to cover both cases. Relates elastic#34405
We (the @elastic/es-distributed team) discussed this and agreed that this is good to close for the reasons I gave above. We were a little time-constrained so I'll leave it open for another couple of days in case anyone has further thoughts to share async. I opened #74773 to ask the search team to consider the remaining question I raised in my previous message. |
When using Cross Cluster Search and a remote cluster becomes unreachable due to network issues, it takes the CCS node a while to detect that. This is particularly bad if a firewall in between drops connections, as it makes CCS searches hang even though TCP connections can still be initiated from the CCS node to the remote cluster nodes on port 9300.
This has been reported on our forum and also on #30247 .
The following are changes that we could make to improve this:
timeout parameter be honored? #32678)