
Cluster state is serialized on the network thread #39806

Closed
DaveCTurner opened this issue Mar 7, 2019 · 5 comments
Labels
>bug, :Distributed Coordination/Cluster Coordination (cluster formation and cluster state publication, including cluster membership and fault detection)

Comments

@DaveCTurner
Contributor

Today when we respond to a cluster:monitor/state action we serialize the current cluster state:

PublicationTransportHandler.serializeFullClusterState(currentState, Version.CURRENT).length(), false));

However, we do so on the network thread:

// very lightweight operation in memory, no need to fork to a thread
return ThreadPool.Names.SAME;

This is a problem if the cluster state is large, because it consumes a network thread doing non-network things. It's particularly a problem when using lots of transport clients with sniffing enabled, since each client will, by default, call this action every 5 seconds.

Yet we do this serialization to get the compressed size of the cluster state so that we can report it (see #3415). Many clients will not care about this. Maybe we can omit it? Alternatively, note that it only depends on the cluster state so maybe we can cache it and invalidate the cache on each cluster state update.
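A minimal sketch of that caching idea, assuming the serialization helper quoted above is reachable from the cache (class, field names, and package paths here are assumptions for illustration, not the actual Elasticsearch code): keep the compressed size keyed by the cluster state version and recompute only when the version changes.

```java
import java.io.IOException;

import org.elasticsearch.Version;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.coordination.PublicationTransportHandler;

// Hypothetical cache: recompute the compressed size only when the cluster state version changes.
final class CompressedStateSizeCache {
    private long cachedVersion = -1L;
    private int cachedSizeInBytes = -1;

    synchronized int sizeInBytes(ClusterState state) throws IOException {
        if (state.version() != cachedVersion) {
            // Same call as the one quoted above, but its result is now reused across requests.
            cachedSizeInBytes = PublicationTransportHandler
                .serializeFullClusterState(state, Version.CURRENT).length();
            cachedVersion = state.version();
        }
        return cachedSizeInBytes;
    }
}
```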

@DaveCTurner added the >bug and :Distributed Coordination/Cluster Coordination labels Mar 7, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@original-brownbear
Member

Alternatively, note that it only depends on the cluster state so maybe we can cache it and invalidate the cache on each cluster state update.

This sounds reasonable. Even if the state is 100MB or so, it shouldn't take that much time to serialise it (I hope at least :)), so it seems like a reasonable improvement.

Though I wonder: why not just run this serialization on the generic thread pool, just to be safe (we could still cache to save some CPU)? That would seem like the cleaner approach, and the cost of the context switches is probably well worth not having to see this in profiling on the transport thread when trying to work out where the next slowness or blocker lives :)
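For reference, the change being suggested here would be roughly a one-line switch in the action's executor() override, shown as a sketch rather than the actual patch:

```java
@Override
protected String executor() {
    // Fork the potentially expensive serialization off the network thread.
    return ThreadPool.Names.GENERIC;
}
```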

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 8, 2019
Today we compute the size of the compressed cluster state for each
`cluster:monitor/state` action. We do so by serializing the whole cluster state
and compressing it, and this happens on the network thread. This calculation
can be rather expensive if the cluster state is large, and these actions can be
rather frequent particularly if there are sniffing transport clients in use.
Also the calculation is a simple function of the cluster state, so there is
a lot of duplicated work here.

This change introduces a small cache for this size computation to avoid all
this duplicated work, and to avoid blocking network threads.

Fixes elastic#39806.
@DaveCTurner
Copy link
Contributor Author

@original-brownbear I was missing knowledge of PlainListenableActionFuture; it turns out a cache is quite easy to implement once I discovered that. I opened #39827, which avoids any context switches if the computation is already done, but which runs the serialisation on the generic thread pool if not.
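The single-flight pattern described here (the first caller kicks the computation onto a background pool; later callers reuse the completed result without forking) can be sketched with a plain CompletableFuture standing in for PlainListenableActionFuture; all names below are illustrative, not the code in #39827.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.IntSupplier;

// Illustrative single-flight cache for the compressed-size computation.
final class SizeFutureCache {
    private CompletableFuture<Integer> sizeFuture; // reset whenever the cluster state changes

    // Call this from the cluster state update path to invalidate the cached result.
    synchronized void onNewClusterState() {
        sizeFuture = null;
    }

    // The first caller forks the computation onto the given executor (standing in for the
    // generic thread pool); subsequent callers share the same future, so a caller that
    // arrives after completion gets the value without any context switch.
    synchronized CompletableFuture<Integer> compressedSize(IntSupplier computeSize, ExecutorService generic) {
        if (sizeFuture == null) {
            sizeFuture = CompletableFuture.supplyAsync(computeSize::getAsInt, generic);
        }
        return sizeFuture;
    }
}
```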

@DaveCTurner
Copy link
Contributor Author

We discussed this and decided we'd like to remove the reporting of the compressed cluster state size here. Ideally we will deprecate this in 6.7 and remove it in 7.0 by following these steps:

  1. Introduce a flag on the cluster state API, ?compressed_cluster_state_size, and a corresponding flag in a ClusterStateRequest, allowing users to suppress this calculation. The flag defaults to true. Emit a deprecation warning when processing a ClusterStateRequest with this flag set. Backport this change all the way to 6.7.
  2. Remove the calculation, ignoring the flag. Emit a deprecation warning when the flag is mentioned. Backport this change to 7.0.
  3. Remove the flag in master only.

The BWC involved in the first step seems to be tricky, because we process ClusterStateRequests all over the place. Not only are they used directly in many REST tests (e.g. to find the master) but they are also used to serve other API endpoints like GET _cluster/settings and GET _cat/nodes. This means that in a mixed 6.7/6.x cluster (x < 7) with a 6.7 master we might call one of these APIs on a 6.6 node and the ClusterStateRequest it sends to the master would default this flag to true, resulting in a deprecation warning. This causes tests to fail.

I'm currently stuck trying to work out how best to permit (but not require) this warning to happen only if sending requests via a 6.x node.
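For illustration, the opt-out from step 1 would look something like this at the REST level (the parameter name comes from the plan above; the exact semantics are whatever the PRs settle on):

```
# Default: the compressed size is still computed and returned, with a deprecation warning
GET /_cluster/state

# Proposed opt-out: skip the serialization and omit the size from the response
GET /_cluster/state?compressed_cluster_state_size=false
```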

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 12, 2019
Today when we respond to a `cluster:monitor/state` action we serialize the
current cluster state on the network thread in order to return its size when
compressed. This serialization can be expensive if the cluster state is very
large. Yet the size we return is not a useful number to report to clients.

We therefore plan to remove the size from the cluster state response entirely. This is
the first step, in which the size computation becomes optional and deprecated.

Relates elastic#39806
@DaveCTurner
Copy link
Contributor Author

Closed by the PRs above.
