
Cluster state is serialized on the network thread #39806

Closed
DaveCTurner opened this issue Mar 7, 2019 · 5 comments
Labels
>bug, :Distributed Coordination/Cluster Coordination (cluster formation and cluster state publication, including cluster membership and fault detection)

Comments

@DaveCTurner
Contributor

Today when we respond to a cluster:monitor/state action we serialize the current cluster state:

PublicationTransportHandler.serializeFullClusterState(currentState, Version.CURRENT).length(), false));

However, we do so on the network thread:

// very lightweight operation in memory, no need to fork to a thread
return ThreadPool.Names.SAME;

This is a problem if the cluster state is large, because it consumes a network thread doing non-network things. It's particularly a problem when using lots of transport clients with sniffing enabled, since each client will, by default, call this action every 5 seconds.

Yet we do this serialization to get the compressed size of the cluster state so that we can report it (see #3415). Many clients will not care about this. Maybe we can omit it? Alternatively, note that it only depends on the cluster state so maybe we can cache it and invalidate the cache on each cluster state update.
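A minimal sketch of that caching idea, assuming the serialization helper quoted above is reachable from the cache (class, field names, and package paths here are assumptions for illustration, not the actual Elasticsearch code): keep the compressed size keyed by the cluster state version and recompute only when the version changes.

```java
import java.io.IOException;

import org.elasticsearch.Version;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.coordination.PublicationTransportHandler;

// Hypothetical cache: recompute the compressed size only when the cluster state version changes.
final class CompressedStateSizeCache {
    private long cachedVersion = -1L;
    private int cachedSizeInBytes = -1;

    synchronized int sizeInBytes(ClusterState state) throws IOException {
        if (state.version() != cachedVersion) {
            // Same call as the one quoted above, but its result is now reused across requests.
            cachedSizeInBytes = PublicationTransportHandler
                .serializeFullClusterState(state, Version.CURRENT).length();
            cachedVersion = state.version();
        }
        return cachedSizeInBytes;
    }
}
```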

@DaveCTurner added the >bug and :Distributed Coordination/Cluster Coordination labels Mar 7, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@original-brownbear
Member

Alternatively, note that it only depends on the cluster state so maybe we can cache it and invalidate the cache on each cluster state update.

This sounds reasonable. Even if the state is 100MB or so, it shouldn't take that much time to serialise it (I hope at least :)), so it seems like a reasonable improvement.

Though I wonder: why not just run this serialization on the generic thread pool, just to be safe (we could still cache to save some CPU)? That would seem like the cleaner approach, and the cost of the context switches is probably well worth not having to see this in profiling on the transport thread when trying to work out where the next slowness or blocker lives :)
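For reference, the change being suggested here would be roughly a one-line switch in the action's executor() override, shown as a sketch rather than the actual patch:

```java
@Override
protected String executor() {
    // Fork the potentially expensive serialization off the network thread.
    return ThreadPool.Names.GENERIC;
}
```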

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 8, 2019
Today we compute the size of the compressed cluster state for each
`cluster:monitor/state` action. We do so by serializing the whole cluster state
and compressing it, and this happens on the network thread. This calculation
can be rather expensive if the cluster state is large, and these actions can be
rather frequent particularly if there are sniffing transport clients in use.
Also the calculation is a simple function of the cluster state, so there is
a lot of duplicated work here.

This change introduces a small cache for this size computation to avoid all
this duplicated work, and to avoid blocking network threads.

Fixes elastic#39806.
@DaveCTurner
Copy link
Contributor Author

@original-brownbear I was missing knowledge of PlainListenableActionFuture; it turns out a cache is quite easy to implement once I discovered that. I opened #39827, which avoids any context switches if the computation is already done, but which runs the serialisation on the generic thread pool if not.
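The single-flight pattern described here (the first caller kicks the computation onto a background pool; later callers reuse the completed result without forking) can be sketched with a plain CompletableFuture standing in for PlainListenableActionFuture; all names below are illustrative, not the code in #39827.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.IntSupplier;

// Illustrative single-flight cache for the compressed-size computation.
final class SizeFutureCache {
    private CompletableFuture<Integer> sizeFuture; // reset whenever the cluster state changes

    // Call this from the cluster state update path to invalidate the cached result.
    synchronized void onNewClusterState() {
        sizeFuture = null;
    }

    // The first caller forks the computation onto the given executor (standing in for the
    // generic thread pool); subsequent callers share the same future, so a caller that
    // arrives after completion gets the value without any context switch.
    synchronized CompletableFuture<Integer> compressedSize(IntSupplier computeSize, ExecutorService generic) {
        if (sizeFuture == null) {
            sizeFuture = CompletableFuture.supplyAsync(computeSize::getAsInt, generic);
        }
        return sizeFuture;
    }
}
```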

@DaveCTurner
Copy link
Contributor Author

We discussed this and decided we'd like to remove the reporting of the compressed cluster state size here. Ideally we will deprecate this in 6.7 and remove it in 7.0 by following these steps:

  1. Introduce a flag on the cluster state API, ?compressed_cluster_state_size, and a corresponding flag in a ClusterStateRequest, allowing users to suppress this calculation. The flag defaults to true. Emit a deprecation warning when processing a ClusterStateRequest with this flag set. Backport this change all the way to 6.7.
  2. Remove the calculation, ignoring the flag. Emit a deprecation warning when the flag is mentioned. Backport this change to 7.0.
  3. Remove the flag in master only.

The BWC involved in the first step seems to be tricky, because we process ClusterStateRequests all over the place. Not only are they used directly in many REST tests (e.g. to find the master) but they are also used to serve other API endpoints like GET _cluster/settings and GET _cat/nodes. This means that in a mixed 6.7/6.x cluster (x < 7) with a 6.7 master we might call one of these APIs on a 6.6 node and the ClusterStateRequest it sends to the master would default this flag to true, resulting in a deprecation warning. This causes tests to fail.

I'm currently stuck trying to work out how best to permit (but not require) this warning to happen only if sending requests via a 6.x node.
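For illustration, the opt-out from step 1 would look something like this at the REST level (the parameter name comes from the plan above; the exact semantics are whatever the PRs settle on):

```
# Default: the compressed size is still computed and returned, with a deprecation warning
GET /_cluster/state

# Proposed opt-out: skip the serialization and omit the size from the response
GET /_cluster/state?compressed_cluster_state_size=false
```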

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 12, 2019
Today when we respond to a `cluster:monitor/state` action we serialize the
current cluster state on the network thread in order to return its size when
compressed. This serialization can be expensive if the cluster state is very
large. Yet the size we return is not a useful number to report to clients.

We therefore plan to remove the size from the cluster state response entirely. This is
the first step, in which the size computation becomes optional and deprecated.

Relates elastic#39806
@DaveCTurner
Copy link
Contributor Author

Closed by the PRs above.
