Polling for cluster diagnostics information #89014

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

masseyke merged 24 commits into elastic:main from masseyke:feature/polling-cluster-diag-no-mutex

Aug 8, 2022

Member

masseyke commented Aug 1, 2022

As shown in step 1.2.2.3 in the diagram at #85624 (comment), if there has not been a node elected master for more than 30 seconds, then when a user queries a non-master-eligible node it is supposed to reach out to a master-eligible node to gather information about what that node thinks of master stability. Since we don't want to block a user request while making those remote calls, it should be done speculatively so that we have the information when it is needed. If a non-master-eligible node gets a cluster state changed event that the master is null, then it will wait 10 seconds (in case it gets another cluster state changed event that the master has flipped to non-null quickly, which is the most likely scenario), then reach out to a random master-eligible node using a transport action (#87984) to get its cluster diagnostics data. This data is stashed away in case a user request comes in. Since the user won't hit this path for 30 seconds after a master has become null (because we tell the user they have a stable master unless it's been missing for more than 30 seconds), we have plenty of time for this -- 10 seconds of delay in case we're notified that there is a new master plus 10 seconds to get the result back from the master-eligible node before timing out, plus 10 seconds to spare.
Note that the pattern used in this PR for doing asynchronous polling is very similar to what was done in #88874. The main differences are that we're polling a single master node here instead of all of them, and we're using a different transport action.
Also note that there will be a follow-up PR that actually uses this information to make decisions about master stability. This PR only keeps it for later use.

masseyke added 6 commits

August 1, 2022 10:04


          Initial commit

dd6b57d


          merged main

f95372c


          fixing comments

c47abb0


          merging main

530bac6


          adding integration tests

c10f5ce


          cleaning up

d5cf76d

masseyke added >enhancement :Data Management/Health v8.5.0 labels

Collaborator

elasticsearchmachine commented Aug 1, 2022

Hi @masseyke, I've created a changelog YAML for you.


          Update docs/changelog/89014.yaml

3805d60

masseyke mentioned this pull request

Polling for cluster diagnostics information #88562

Closed

masseyke added 5 commits

August 1, 2022 16:34


          updating comments

dee7aff


          Merge branch 'feature/polling-cluster-diag-no-mutex' of github.com:ma…

fb11578

…sseyke/elasticsearch into feature/polling-cluster-diag-no-mutex


          improving comments

5c06878


          checkstyle

6ffcf7a


          reverting accidental change

59409ff

masseyke requested review from andreidan and jbaiera

August 1, 2022 22:09

masseyke marked this pull request as ready for review

August 1, 2022 22:09

Collaborator

elasticsearchmachine commented Aug 1, 2022

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine added the Team:Data Management label

masseyke added 3 commits

August 1, 2022 17:28


          adding a unit test

6e6328a


          ensuring a stable cluster before running test

c29f609


          merging main

7754f88

andreidan reviewed

View reviewed changes

Contributor

andreidan left a comment

Thanks for working on this Keith

Left a few comments

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Outdated

@@ @@ -814,6 +840,198 @@ private Scheduler.Cancellable fetchClusterFormationInfo( @@
                       }, remoteRequestInitialDelay, ThreadPool.Names.SAME);
                   }
+                  void beginPollingRemoteStableMasterHealthIndicatorService() {

Contributor

andreidan Aug 3, 2022

This method and its overloads are rather confusing as the methods don't call into the health indicator service but into are remote diagnostics service.
Should we rename them?
ie. pollRemoteMasterStabilityDiagnostic or something along those lines

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Outdated

+                   * for diagnosis. It is null when no polling is occurring.
+                   * The field is accessed (reads/writes) from multiple threads, and is also reassigned on multiple threads.
+                   */
+                  volatile AtomicReference<Scheduler.Cancellable> remoteStableMasterHealthIndicatorTask = null;

Contributor

andreidan Aug 3, 2022

Similarly, the naming is a bit confusing as it's not storing anything that has to do with the StableMasterHealthIndicator

Contributor

andreidan Aug 3, 2022

Does this need to be an AtomicReference or a volatile Cancellable would suffice?

Member Author

masseyke Aug 3, 2022

We potentially assign this multiple times on the same polling "stack". So I think you're right that it doesn't necessarily have to be atomic, but it has to be some reference because we need to access it from the cancel method. That is, if I only had a Cancellable here, I couldn't reassign that same Cancellable reference to a new Cancellable and still have access to it in the cancel method. I can't just assign it to the object-level variable here because there might be more than one polling stack running at the same time.

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Outdated

+                  /**
+                   * This wraps the responseConsumer in a Consumer that will run rescheduleDiagnosticsFetchConsumer() after responseConsumer has
+                   * completed, adding the resulting Cancellable to cancellableConsumer.

Contributor

andreidan Aug 3, 2022

Shall we update this to be accurate? ie cancellableConsumer is not a thing here

Contributor

andreidan Aug 4, 2022

Suggested change

      
                 * completed, adding the resulting Cancellable to cancellableConsumer.
          
                 * completed, adding the resulting Cancellable to cancellableReference.

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Outdated

Comment on lines 953 to 956

+                               * Since this block is not synchronized it is possible that this task is cancelled between the check above and when the
+                               * code below is run, but this is harmless and not worth the additional synchronization in the normal case. The result
+                               * will just be ignored.
+                               */

Contributor

andreidan Aug 3, 2022

This is a bit misleading as the check above is

if (masterEligibleNode == null) {

I think we should rephrase this comment

Member Author

masseyke Aug 3, 2022

I'm not sure why I put that there. The comment in rescheduleDiagnosticsFetchConsumer covers this, so I'll just remove this one.

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Outdated

Comment on lines 1011 to 1014

+                                   * Since this block is not synchronized it is possible that this task is cancelled between the check above and when the
+                                   * code below is run, but this is harmless and not worth the additional synchronization in the normal case. In that case
+                                   * the connection will just be closed and the transport request will not be made.
+                                   */

Contributor

andreidan Aug 3, 2022

This is a bit misleading as the check above is either

if (masterEligibleNode == null) {

or

if (masterEligibleNode.getVersion().onOrAfter(minSupportedVersion) == false) {

I think we should rephrase this comment

Member Author

masseyke Aug 3, 2022

What I was trying to say is better said in rescheduleDiagnosticsFetchConsumer, so removing this comment.

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Outdated

+                      }, remoteRequestInitialDelay, ThreadPool.Names.SAME);
+                  }
+                  void cancelPollingRemoteStableMasterHealthIndicatorService() {

Contributor

andreidan Aug 3, 2022

We should rename this as we're not polling StableMasterHealthIndicatorService

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Show resolved Hide resolved

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Show resolved Hide resolved

masseyke added 4 commits

August 3, 2022 08:38


          merging main

96222cf


          renaming methods

c826129


          code review feedback

d2f7978


          null check

34a7be6


          combining fetchClusterFormationInfo and fetchCoordinationDiagnostics …

7678fd9

…into one method

masseyke requested a review from andreidan

August 3, 2022 19:14


          fixed version number

40894ec

andreidan reviewed

View reviewed changes

Contributor

andreidan left a comment

Thanks for iterating on this Keith.

This looks great ! Left a few more minor suggestions

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Outdated

                   /**
                    * This is the amount of time that we wait before scheduling a remote request to gather diagnostic information. It is not
                    * user-configurable, but is non-final so that integration tests don't have to waste 10 seconds.
                    */
                   // Non-private for testing
-                  TimeValue remoteRequestInitialDelay = new TimeValue(10, TimeUnit.SECONDS);
+                  public TimeValue remoteRequestInitialDelay = new TimeValue(10, TimeUnit.SECONDS);

Contributor

andreidan Aug 4, 2022 •

edited

Loading

I think we can keep this default visible if we move StableMasterDisruptionIT to the org.elasticsearch.cluster.coordination package (it doesn't really need to be under discovery - and the cluster.coordination package is better suited given the indicator is under the same package too)

update: I think the purpose of StableMasterDisruptionIT is a bit overloaded as it tests cluster formation related things and health service indicator reporting. The tests that we added in connection to what the StableMasterHealthIndicatorService reports should be moved in a StableMasterHealthIndicatorDisruptionIT test that matches the cluster.coordination package

What do you think?

Member Author

masseyke Aug 4, 2022

I'll take this out of this PR and do that in another one since it's unrelated to this work.

Contributor

andreidan Aug 4, 2022

Sounds good, thanks Keith

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Outdated

+                  /**
+                   * This wraps the responseConsumer in a Consumer that will run rescheduleDiagnosticsFetchConsumer() after responseConsumer has
+                   * completed, adding the resulting Cancellable to cancellableConsumer.

Contributor

andreidan Aug 4, 2022

Suggested change

      
                 * completed, adding the resulting Cancellable to cancellableConsumer.
          
                 * completed, adding the resulting Cancellable to cancellableReference.

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Outdated

+                   * exception are transformed into a common type T with responseToResultFunction or exceptionToResultFunction, and then consumed by
+                   * responseConsumer.
+                   *
+                   * @param masterEligibleNode        The master eligible node to be queried, or null if we do not yet know of a master eligible node

Contributor

andreidan Aug 4, 2022

Shall we also mention that if null is passed that's the result the response consumer will get?

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Outdated

+                  }
+                  /**
+                   * This method connects to masterEligibleNode and sends it a transport request for a remote response of type R. The remote response or

Contributor

andreidan Aug 4, 2022

Suggested change

      
                 * This method connects to masterEligibleNode and sends it a transport request for a remote response of type R. The remote response or
          
                 * This method connects to masterEligibleNode and sends it a transport request for a response of type R. The  response or

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Outdated

Comment on lines 929 to 930

		* @param exceptionToResultFunction A function that converts a remote exception to the response type expected by the responseConsumer
		* @param responseToResultFunction A function that converts a remote response to the response type expected by the responseConsumer

Contributor

andreidan Aug 4, 2022

Could these 2 functions be replaced by a BiFunction<R, Exception, T> responseTransformer and then we can call this fetchRemoteObject with a method reference (transformResponse?) that takes either a response R or an Exception?

ie.

transformResponse(@Nullable R response, @Nullable Exception e) { 
 assert r != null || e != null : "a response or an exception must be provided";  
 if( response != null) { 
    return new ClusterFormationStateOrException(response.getClusterFormationState());
   } else {
   return new ClusterFormationStateOrException(e);
  }

What do you think?

Member Author

masseyke Aug 4, 2022

That seems like it'll be a little less clear to read, but I'll do it and we can see.

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Outdated

+                   * @param minSupportedVersion       The minimum version that the transport action works with
+                   * @return A Cancellable for the task that is scheduled to fetch the remote information
+                   */
+                  private <R extends ActionResponse, T> Scheduler.Cancellable fetchRemoteObject(

Contributor

andreidan Aug 4, 2022

Is Object a bit too generic here? (and possibly overlapping the java.lang.Object concept?)

Should we call it something along the lines of sendNodeRequest (as the parameters do the rest in documenting what it's actually fetching) or fetchRemoteDiagnosticInfo ?

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java Show resolved Hide resolved


          code review feedback

6ba4318

masseyke requested a review from andreidan

August 4, 2022 14:12

Member Author

masseyke commented Aug 4, 2022

@elasticsearchmachine update branch

masseyke added 2 commits

August 4, 2022 12:45


          Merge branch 'main' into feature/polling-cluster-diag-no-mutex

ae7cabd


          merging main

af06b31

andreidan approved these changes

View reviewed changes

Contributor

andreidan left a comment

Thanks for iterating on this Keith

LGTM 🚀

masseyke merged commit ee33383 into elastic:main

masseyke mentioned this pull request

Fixing remote master stability request when there has never been an elected master #89214

Merged

masseyke deleted the feature/polling-cluster-diag-no-mutex branch

August 9, 2022 16:01

masseyke mentioned this pull request

Adding a check to the master stability health API when there is no master and the current node is not master eligible #89219

Merged

elasticsearchmachine pushed a commit that referenced this pull request


          Fixing remote master stability request when there has never been an e…

e4a19d4

…lected master (#89214)

This fixes an edge case in the master stability polling code from
#89014. If there has not been an elected master node for the entire life
of a non-master-eligible node, then `clusterChanged()` will have never
been called on that node, so
`beginPollingRemoteMasterStabilityDiagnostic()` will have never been
called. And even though the node might know of some master-eligible
nodes, it will never have requested diagnostic information from them.
This PR adds a call to `beginPollingRemoteMasterStabilityDiagnostic` in
`CoordinationDiagnosticsService`'s constructor to cover this edge case.
In almost all cases, `clusterChanged()` will be called within 10 seconds
so the polling will never occur. However if there is no master node then
there will be no cluster changed events, and `clusterChanged()` will not
be called, and the results of the polling will likely be useful. This PR
has several possibly controversial pieces of code. I'm listing them here
with some discussion:

1. Because there is now a call to `beginPollingRemoteMasterStabilityDiagnostic()` in the ~~constructor~~ object's initialization code, `beginPollingRemoteMasterStabilityDiagnostic()` is no longer solely called from the cluster change thread. However, this call happens before the object is registered as a cluster service listener, so there is no new thread safety concern.
2. Because there is now a call to `beginPollingRemoteMasterStabilityDiagnostic()` in the ~~constructor~~ object's initialization code, we have to explicitly switch to the system context so that the various transport requests work in secure mode.
3. ~~When we're in the constructor, we don't actually know yet whether we're a master eligible node or not, so we kick off `beginPollingRemoteMasterStabilityDiagnostic()` for all node types, including master-eligible nodes. This will be fairly harmless for master eligible nodes though. In the worst case, they'll retrieve some information that they'll never use. This explains why `clusterChanged()` now cancels polling even if we are on a master eligible node.~~
4. ~~It is now possible that we use `clusterService.state()` before it is ready when we're trying to get the list of master-eligible peers. In production mode this method returns null, so we can check that before using it. If assertions are enabled in the JVM, just calling that method throws an `AssertionError`. I'm currently catching that with the assumption that it is harmless because there does not seem to be a way around it (without even further complicating code).~~
5. ~~It is now possible that we call `transportService.sendRequest()` before the transport service is ready. This happens if the server is initializing unusually slowly (i.e. it takes more than 10 seconds to complete the `Node` constructor) and if assertions are enabled. I don't see a way around this without further complicating the code, so I'm catching `AssertionError` and moving on, with the assumption that it will work 10 seconds later when it runs again. I'm also catching and storing `Exception`, which I think I should have been doing before anyway.~~

Note: Points 3, 4, and 5 are no longer relevant because I moved the call
to `beginPollingRemoteMasterStabilityDiagnostic()` out of the
constructor, and am now calling it after the transport service and
cluster state have been initialized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Data Management/Health >enhancement Team:Data Management v8.5.0