Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Polling for cluster diagnostics information #89014

Merged
merged 24 commits into from
Aug 8, 2022

Conversation

masseyke
Copy link
Member

@masseyke masseyke commented Aug 1, 2022

As shown in step 1.2.2.3 in the diagram at #85624 (comment), if there has not been a node elected master for more than 30 seconds, then when a user queries a non-master-eligible node it is supposed to reach out to a master-eligible node to gather information about what that node thinks of master stability. Since we don't want to block a user request while making those remote calls, it should be done speculatively so that we have the information when it is needed. If a non-master-eligible node gets a cluster state changed event that the master is null, then it will wait 10 seconds (in case it gets another cluster state changed event that the master has flipped to non-null quickly, which is the most likely scenario), then reach out to a random master-eligible node using a transport action (#87984) to get its cluster diagnostics data. This data is stashed away in case a user request comes in. Since the user won't hit this path for 30 seconds after a master has become null (because we tell the user they have a stable master unless it's been missing for more than 30 seconds), we have plenty of time for this -- 10 seconds of delay in case we're notified that there is a new master plus 10 seconds to get the result back from the master-eligible node before timing out, plus 10 seconds to spare.
Note that the pattern used in this PR for doing asynchronous polling is very similar to what was done in #88874. The main differences are that we're polling a single master node here instead of all of them, and we're using a different transport action.
Also note that there will be a follow-up PR that actually uses this information to make decisions about master stability. This PR only keeps it for later use.

@elasticsearchmachine
Copy link
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@masseyke masseyke requested review from andreidan and jbaiera August 1, 2022 22:09
@masseyke masseyke marked this pull request as ready for review August 1, 2022 22:09
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Aug 1, 2022
Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this Keith

Left a few comments

@@ -814,6 +840,198 @@ private Scheduler.Cancellable fetchClusterFormationInfo(
}, remoteRequestInitialDelay, ThreadPool.Names.SAME);
}

void beginPollingRemoteStableMasterHealthIndicatorService() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method and its overloads are rather confusing as the methods don't call into the health indicator service but into are remote diagnostics service.
Should we rename them?
ie. pollRemoteMasterStabilityDiagnostic or something along those lines

* for diagnosis. It is null when no polling is occurring.
* The field is accessed (reads/writes) from multiple threads, and is also reassigned on multiple threads.
*/
volatile AtomicReference<Scheduler.Cancellable> remoteStableMasterHealthIndicatorTask = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, the naming is a bit confusing as it's not storing anything that has to do with the StableMasterHealthIndicator

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be an AtomicReference or a volatile Cancellable would suffice?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We potentially assign this multiple times on the same polling "stack". So I think you're right that it doesn't necessarily have to be atomic, but it has to be some reference because we need to access it from the cancel method. That is, if I only had a Cancellable here, I couldn't reassign that same Cancellable reference to a new Cancellable and still have access to it in the cancel method. I can't just assign it to the object-level variable here because there might be more than one polling stack running at the same time.


/**
* This wraps the responseConsumer in a Consumer that will run rescheduleDiagnosticsFetchConsumer() after responseConsumer has
* completed, adding the resulting Cancellable to cancellableConsumer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we update this to be accurate? ie cancellableConsumer is not a thing here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* completed, adding the resulting Cancellable to cancellableConsumer.
* completed, adding the resulting Cancellable to cancellableReference.

Comment on lines 953 to 956
* Since this block is not synchronized it is possible that this task is cancelled between the check above and when the
* code below is run, but this is harmless and not worth the additional synchronization in the normal case. The result
* will just be ignored.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit misleading as the check above is

if (masterEligibleNode == null) {

I think we should rephrase this comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why I put that there. The comment in rescheduleDiagnosticsFetchConsumer covers this, so I'll just remove this one.

Comment on lines 1011 to 1014
* Since this block is not synchronized it is possible that this task is cancelled between the check above and when the
* code below is run, but this is harmless and not worth the additional synchronization in the normal case. In that case
* the connection will just be closed and the transport request will not be made.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit misleading as the check above is either

if (masterEligibleNode == null) {

or

if (masterEligibleNode.getVersion().onOrAfter(minSupportedVersion) == false) { 

I think we should rephrase this comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I was trying to say is better said in rescheduleDiagnosticsFetchConsumer, so removing this comment.

}, remoteRequestInitialDelay, ThreadPool.Names.SAME);
}

void cancelPollingRemoteStableMasterHealthIndicatorService() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should rename this as we're not polling StableMasterHealthIndicatorService

@masseyke masseyke requested a review from andreidan August 3, 2022 19:14
Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this Keith.

This looks great ! Left a few more minor suggestions

/**
* This is the amount of time that we wait before scheduling a remote request to gather diagnostic information. It is not
* user-configurable, but is non-final so that integration tests don't have to waste 10 seconds.
*/
// Non-private for testing
TimeValue remoteRequestInitialDelay = new TimeValue(10, TimeUnit.SECONDS);
public TimeValue remoteRequestInitialDelay = new TimeValue(10, TimeUnit.SECONDS);
Copy link
Contributor

@andreidan andreidan Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can keep this default visible if we move StableMasterDisruptionIT to the org.elasticsearch.cluster.coordination package (it doesn't really need to be under discovery - and the cluster.coordination package is better suited given the indicator is under the same package too)

update: I think the purpose of StableMasterDisruptionIT is a bit overloaded as it tests cluster formation related things and health service indicator reporting. The tests that we added in connection to what the StableMasterHealthIndicatorService reports should be moved in a StableMasterHealthIndicatorDisruptionIT test that matches the cluster.coordination package

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take this out of this PR and do that in another one since it's unrelated to this work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thanks Keith


/**
* This wraps the responseConsumer in a Consumer that will run rescheduleDiagnosticsFetchConsumer() after responseConsumer has
* completed, adding the resulting Cancellable to cancellableConsumer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* completed, adding the resulting Cancellable to cancellableConsumer.
* completed, adding the resulting Cancellable to cancellableReference.

* exception are transformed into a common type T with responseToResultFunction or exceptionToResultFunction, and then consumed by
* responseConsumer.
*
* @param masterEligibleNode The master eligible node to be queried, or null if we do not yet know of a master eligible node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also mention that if null is passed that's the result the response consumer will get?

}

/**
* This method connects to masterEligibleNode and sends it a transport request for a remote response of type R. The remote response or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* This method connects to masterEligibleNode and sends it a transport request for a remote response of type R. The remote response or
* This method connects to masterEligibleNode and sends it a transport request for a response of type R. The response or

Comment on lines 929 to 930
* @param exceptionToResultFunction A function that converts a remote exception to the response type expected by the responseConsumer
* @param responseToResultFunction A function that converts a remote response to the response type expected by the responseConsumer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these 2 functions be replaced by a BiFunction<R, Exception, T> responseTransformer and then we can call this fetchRemoteObject with a method reference (transformResponse?) that takes either a response R or an Exception?

ie.

transformResponse(@Nullable R response, @Nullable Exception e) { 
 assert r != null || e != null : "a response or an exception must be provided";  
 if( response != null) { 
    return new ClusterFormationStateOrException(response.getClusterFormationState());
   } else {
   return new ClusterFormationStateOrException(e);
  }

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like it'll be a little less clear to read, but I'll do it and we can see.

* @param minSupportedVersion The minimum version that the transport action works with
* @return A Cancellable for the task that is scheduled to fetch the remote information
*/
private <R extends ActionResponse, T> Scheduler.Cancellable fetchRemoteObject(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Object a bit too generic here? (and possibly overlapping the java.lang.Object concept?)

Should we call it something along the lines of sendNodeRequest (as the parameters do the rest in documenting what it's actually fetching) or fetchRemoteDiagnosticInfo ?

@masseyke masseyke requested a review from andreidan August 4, 2022 14:12
@masseyke
Copy link
Member Author

masseyke commented Aug 4, 2022

@elasticsearchmachine update branch

Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this Keith

LGTM 🚀

@masseyke masseyke merged commit ee33383 into elastic:main Aug 8, 2022
@masseyke masseyke deleted the feature/polling-cluster-diag-no-mutex branch August 9, 2022 16:01
elasticsearchmachine pushed a commit that referenced this pull request Aug 11, 2022
…lected master (#89214)

This fixes an edge case in the master stability polling code from
#89014. If there has not been an elected master node for the entire life
of a non-master-eligible node, then `clusterChanged()` will have never
been called on that node, so
`beginPollingRemoteMasterStabilityDiagnostic()` will have never been
called. And even though the node might know of some master-eligible
nodes, it will never have requested diagnostic information from them.
This PR adds a call to `beginPollingRemoteMasterStabilityDiagnostic` in
`CoordinationDiagnosticsService`'s constructor to cover this edge case.
In almost all cases, `clusterChanged()` will be called within 10 seconds
so the polling will never occur. However if there is no master node then
there will be no cluster changed events, and `clusterChanged()` will not
be called, and the results of the polling will likely be useful. This PR
has several possibly controversial pieces of code. I'm listing them here
with some discussion:

1. Because there is now a call to `beginPollingRemoteMasterStabilityDiagnostic()` in the ~~constructor~~ object's initialization code, `beginPollingRemoteMasterStabilityDiagnostic()` is no longer solely called from the cluster change thread. However, this call happens before the object is registered as a cluster service listener, so there is no new thread safety concern.
2. Because there is now a call to `beginPollingRemoteMasterStabilityDiagnostic()` in the ~~constructor~~ object's initialization code, we have to explicitly switch to the system context so that the various transport requests work in secure mode.
3. ~~When we're in the constructor, we don't actually know yet whether we're a master eligible node or not, so we kick off `beginPollingRemoteMasterStabilityDiagnostic()` for all node types, including master-eligible nodes. This will be fairly harmless for master eligible nodes though. In the worst case, they'll retrieve some information that they'll never use. This explains why `clusterChanged()` now cancels polling even if we are on a master eligible node.~~
4. ~~It is now possible that we use `clusterService.state()` before it is ready when we're trying to get the list of master-eligible peers. In production mode this method returns null, so we can check that before using it. If assertions are enabled in the JVM, just calling that method throws an `AssertionError`. I'm currently catching that with the assumption that it is harmless because there does not seem to be a way around it (without even further complicating code).~~
5. ~~It is now possible that we call `transportService.sendRequest()` before the transport service is ready. This happens if the server is initializing unusually slowly (i.e. it takes more than 10 seconds to complete the `Node` constructor) and if assertions are enabled. I don't see a way around this without further complicating the code, so I'm catching `AssertionError` and moving on, with the assumption that it will work 10 seconds later when it runs again. I'm also catching and storing `Exception`, which I think I should have been doing before anyway.~~

Note: Points 3, 4, and 5 are no longer relevant because I moved the call
to `beginPollingRemoteMasterStabilityDiagnostic()` out of the
constructor, and am now calling it after the transport service and
cluster state have been initialized.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants