Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a transport action to get cluster formation info #87306

Merged

Conversation

masseyke
Copy link
Member

@masseyke masseyke commented Jun 1, 2022

This PR exposes ClusterFormationFailureHelper.ClusterFormationState in a transport action (ClusterFormationInfoAction), with the intention that it be used as part of the master stability health check. This was started as #85782, but we realized that we needed more fine-grained information than just the string description of the reason the cluster formation failed.
2022-04-18 Coordination health
This action will be used in branch 1.2.2.4 of the attached diagram (near the bottom right corner). That is, when we detect that there has been no master node in the last 30 seconds (default), and when PeerFinder does not believe that there is an elected master node, and when the current node is master eligible, we will use this transport action to asynchronously send a request to all master-eligible nodes to gather their ClusterFormationStates.

@elasticsearchmachine
Copy link
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Hi @masseyke, I've updated the changelog YAML for you.

@masseyke masseyke marked this pull request as ready for review June 14, 2022 18:16
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Jun 14, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@masseyke
Copy link
Member Author

@elasticmachine update branch

Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this Keith

Left a few comments

in.readBoolean(),
readStatusInfo(in),
in.readList(
in1 -> new JoinStatus(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we have a more meaningful name for the inner reader?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or maybe make JoinStatus readable from the wire too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JoinStatus is now Writeable and readable from the wire.

);
}

private static StatusInfo readStatusInfo(StreamInput in) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth making StatusInfo Writeable?

/**
* This transport action fetches the ClusterFormationState from a remote node.
*/
public static class TransportAction extends HandledTransportAction<
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good tho, it's just a simple HandledTransportAction here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah of course - I just realised it does the needful

        transportService.registerRequestHandler(actionName, executor, false, canTripCircuitBreaker, requestReader, new TransportHandler());

Sorry for the noise Keith


public static class Request extends ActionRequest {

public Request() {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll have to request this information from all master eligible nodes so the request should be able to express that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a point-to-point request which I think makes sense as-is. The caller will have to do the work of sending one to each master of course, but that's ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okido, sounds good

@@ -284,7 +284,7 @@ public Coordinator(
this.nodeHealthService = nodeHealthService;
}

private ClusterFormationState getClusterFormationState() {
public ClusterFormationState getClusterFormationState() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaveCTurner does it make sense to send this over if the discovery cluster formation timeout has not lapsed yet?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think so. If we only just lost the master then it might be a little incomplete but I expect it to converge pretty quickly.

);
}

private static boolean calculateHasDiscoveredQuorum(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would it make sense to obey the isXXX/hasXXX naming convention for methods returning boolean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, we're trying to calculate the value to set hasDiscoveredQuorum to. I think if I named this hasDiscoveredQuorum() the expectation would be that this is a simple getter on that field. But it's not -- it's meant to be called just once to calculate what that field ought to be. Subsequent access would be through the record's access method to the hasDiscoveredQuorum field.

@masseyke masseyke requested a review from andreidan June 16, 2022 14:45
Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for working on this Keith

/**
* This transport action fetches the ClusterFormationState from a remote node.
*/
public static class TransportAction extends HandledTransportAction<
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah of course - I just realised it does the needful

        transportService.registerRequestHandler(actionName, executor, false, canTripCircuitBreaker, requestReader, new TransportHandler());

Sorry for the noise Keith

Comment on lines 50 to 51
out.writeOptionalString(status == null ? null : status.name());
out.writeOptionalString(info);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we changing the semantics for StatusInfo ? I don't think these fields are optional now. I think we should keep them non-null (ie. mandatory).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was nothing preventing them from being null. I thought it was very unlikely that someone would pass in nulls (what would be the point?). But if they do, I don't want it to blow up during serialization. From what I can tell there's no good way to enforce null checks on a record. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't change the behaviour here.
Usually, we mark nullable fields with @Nullable or use Optional when we accept null, but neither is present on this object.

IMO we should enforce/relax semantics in a separate and deliberate effort.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I caught up with @dakrone on this and we were thinking of aligning the object to the requirements (ie. non-null fields) via Objects.requireNonNull calls in the constructor, alongside strict (non-optional) serialisation.

What do you think @masseyke ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all for enforcing non-nullable fields. I'll change the serialization to assume no nulls, and we can enforce it on a subsequent PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants