-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding a check to the master stability health API when there is no master and the current node is not master eligible #89219
Adding a check to the master stability health API when there is no master and the current node is not master eligible #89219
Conversation
…ster and the current node is not master eligible
Hi @masseyke, I've created a changelog YAML for you. |
Pinging @elastic/es-data-management (Team:Data Management) |
@masseyke Just a note - in the description it says |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this Keith
Left a few minor suggestions
summary = String.format( | ||
Locale.ROOT, | ||
"No master node observed in the last %s, and this node is not master eligible. Reaching out to a master-eligible node" | ||
+ " for more information, but no result yet.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we say we couldn't further diagnose as we couldn't reach to a master eligible node?
the "no result yet" bit implies we're still working on it or that a different result is coming somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's polling random master eligible nodes for information, and we will get a result from one in the future (even if that result is TimedOutException or equivalent). If there were no master eligible nodes at all we would have reported that in another branch. The only way we'd get here and not have a result coming in the future is if we have a bug in the code (like the one fixed by #89014 -- it actually leads to this message being given with no result ever coming from a remote node).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sure. I'm not sure we want to propagate this message in the health API though as the API is request/response. With that in mind I think it'd make sense to tell the API user "we couldn't gather more information" as opposed to an ominous "not yet" :)
I understand that the diagnostics service is a continuously running service however, we seem to pass on this summary in the health API so we either:
- make sure the message makes sense in the API too (which is what I was proposing)
- keep the message as is (makes sense in the context of the diagnostics service) but we parse and rewrite it in the API to make sennse there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha I thought the "not yet" made it more optimistic (check back later and we might know more) than more ominous, but I can drop it.
if (explain) { | ||
details = CoordinationDiagnosticsDetails.EMPTY; | ||
} else { | ||
details = CoordinationDiagnosticsDetails.EMPTY; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
details always stay empty here. Do we want to return the "local details" if explain is true? ie. the details section with the local master history and such?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah might as well -- based on the if/else doing the same thing I can only assume I must have intended to do that and then... I'm not sure why I didn't.
details = CoordinationDiagnosticsDetails.EMPTY; | ||
} | ||
} else { | ||
// It should not be possible to get here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this branch even exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could change it to throw an AssertionError instead I guess. I just figured that eventually somehow we'd get into the impossible situation, and better to return something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually how about I assert that we're not there so that it'll fail if we have assertions enabled, but I'll return what I'm currently returning so that it won't blow up on the user if we've somehow introduced a bug and missed it in testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what is the else
?
RemoteMasterHealthResult
must have a result or an exception, if this is not enforced, let's enforced it in the constructor there ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks Keith
This PR builds on #86524, #87482, and #87306 by supporting the case where there has been no master node in the last 30 second, no node has been elected master, and the current node is not master eligible. This is branch 1.2.2.3 in the diagram at #87482 (comment).
The outline of the logic is that when we see that the master node has gone null, we start polling a random master-eligible node for its CoordinationDiagnosticsResult. Once a diagnoseMasterStability() request comes in we look at the CoordinationDiagnosticsResult from the master eligible node, and the result is one of the following:
(1) No result yet:
(2) Non-green status:
(3) Green remote status:
(4) Timeout exception: