-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Master stability health indicator part 1 (when a master has been seen recently) #86524
Master stability health indicator part 1 (when a master has been seen recently) #86524
Conversation
…as been seen recently)
Hi @masseyke, I've created a changelog YAML for you. |
…b.com:masseyke/elasticsearch into feature/health-api-master-stability-indicator
Relates #85941 |
Pinging @elastic/es-data-management (Team:Data Management) |
Pinging @elastic/clients-team (Team:Clients) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this Keith.
Wasn't sure if you'd want a review for this as no reviewer was selected but went through it and added some thoughts.
...est/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorServiceTests.java
Outdated
Show resolved
Hide resolved
...est/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorServiceTests.java
Outdated
Show resolved
Hide resolved
...est/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorServiceTests.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java
Outdated
Show resolved
Hide resolved
* @param explain Whether to calculate and include the details and user actions in the result | ||
* @return The HealthIndicatorResult for the given localMasterHistory | ||
*/ | ||
private HealthIndicatorResult calculateWhenHaveSeenMasterRecently(MasterHistory localMasterHistory, boolean explain) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of this method indicates it's looking for the time something happened - the when. Which is not correct AFAICS. Should we name it differently? ie. onSeenMaster
/ checkIfCurrentMasterIsStable
? Or something along those lines?
* @param explain Whether to calculate and include the details in the result | ||
* @return The HealthIndicatorResult for the given localMasterHistory | ||
*/ | ||
private HealthIndicatorResult calculateWhenHaveNotSeenMasterRecently(MasterHistory localMasterHistory, boolean explain) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, I think this should be named according to what it does/what it looks for as opposed to when it's called (the when in the name is confusing as it points to calculating the time when an event happened)
Maybe onNotSeenMaster
/ onNoLocalMaster
or something similar?
|| MasterHistory.hasMasterGoneNullAtLeastNTimes(remoteHistory, acceptableNullTransitions + 1) | ||
|| MasterHistory.getNumberOfMasterIdentityChanges(remoteHistory) > acceptableIdentityChanges; | ||
if (masterConfirmedUnstable) { | ||
if (localNodeIsMaster == false && remoteHistory == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if remoteHistory is null but there's an in-flight request to get it I believe the status should still be GREEN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our design doc says we're OK with it being YELLOW here right?
* or changed identity repeatedly, then we have a problem (the master has confirmed what the local node saw). | ||
*/ | ||
boolean masterConfirmedUnstable = localNodeIsMaster | ||
|| remoteHistory == null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if remoteHistory is null but there's an in-flight request to get it I believe the status should still be GREEN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had multiple offline conversations about this and decided to make two changes:
- We're introducing the notion of a remote history becoming stale (Timing out stale remote master history #86936)
- If the remote master history is unset or stale we'll return the status as GREEN
@DaveCTurner what do you think about using the disruption tests infrastructure to add some integration tests for this indicator? |
Yes we should have some integ tests for this. I also suggest trying to add some tests that use the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this Keith.
I left some comments.
I am wondering if it'd make sense to separate the "master stability diagnosis" part of this into its own service (that'll listen to cluster change events, and perform the needed diagnosis when requested).
I've recommended something similar for the shards_availability indicator and we should revise it there too (// cc @jbaiera )
Currently, the indicator is mixing the diagnosis logic with the representation of the diagnosis.
Modularising it a bit might aid with future requirements where we'd want (possibly) the diagnosing logic to be used in other transports/services.
What do you think?
private static final int DEFAULT_ACCEPTABLE_NULL_TRANSITIONS = 3; | ||
private static final int SMALLEST_ALLOWED_ACCEPTABLE_NULL_TRANSITIONS = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to move away from the acceptable/unacceptable language to having a single "threshold" concept?
eg. DEFAULT_NULL_TRANSITIONS_THRESHOLD = 4
(similar renaming for the corresponding setting and variable that reads it).
DEFAULT_IDENTITY_CHANGES_THRESHOD = 4
(similar renaming for the corresponding setting and variable that reads it).
I don't think we need constants for the min values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to keep the wording in line with the diagram we have. But I guess it really doesn't matter because that diagram is not meant to be long-lived.
|
||
// This is the default amount of time we look back to see if we have had a master at all, before moving on with other checks | ||
private static final TimeValue DEFAULT_VERY_RECENT_PAST = new TimeValue(30, TimeUnit.SECONDS); | ||
private static final TimeValue SMALLEST_ALLOWED_VERY_RECENT_PAST = new TimeValue(1, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the naming - would it make sense to have the naming be more specific ?
ie. DEFAULT_VERY_RECENT_PAST
-> NODE_HAS_MASTER_LOOKUP_TIMEFRAME
or something similar?
Same for the variables and corresponding setting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* @return An empty HealthIndicatorDetails if explain is false, otherwise a HealthIndicatorDetails containing only "current_master" | ||
* and "recent_masters" | ||
*/ | ||
private HealthIndicatorDetails getSimpleDetails(boolean explain, MasterHistory localMasterHistory) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would getDetails or getIndicatorDetails be more accurate? Not sure simple
is adding information to the caller/method description.
What do you think?
for (DiscoveryNode recentMaster : recentMasters) { | ||
if (recentMaster != null) { | ||
builder.startObject(); | ||
builder.field("node_id", recentMaster.getId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would printing the start timestamp be useful here too? (to indicate how quickly they changed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no notion of timestamp here, mostly so that we don't leak out the notion of the machine's relative timestamps that don't mean anything outside of the JVM. You think we ought to add some public notion of time to https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/cluster/coordination/MasterHistory.java? That might be best in a couple of follow-up PRs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ Good point. Let's leave it as is for now and decide later (after we use the indicator a bit) if we need more details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments but this generally looks good!
private static final TimeValue SMALLEST_ALLOWED_VERY_RECENT_PAST = new TimeValue(1, TimeUnit.SECONDS); | ||
|
||
// This is the default number of times that it is OK to have a master go null. Any more than this will be reported as a problem | ||
private static final int DEFAULT_ACCEPTABLE_NULL_TRANSITIONS = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need all of these as private static variables, we can put the comment and the actual value (3 in this case) in the setting itself, which is already public and static?
I prefer to have the hard value with the setting itself, so if you look at the Setting
you don't have to do another reference jump to see the default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
private static final String UNSTABLE_MASTER_BACKUP_IMPACT = "Snapshot and restore will not work."; | ||
|
||
public static final Setting<TimeValue> VERY_RECENT_PAST_SETTING = Setting.timeSetting( | ||
"health.master_history.very_recent_past", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a very descriptive name (just "very_recent_past"), I'd prefer something like "stability_window" to (hopefully?) clarify a little bit more what it does from the name alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to NODE_HAS_MASTER_LOOKUP_TIMEFRAME_SETTING based on feedback from Andrei.
); | ||
|
||
public static final Setting<Integer> ACCEPTABLE_NULL_TRANSITIONS_SETTING = Setting.intSetting( | ||
"health.master_history.acceptable_null_transitions", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should "leak" the word/concept of 'null' here, perhaps "acceptable_no_master_transitions
" or "acceptable_none_transitions
"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the setting name and the setting variable name, but left internal variable names and method names as they were b/c using the word "null" really makes it easier to follow internally. Let me know if you disagree.
HealthStatus stableMasterStatus = HealthStatus.YELLOW; | ||
String summary = String.format( | ||
Locale.ROOT, | ||
"The master has changed %d times in the last %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The master has changed %d times in the last %s", | |
"The elected master node has changed %d times in the last %s", |
} | ||
}); | ||
return builder.endObject(); | ||
} : HealthIndicatorDetails.EMPTY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should limit ternary operators to single-line statements, since it's difficult to parse as a human when it spans tens of lines.
Perhaps we could do something like:
if (explain == false) {
return HealthIndicatorDetails.EMPTY;
} else {
return <this big thing>;
}
|| MasterHistory.getNumberOfMasterIdentityChanges(remoteHistory) >= unacceptableIdentityChanges)); | ||
if (masterConfirmedUnstable) { | ||
logger.trace("The master node {} thinks it is unstable", master); | ||
final HealthStatus stableMasterStatus = HealthStatus.YELLOW; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this variable since it's only used a single place below passed in to the createIndicator
method?
final HealthStatus stableMasterStatus = HealthStatus.YELLOW; | ||
String summary = String.format( | ||
Locale.ROOT, | ||
"The cluster's master has alternated between %s and no master multiple times in the last %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The cluster's master has alternated between %s and no master multiple times in the last %s", | |
"The cluster's elected master node has alternated between %s and no elected master node multiple times in the last %s", |
(I clarify here because I don't want to confuse a user between there being no master
nodes in the cluster, and no elected master node in the cluster)
MasterHistory localMasterHistory, | ||
@Nullable Exception remoteHistoryException | ||
) { | ||
return explain ? (builder, params) -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here about keeping ternary things to single-line statements.
List<UserAction> userActions = List.of(); | ||
logger.trace("The cluster has a stable master node"); | ||
HealthIndicatorDetails details = getSimpleDetails(explain, localMasterHistory); | ||
return createIndicator(stableMasterStatus, summary, details, impacts, userActions); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just pass HealthStatus.GREEN
directly in here?
Collection<HealthIndicatorImpact> impacts = getUnstableMasterImpacts(); | ||
List<UserAction> userActions = getContactSupportUserActions(explain); | ||
return createIndicator( | ||
stableMasterStatus, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for passing HealthStatus.RED
directly in here.
@andreidan it sounds reasonable, but likely a very big change. What do you think about considering that for a follow-up PR (or PRs)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for iterating on this Keith
Left a few nits but this looks great. Thanks for adding all the tests too ! 🚀
} | ||
|
||
public void testRepeatedMasterIdentityChangesRecognizedAsUnstable() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could testRepeatedMasterChanges
be the test? It's documented as a utility method but only used in this test
private static final String UNSTABLE_MASTER_INGEST_IMPACT = "The cluster cannot create, delete, or rebalance indices, and cannot " | ||
+ "insert or update documents."; | ||
private static final String UNSTABLE_MASTER_DEPLOYMENT_MANAGEMENT_IMPACT = "Scheduled tasks such as Watcher, ILM, and SLM will not " | ||
+ "work. The _cat APIs will not work."; | ||
private static final String UNSTABLE_MASTER_BACKUP_IMPACT = "Snapshot and restore will not work."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impacts have slightly changed in the instance has master indicator since this PR was opened.
|
||
// This is the default amount of time we look back to see if we have had a master at all, before moving on with other checks | ||
private static final TimeValue NODE_HAS_MASTER_LOOKUP_TIMEFRAME = new TimeValue(30, TimeUnit.SECONDS); | ||
private static final TimeValue SMALLEST_ALLOWED_HAS_MASTER_LOOKUP_TIMEFRAME = new TimeValue(1, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Should this be MIN_MASTER_LOOKUP_TIMEFRAME
or hardcode the values in the setting definitions like we do with the others?
for (DiscoveryNode recentMaster : recentMasters) { | ||
if (recentMaster != null) { | ||
builder.startObject(); | ||
builder.field("node_id", recentMaster.getId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ Good point. Let's leave it as is for now and decide later (after we use the indicator a bit) if we need more details.
* This returns true if this node has seen a master node within the last few seconds | ||
* @return true if this node has seen a master node within the last few seconds, false otherwise | ||
*/ | ||
private boolean hasSeenMasterInVeryRecentPast() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: now that we renamed the setting to be more precise, should we rename this method too to reflect that it's looking in the configured "had master lookup timeframe" ?
… service (#87482) This builds on #86524 by supporting two additional conditions, both of which happen when there has been no elected master for more than 30 seconds (from the queried node's point of view), and both of which return a RED status: (1) There are no master-eligible nodes found in the cluster (2) The node being queried sees a master-eligible node that has been elected the master, but cannot join it
The health indicator for master stability is very large. This is the first PR for the master stability check. It handles the case when we have seen a master node recently (the first two steps in the bulleted list below). The more complicated case when we have not seen a master node recently will be in a subsequent PR.
This indicator reports the health of master stability.
Since this indicator needs to be able to run when there is no master at all, it does not depend on the dedicated health node (which requires the existence of a master). The indicator in this PR replaces InstanceHasMasterHealthIndicatorService as the "pre-flight check" that is run before running any other health indicators.
Here is an example of a response when the master has changed more than 3 times in the last 30 minutes (so the
stable_master
indicator is yellow). Note that all other indicators respond with status "unknown" since we do not evaluate them if there is no stable master:And here is an example response when the master has been stable (not that other indicators are evaluated) :