Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Master stability health indicator part 1 (when a master has been seen recently) #86524

Merged

Conversation

masseyke
Copy link
Member

@masseyke masseyke commented May 6, 2022

The health indicator for master stability is very large. This is the first PR for the master stability check. It handles the case when we have seen a master node recently (the first two steps in the bulleted list below). The more complicated case when we have not seen a master node recently will be in a subsequent PR.
This indicator reports the health of master stability.

  • If we have had a master within the last 30 seconds, and that master has not changed more than 3 times in the last 30 minutes, then this will report GREEN.
  • If we have had a master within the last 30 seconds, but that master has changed more than 3 times in the last 30 minutes (and that is confirmed by checking with the last-known master), then this will report YELLOW.
  • If we have not had a master within the last 30 seconds, then this will will report RED with one exception. That exception is when all of the following apply:
    • No node is elected master
    • This node is not master eligible
    • Some node is master eligible
    • We ask a master-eligible node to run this indicator
    • That master-eligible node comes back with a result that is not RED.

Since this indicator needs to be able to run when there is no master at all, it does not depend on the dedicated health node (which requires the existence of a master). The indicator in this PR replaces InstanceHasMasterHealthIndicatorService as the "pre-flight check" that is run before running any other health indicators.

Here is an example of a response when the master has changed more than 3 times in the last 30 minutes (so the stable_master indicator is yellow). Note that all other indicators respond with status "unknown" since we do not evaluate them if there is no stable master:

$ curl localhost:9200/_internal/_health | python3 -mjson.tool
{
    "status": "yellow",
    "cluster_name": "master-test",
    "components": {
        "cluster_coordination": {
            "status": "yellow",
            "indicators": {
                "stable_master": {
                    "status": "yellow",
                    "summary": "The master has changed 4 times in the last 30m",
                    "details": {
                        "recent_masters": [
                            {
                                "Bi7HMyQ8QmSk12bJk6RNiA": {
                                    "name": "master-node",
                                    "ephemeral_id": "PLml3YF1S5CpxR97Ttn2Tw",
                                    "transport_address": "127.0.0.1:9300",
                                    "external_id": "master-node",
                                    "attributes": {
                                        "xpack.installed": "true"
                                    },
                                    "roles": [
                                        "data",
                                        "master"
                                    ]
                                }
                            },
                            {
                                "Bi7HMyQ8QmSk12bJk6RNiA": {
                                    "name": "master-node",
                                    "ephemeral_id": "e56BVOZ_T1OFK7bmG0PTHQ",
                                    "transport_address": "127.0.0.1:9300",
                                    "external_id": "master-node",
                                    "attributes": {
                                        "xpack.installed": "true"
                                    },
                                    "roles": [
                                        "data",
                                        "master"
                                    ]
                                }
                            },
                            {
                                "Bi7HMyQ8QmSk12bJk6RNiA": {
                                    "name": "master-node",
                                    "ephemeral_id": "xQOwd4blR4GN3Kkx2jJB0Q",
                                    "transport_address": "127.0.0.1:9300",
                                    "external_id": "master-node",
                                    "attributes": {
                                        "xpack.installed": "true"
                                    },
                                    "roles": [
                                        "data",
                                        "master"
                                    ]
                                }
                            },
                            {
                                "Bi7HMyQ8QmSk12bJk6RNiA": {
                                    "name": "master-node",
                                    "ephemeral_id": "Frlk9HMVRL2OOPNQhg2E_w",
                                    "transport_address": "127.0.0.1:9300",
                                    "external_id": "master-node",
                                    "attributes": {
                                        "xpack.installed": "true"
                                    },
                                    "roles": [
                                        "data",
                                        "master"
                                    ]
                                }
                            },
                            {
                                "Bi7HMyQ8QmSk12bJk6RNiA": {
                                    "name": "master-node",
                                    "ephemeral_id": "n4ww22m9SHe-QC-xxenhGQ",
                                    "transport_address": "127.0.0.1:9300",
                                    "external_id": "master-node",
                                    "attributes": {
                                        "xpack.installed": "true"
                                    },
                                    "roles": [
                                        "data",
                                        "master"
                                    ]
                                }
                            }
                        ],
                        "current_master": {
                            "Bi7HMyQ8QmSk12bJk6RNiA": {
                                "name": "master-node",
                                "ephemeral_id": "n4ww22m9SHe-QC-xxenhGQ",
                                "transport_address": "127.0.0.1:9300",
                                "external_id": "master-node",
                                "attributes": {
                                    "xpack.installed": "true"
                                },
                                "roles": [
                                    "data",
                                    "master"
                                ]
                            }
                        }
                    },
                    "impacts": [
                        {
                            "severity": 1,
                            "description": "The cluster cannot create, delete, or rebalance indices, and cannot insert or update documents.",
                            "impact_areas": [
                                "ingest"
                            ]
                        },
                        {
                            "severity": 1,
                            "description": "Scheduled tasks such as Watcher, ILM, and SLM will not work. The _cat APIs will not work.",
                            "impact_areas": [
                                "deployment_management"
                            ]
                        },
                        {
                            "severity": 3,
                            "description": "Snapshot and restore will not work.",
                            "impact_areas": [
                                "backup"
                            ]
                        }
                    ],
                    "user_actions": [
                        {
                            "message": "The Elasticsearch cluster does not have a stable master node. This almost always requires expert assistance. Please contact Elastic support to resolve the problem."
                        }
                    ]
                }
            }
        },
        "data": {
            "status": "unknown",
            "indicators": {
                "shards_availability": {
                    "status": "unknown",
                    "summary": "Could not determine indicator state. Cluster state is not stable. Check details for critical issues keeping this indicator from running.",
                    "details": {
                        "reasons": {
                            "stable_master": "yellow"
                        }
                    }
                },
                "ilm": {
                    "status": "unknown",
                    "summary": "Could not determine indicator state. Cluster state is not stable. Check details for critical issues keeping this indicator from running.",
                    "details": {
                        "reasons": {
                            "stable_master": "yellow"
                        }
                    }
                }
            }
        },
        "snapshot": {
            "status": "unknown",
            "indicators": {
                "repository_integrity": {
                    "status": "unknown",
                    "summary": "Could not determine indicator state. Cluster state is not stable. Check details for critical issues keeping this indicator from running.",
                    "details": {
                        "reasons": {
                            "stable_master": "yellow"
                        }
                    }
                },
                "slm": {
                    "status": "unknown",
                    "summary": "Could not determine indicator state. Cluster state is not stable. Check details for critical issues keeping this indicator from running.",
                    "details": {
                        "reasons": {
                            "stable_master": "yellow"
                        }
                    }
                }
            }
        }
    }
}

And here is an example response when the master has been stable (not that other indicators are evaluated) :

$ curl localhost:9200/_internal/_health | python3 -mjson.tool
{
    "status": "green",
    "cluster_name": "master-test",
    "components": {
        "cluster_coordination": {
            "status": "green",
            "indicators": {
                "stable_master": {
                    "status": "green",
                    "summary": "The cluster has a stable master node",
                    "details": {
                        "recent_masters": [
                            {
                                "Bi7HMyQ8QmSk12bJk6RNiA": {
                                    "name": "master-node",
                                    "ephemeral_id": "n4ww22m9SHe-QC-xxenhGQ",
                                    "transport_address": "127.0.0.1:9300",
                                    "external_id": "master-node",
                                    "attributes": {
                                        "xpack.installed": "true"
                                    },
                                    "roles": [
                                        "data",
                                        "master"
                                    ]
                                }
                            }
                        ],
                        "current_master": {
                            "Bi7HMyQ8QmSk12bJk6RNiA": {
                                "name": "master-node",
                                "ephemeral_id": "n4ww22m9SHe-QC-xxenhGQ",
                                "transport_address": "127.0.0.1:9300",
                                "external_id": "master-node",
                                "attributes": {
                                    "xpack.installed": "true"
                                },
                                "roles": [
                                    "data",
                                    "master"
                                ]
                            }
                        }
                    }
                }
            }
        },
        "data": {
            "status": "green",
            "indicators": {
                "shards_availability": {
                    "status": "green",
                    "summary": "This cluster has all shards available.",
                    "details": {
                        "creating_primaries": 0,
                        "started_primaries": 1,
                        "unassigned_primaries": 0,
                        "initializing_replicas": 0,
                        "started_replicas": 1,
                        "initializing_primaries": 0,
                        "restarting_replicas": 0,
                        "restarting_primaries": 0,
                        "unassigned_replicas": 0
                    }
                },
                "ilm": {
                    "status": "green",
                    "summary": "ILM is running",
                    "details": {
                        "ilm_status": "RUNNING",
                        "policies": 15
                    }
                }
            }
        },
        "snapshot": {
            "status": "green",
            "indicators": {
                "repository_integrity": {
                    "status": "green",
                    "summary": "No repositories configured."
                },
                "slm": {
                    "status": "green",
                    "summary": "No policies configured",
                    "details": {
                        "slm_status": "RUNNING",
                        "policies": 0
                    }
                }
            }
        }
    }
}

@elasticsearchmachine
Copy link
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@masseyke
Copy link
Member Author

Relates #85941

@masseyke masseyke marked this pull request as ready for review May 11, 2022 17:11
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label May 11, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@sethmlarson sethmlarson added the Team:Clients Meta label for clients team label May 11, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/clients-team (Team:Clients)

@masseyke masseyke marked this pull request as draft May 11, 2022 18:35
@masseyke masseyke marked this pull request as ready for review May 11, 2022 21:16
Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this Keith.

Wasn't sure if you'd want a review for this as no reviewer was selected but went through it and added some thoughts.

* @param explain Whether to calculate and include the details and user actions in the result
* @return The HealthIndicatorResult for the given localMasterHistory
*/
private HealthIndicatorResult calculateWhenHaveSeenMasterRecently(MasterHistory localMasterHistory, boolean explain) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this method indicates it's looking for the time something happened - the when. Which is not correct AFAICS. Should we name it differently? ie. onSeenMaster / checkIfCurrentMasterIsStable ? Or something along those lines?

* @param explain Whether to calculate and include the details in the result
* @return The HealthIndicatorResult for the given localMasterHistory
*/
private HealthIndicatorResult calculateWhenHaveNotSeenMasterRecently(MasterHistory localMasterHistory, boolean explain) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I think this should be named according to what it does/what it looks for as opposed to when it's called (the when in the name is confusing as it points to calculating the time when an event happened)

Maybe onNotSeenMaster / onNoLocalMaster or something similar?

|| MasterHistory.hasMasterGoneNullAtLeastNTimes(remoteHistory, acceptableNullTransitions + 1)
|| MasterHistory.getNumberOfMasterIdentityChanges(remoteHistory) > acceptableIdentityChanges;
if (masterConfirmedUnstable) {
if (localNodeIsMaster == false && remoteHistory == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if remoteHistory is null but there's an in-flight request to get it I believe the status should still be GREEN?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our design doc says we're OK with it being YELLOW here right?

* or changed identity repeatedly, then we have a problem (the master has confirmed what the local node saw).
*/
boolean masterConfirmedUnstable = localNodeIsMaster
|| remoteHistory == null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if remoteHistory is null but there's an in-flight request to get it I believe the status should still be GREEN?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had multiple offline conversations about this and decided to make two changes:

  1. We're introducing the notion of a remote history becoming stale (Timing out stale remote master history #86936)
  2. If the remote master history is unset or stale we'll return the status as GREEN

@andreidan
Copy link
Contributor

@DaveCTurner what do you think about using the disruption tests infrastructure to add some integration tests for this indicator?

@DaveCTurner
Copy link
Contributor

Yes we should have some integ tests for this.

I also suggest trying to add some tests that use the CoordinatorTests framework since that would let you simulate various disruptions plus the passage of time (which is important for this indicator) as well as being single-threaded and repeatable.

@masseyke masseyke requested a review from andreidan May 25, 2022 22:52
Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this Keith.

I left some comments.

I am wondering if it'd make sense to separate the "master stability diagnosis" part of this into its own service (that'll listen to cluster change events, and perform the needed diagnosis when requested).

I've recommended something similar for the shards_availability indicator and we should revise it there too (// cc @jbaiera )

Currently, the indicator is mixing the diagnosis logic with the representation of the diagnosis.
Modularising it a bit might aid with future requirements where we'd want (possibly) the diagnosing logic to be used in other transports/services.

What do you think?

Comment on lines 83 to 84
private static final int DEFAULT_ACCEPTABLE_NULL_TRANSITIONS = 3;
private static final int SMALLEST_ALLOWED_ACCEPTABLE_NULL_TRANSITIONS = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to move away from the acceptable/unacceptable language to having a single "threshold" concept?

eg. DEFAULT_NULL_TRANSITIONS_THRESHOLD = 4 (similar renaming for the corresponding setting and variable that reads it).
DEFAULT_IDENTITY_CHANGES_THRESHOD = 4 (similar renaming for the corresponding setting and variable that reads it).

I don't think we need constants for the min values

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to keep the wording in line with the diagram we have. But I guess it really doesn't matter because that diagram is not meant to be long-lived.


// This is the default amount of time we look back to see if we have had a master at all, before moving on with other checks
private static final TimeValue DEFAULT_VERY_RECENT_PAST = new TimeValue(30, TimeUnit.SECONDS);
private static final TimeValue SMALLEST_ALLOWED_VERY_RECENT_PAST = new TimeValue(1, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the naming - would it make sense to have the naming be more specific ?
ie. DEFAULT_VERY_RECENT_PAST -> NODE_HAS_MASTER_LOOKUP_TIMEFRAME or something similar?

Same for the variables and corresponding setting?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* @return An empty HealthIndicatorDetails if explain is false, otherwise a HealthIndicatorDetails containing only "current_master"
* and "recent_masters"
*/
private HealthIndicatorDetails getSimpleDetails(boolean explain, MasterHistory localMasterHistory) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would getDetails or getIndicatorDetails be more accurate? Not sure simple is adding information to the caller/method description.

What do you think?

for (DiscoveryNode recentMaster : recentMasters) {
if (recentMaster != null) {
builder.startObject();
builder.field("node_id", recentMaster.getId());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would printing the start timestamp be useful here too? (to indicate how quickly they changed)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no notion of timestamp here, mostly so that we don't leak out the notion of the machine's relative timestamps that don't mean anything outside of the JVM. You think we ought to add some public notion of time to https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/cluster/coordination/MasterHistory.java? That might be best in a couple of follow-up PRs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ Good point. Let's leave it as is for now and decide later (after we use the indicator a bit) if we need more details.

@andreidan andreidan requested a review from dakrone June 1, 2022 17:47
Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments but this generally looks good!

private static final TimeValue SMALLEST_ALLOWED_VERY_RECENT_PAST = new TimeValue(1, TimeUnit.SECONDS);

// This is the default number of times that it is OK to have a master go null. Any more than this will be reported as a problem
private static final int DEFAULT_ACCEPTABLE_NULL_TRANSITIONS = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need all of these as private static variables, we can put the comment and the actual value (3 in this case) in the setting itself, which is already public and static?

I prefer to have the hard value with the setting itself, so if you look at the Setting you don't have to do another reference jump to see the default value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

private static final String UNSTABLE_MASTER_BACKUP_IMPACT = "Snapshot and restore will not work.";

public static final Setting<TimeValue> VERY_RECENT_PAST_SETTING = Setting.timeSetting(
"health.master_history.very_recent_past",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a very descriptive name (just "very_recent_past"), I'd prefer something like "stability_window" to (hopefully?) clarify a little bit more what it does from the name alone.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to NODE_HAS_MASTER_LOOKUP_TIMEFRAME_SETTING based on feedback from Andrei.

);

public static final Setting<Integer> ACCEPTABLE_NULL_TRANSITIONS_SETTING = Setting.intSetting(
"health.master_history.acceptable_null_transitions",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should "leak" the word/concept of 'null' here, perhaps "acceptable_no_master_transitions" or "acceptable_none_transitions"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the setting name and the setting variable name, but left internal variable names and method names as they were b/c using the word "null" really makes it easier to follow internally. Let me know if you disagree.

HealthStatus stableMasterStatus = HealthStatus.YELLOW;
String summary = String.format(
Locale.ROOT,
"The master has changed %d times in the last %s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"The master has changed %d times in the last %s",
"The elected master node has changed %d times in the last %s",

}
});
return builder.endObject();
} : HealthIndicatorDetails.EMPTY;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should limit ternary operators to single-line statements, since it's difficult to parse as a human when it spans tens of lines.

Perhaps we could do something like:

if (explain == false) {
    return HealthIndicatorDetails.EMPTY;
} else {
    return <this big thing>;
}

|| MasterHistory.getNumberOfMasterIdentityChanges(remoteHistory) >= unacceptableIdentityChanges));
if (masterConfirmedUnstable) {
logger.trace("The master node {} thinks it is unstable", master);
final HealthStatus stableMasterStatus = HealthStatus.YELLOW;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this variable since it's only used a single place below passed in to the createIndicator method?

final HealthStatus stableMasterStatus = HealthStatus.YELLOW;
String summary = String.format(
Locale.ROOT,
"The cluster's master has alternated between %s and no master multiple times in the last %s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"The cluster's master has alternated between %s and no master multiple times in the last %s",
"The cluster's elected master node has alternated between %s and no elected master node multiple times in the last %s",

(I clarify here because I don't want to confuse a user between there being no master nodes in the cluster, and no elected master node in the cluster)

MasterHistory localMasterHistory,
@Nullable Exception remoteHistoryException
) {
return explain ? (builder, params) -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about keeping ternary things to single-line statements.

List<UserAction> userActions = List.of();
logger.trace("The cluster has a stable master node");
HealthIndicatorDetails details = getSimpleDetails(explain, localMasterHistory);
return createIndicator(stableMasterStatus, summary, details, impacts, userActions);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just pass HealthStatus.GREEN directly in here?

Collection<HealthIndicatorImpact> impacts = getUnstableMasterImpacts();
List<UserAction> userActions = getContactSupportUserActions(explain);
return createIndicator(
stableMasterStatus,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for passing HealthStatus.RED directly in here.

@masseyke masseyke requested review from dakrone and andreidan June 2, 2022 19:55
@masseyke
Copy link
Member Author

masseyke commented Jun 2, 2022

I am wondering if it'd make sense to separate the "master stability diagnosis" part of this into its own service (that'll listen to cluster change events, and perform the needed diagnosis when requested).

I've recommended something similar for the shards_availability indicator and we should revise it there too (// cc @jbaiera )

Currently, the indicator is mixing the diagnosis logic with the representation of the diagnosis. Modularising it a bit might aid with future requirements where we'd want (possibly) the diagnosing logic to be used in other transports/services.

What do you think?

@andreidan it sounds reasonable, but likely a very big change. What do you think about considering that for a follow-up PR (or PRs)?

Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for iterating on this Keith

Left a few nits but this looks great. Thanks for adding all the tests too ! 🚀

}

public void testRepeatedMasterIdentityChangesRecognizedAsUnstable() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could testRepeatedMasterChanges be the test? It's documented as a utility method but only used in this test

Comment on lines 88 to 92
private static final String UNSTABLE_MASTER_INGEST_IMPACT = "The cluster cannot create, delete, or rebalance indices, and cannot "
+ "insert or update documents.";
private static final String UNSTABLE_MASTER_DEPLOYMENT_MANAGEMENT_IMPACT = "Scheduled tasks such as Watcher, ILM, and SLM will not "
+ "work. The _cat APIs will not work.";
private static final String UNSTABLE_MASTER_BACKUP_IMPACT = "Snapshot and restore will not work.";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impacts have slightly changed in the instance has master indicator since this PR was opened.


// This is the default amount of time we look back to see if we have had a master at all, before moving on with other checks
private static final TimeValue NODE_HAS_MASTER_LOOKUP_TIMEFRAME = new TimeValue(30, TimeUnit.SECONDS);
private static final TimeValue SMALLEST_ALLOWED_HAS_MASTER_LOOKUP_TIMEFRAME = new TimeValue(1, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Should this be MIN_MASTER_LOOKUP_TIMEFRAME or hardcode the values in the setting definitions like we do with the others?

for (DiscoveryNode recentMaster : recentMasters) {
if (recentMaster != null) {
builder.startObject();
builder.field("node_id", recentMaster.getId());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ Good point. Let's leave it as is for now and decide later (after we use the indicator a bit) if we need more details.

* This returns true if this node has seen a master node within the last few seconds
* @return true if this node has seen a master node within the last few seconds, false otherwise
*/
private boolean hasSeenMasterInVeryRecentPast() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: now that we renamed the setting to be more precise, should we rename this method too to reflect that it's looking in the configured "had master lookup timeframe" ?

@masseyke masseyke merged commit c95230d into elastic:master Jun 6, 2022
@masseyke masseyke deleted the feature/health-api-master-stability-indicator branch June 6, 2022 21:07
masseyke added a commit that referenced this pull request Jun 22, 2022
… service (#87482)

This builds on #86524 by supporting two additional conditions, both of which happen when there has been
no elected master for more than 30 seconds (from the queried node's point of view), and both of which return
a RED status:
(1) There are no master-eligible nodes found in the cluster
(2) The node being queried sees a master-eligible node that has been elected the master, but cannot join it
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Health >feature Team:Clients Meta label for clients team Team:Data Management Meta label for data/management team v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants