Master stability health indicator part 1 (when a master has been seen recently) #86524

masseyke · 2022-05-06T16:01:51Z

The health indicator for master stability is very large. This is the first PR for the master stability check. It handles the case when we have seen a master node recently (the first two steps in the bulleted list below). The more complicated case when we have not seen a master node recently will be in a subsequent PR.
This indicator reports the health of master stability.

If we have had a master within the last 30 seconds, and that master has not changed more than 3 times in the last 30 minutes, then this will report GREEN.
If we have had a master within the last 30 seconds, but that master has changed more than 3 times in the last 30 minutes (and that is confirmed by checking with the last-known master), then this will report YELLOW.
If we have not had a master within the last 30 seconds, then this will report RED, with one exception. That exception is when all of the following apply:
- No node is elected master
- This node is not master eligible
- Some node is master eligible
- We ask a master-eligible node to run this indicator
- That master-eligible node comes back with a result that is not RED.
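
The three rules above can be read as a small decision function. The following is only a sketch of that reading, not the code in this PR: the class, method, and parameter names are hypothetical, and two behaviours are assumptions rather than statements from the description (an unconfirmed excess of master changes is treated as GREEN, and the exception case passes through the status reported by the master-eligible node).

```java
final class StableMasterDecisionSketch {
    enum Status { GREEN, YELLOW, RED }

    // All parameter names are hypothetical and exist only for this illustration.
    // remoteResult is the status a master-eligible node returned, or null if none was obtained.
    static Status classify(
        boolean masterSeenInLast30Seconds,
        int masterChangesInLast30Minutes,
        boolean confirmedByLastKnownMaster,
        boolean anyNodeElectedMaster,
        boolean thisNodeMasterEligible,
        boolean someNodeMasterEligible,
        Status remoteResult
    ) {
        if (masterSeenInLast30Seconds) {
            boolean tooManyChanges = masterChangesInLast30Minutes > 3;
            // Assumption: without confirmation from the last-known master we stay GREEN.
            return (tooManyChanges && confirmedByLastKnownMaster) ? Status.YELLOW : Status.GREEN;
        }
        // No master seen recently: RED unless every condition of the exception holds.
        boolean exceptionApplies = anyNodeElectedMaster == false
            && thisNodeMasterEligible == false
            && someNodeMasterEligible
            && remoteResult != null
            && remoteResult != Status.RED;
        // Assumption: in the exception case we pass through the remote node's status.
        return exceptionApplies ? remoteResult : Status.RED;
    }
}
```

For example, `classify(true, 4, true, true, true, true, null)` takes the YELLOW branch, which corresponds to the first sample response below.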

Since this indicator needs to be able to run when there is no master at all, it does not depend on the dedicated health node (which requires the existence of a master). The indicator in this PR replaces InstanceHasMasterHealthIndicatorService as the "pre-flight check" that is run before running any other health indicators.

Here is an example of a response when the master has changed more than 3 times in the last 30 minutes (so the stable_master indicator is yellow). Note that all other indicators respond with status "unknown" since we do not evaluate them if there is no stable master:

$ curl localhost:9200/_internal/_health | python3 -mjson.tool
{
    "status": "yellow",
    "cluster_name": "master-test",
    "components": {
        "cluster_coordination": {
            "status": "yellow",
            "indicators": {
                "stable_master": {
                    "status": "yellow",
                    "summary": "The master has changed 4 times in the last 30m",
                    "details": {
                        "recent_masters": [
                            {
                                "Bi7HMyQ8QmSk12bJk6RNiA": {
                                    "name": "master-node",
                                    "ephemeral_id": "PLml3YF1S5CpxR97Ttn2Tw",
                                    "transport_address": "127.0.0.1:9300",
                                    "external_id": "master-node",
                                    "attributes": {
                                        "xpack.installed": "true"
                                    },
                                    "roles": [
                                        "data",
                                        "master"
                                    ]
                                }
                            },
                            {
                                "Bi7HMyQ8QmSk12bJk6RNiA": {
                                    "name": "master-node",
                                    "ephemeral_id": "e56BVOZ_T1OFK7bmG0PTHQ",
                                    "transport_address": "127.0.0.1:9300",
                                    "external_id": "master-node",
                                    "attributes": {
                                        "xpack.installed": "true"
                                    },
                                    "roles": [
                                        "data",
                                        "master"
                                    ]
                                }
                            },
                            {
                                "Bi7HMyQ8QmSk12bJk6RNiA": {
                                    "name": "master-node",
                                    "ephemeral_id": "xQOwd4blR4GN3Kkx2jJB0Q",
                                    "transport_address": "127.0.0.1:9300",
                                    "external_id": "master-node",
                                    "attributes": {
                                        "xpack.installed": "true"
                                    },
                                    "roles": [
                                        "data",
                                        "master"
                                    ]
                                }
                            },
                            {
                                "Bi7HMyQ8QmSk12bJk6RNiA": {
                                    "name": "master-node",
                                    "ephemeral_id": "Frlk9HMVRL2OOPNQhg2E_w",
                                    "transport_address": "127.0.0.1:9300",
                                    "external_id": "master-node",
                                    "attributes": {
                                        "xpack.installed": "true"
                                    },
                                    "roles": [
                                        "data",
                                        "master"
                                    ]
                                }
                            },
                            {
                                "Bi7HMyQ8QmSk12bJk6RNiA": {
                                    "name": "master-node",
                                    "ephemeral_id": "n4ww22m9SHe-QC-xxenhGQ",
                                    "transport_address": "127.0.0.1:9300",
                                    "external_id": "master-node",
                                    "attributes": {
                                        "xpack.installed": "true"
                                    },
                                    "roles": [
                                        "data",
                                        "master"
                                    ]
                                }
                            }
                        ],
                        "current_master": {
                            "Bi7HMyQ8QmSk12bJk6RNiA": {
                                "name": "master-node",
                                "ephemeral_id": "n4ww22m9SHe-QC-xxenhGQ",
                                "transport_address": "127.0.0.1:9300",
                                "external_id": "master-node",
                                "attributes": {
                                    "xpack.installed": "true"
                                },
                                "roles": [
                                    "data",
                                    "master"
                                ]
                            }
                        }
                    },
                    "impacts": [
                        {
                            "severity": 1,
                            "description": "The cluster cannot create, delete, or rebalance indices, and cannot insert or update documents.",
                            "impact_areas": [
                                "ingest"
                            ]
                        },
                        {
                            "severity": 1,
                            "description": "Scheduled tasks such as Watcher, ILM, and SLM will not work. The _cat APIs will not work.",
                            "impact_areas": [
                                "deployment_management"
                            ]
                        },
                        {
                            "severity": 3,
                            "description": "Snapshot and restore will not work.",
                            "impact_areas": [
                                "backup"
                            ]
                        }
                    ],
                    "user_actions": [
                        {
                            "message": "The Elasticsearch cluster does not have a stable master node. This almost always requires expert assistance. Please contact Elastic support to resolve the problem."
                        }
                    ]
                }
            }
        },
        "data": {
            "status": "unknown",
            "indicators": {
                "shards_availability": {
                    "status": "unknown",
                    "summary": "Could not determine indicator state. Cluster state is not stable. Check details for critical issues keeping this indicator from running.",
                    "details": {
                        "reasons": {
                            "stable_master": "yellow"
                        }
                    }
                },
                "ilm": {
                    "status": "unknown",
                    "summary": "Could not determine indicator state. Cluster state is not stable. Check details for critical issues keeping this indicator from running.",
                    "details": {
                        "reasons": {
                            "stable_master": "yellow"
                        }
                    }
                }
            }
        },
        "snapshot": {
            "status": "unknown",
            "indicators": {
                "repository_integrity": {
                    "status": "unknown",
                    "summary": "Could not determine indicator state. Cluster state is not stable. Check details for critical issues keeping this indicator from running.",
                    "details": {
                        "reasons": {
                            "stable_master": "yellow"
                        }
                    }
                },
                "slm": {
                    "status": "unknown",
                    "summary": "Could not determine indicator state. Cluster state is not stable. Check details for critical issues keeping this indicator from running.",
                    "details": {
                        "reasons": {
                            "stable_master": "yellow"
                        }
                    }
                }
            }
        }
    }
}

And here is an example response when the master has been stable (note that the other indicators are evaluated):

$ curl localhost:9200/_internal/_health | python3 -mjson.tool
{
    "status": "green",
    "cluster_name": "master-test",
    "components": {
        "cluster_coordination": {
            "status": "green",
            "indicators": {
                "stable_master": {
                    "status": "green",
                    "summary": "The cluster has a stable master node",
                    "details": {
                        "recent_masters": [
                            {
                                "Bi7HMyQ8QmSk12bJk6RNiA": {
                                    "name": "master-node",
                                    "ephemeral_id": "n4ww22m9SHe-QC-xxenhGQ",
                                    "transport_address": "127.0.0.1:9300",
                                    "external_id": "master-node",
                                    "attributes": {
                                        "xpack.installed": "true"
                                    },
                                    "roles": [
                                        "data",
                                        "master"
                                    ]
                                }
                            }
                        ],
                        "current_master": {
                            "Bi7HMyQ8QmSk12bJk6RNiA": {
                                "name": "master-node",
                                "ephemeral_id": "n4ww22m9SHe-QC-xxenhGQ",
                                "transport_address": "127.0.0.1:9300",
                                "external_id": "master-node",
                                "attributes": {
                                    "xpack.installed": "true"
                                },
                                "roles": [
                                    "data",
                                    "master"
                                ]
                            }
                        }
                    }
                }
            }
        },
        "data": {
            "status": "green",
            "indicators": {
                "shards_availability": {
                    "status": "green",
                    "summary": "This cluster has all shards available.",
                    "details": {
                        "creating_primaries": 0,
                        "started_primaries": 1,
                        "unassigned_primaries": 0,
                        "initializing_replicas": 0,
                        "started_replicas": 1,
                        "initializing_primaries": 0,
                        "restarting_replicas": 0,
                        "restarting_primaries": 0,
                        "unassigned_replicas": 0
                    }
                },
                "ilm": {
                    "status": "green",
                    "summary": "ILM is running",
                    "details": {
                        "ilm_status": "RUNNING",
                        "policies": 15
                    }
                }
            }
        },
        "snapshot": {
            "status": "green",
            "indicators": {
                "repository_integrity": {
                    "status": "green",
                    "summary": "No repositories configured."
                },
                "slm": {
                    "status": "green",
                    "summary": "No policies configured",
                    "details": {
                        "slm_status": "RUNNING",
                        "policies": 0
                    }
                }
            }
        }
    }
}

elasticsearchmachine · 2022-05-06T16:02:14Z

Hi @masseyke, I've created a changelog YAML for you.

masseyke · 2022-05-11T15:51:34Z

Relates #85941

elasticmachine · 2022-05-11T17:11:49Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticmachine · 2022-05-11T17:28:55Z

Pinging @elastic/clients-team (Team:Clients)

andreidan

Thanks for working on this Keith.

Wasn't sure if you'd want a review for this as no reviewer was selected but went through it and added some thoughts.

andreidan · 2022-05-18T11:42:15Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+     * @param explain Whether to calculate and include the details and user actions in the result
+     * @return The HealthIndicatorResult for the given localMasterHistory
+     */
+    private HealthIndicatorResult calculateWhenHaveSeenMasterRecently(MasterHistory localMasterHistory, boolean explain) {


The name of this method indicates it's looking for the time something happened - the when. Which is not correct AFAICS. Should we name it differently? ie. onSeenMaster / checkIfCurrentMasterIsStable ? Or something along those lines?

andreidan · 2022-05-18T11:49:23Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+     * @param explain Whether to calculate and include the details in the result
+     * @return The HealthIndicatorResult for the given localMasterHistory
+     */
+    private HealthIndicatorResult calculateWhenHaveNotSeenMasterRecently(MasterHistory localMasterHistory, boolean explain) {


As above, I think this should be named according to what it does/what it looks for as opposed to when it's called (the when in the name is confusing as it points to calculating the time when an event happened)

Maybe onNotSeenMaster / onNoLocalMaster or something similar?

andreidan · 2022-05-18T11:54:42Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+            || MasterHistory.hasMasterGoneNullAtLeastNTimes(remoteHistory, acceptableNullTransitions + 1)
+            || MasterHistory.getNumberOfMasterIdentityChanges(remoteHistory) > acceptableIdentityChanges;
+        if (masterConfirmedUnstable) {
+            if (localNodeIsMaster == false && remoteHistory == null) {


if remoteHistory is null but there's an in-flight request to get it I believe the status should still be GREEN?

Our design doc says we're OK with it being YELLOW here right?

andreidan · 2022-05-18T11:55:47Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+         *  or changed identity repeatedly, then we have a problem (the master has confirmed what the local node saw).
+         */
+        boolean masterConfirmedUnstable = localNodeIsMaster
+            || remoteHistory == null


if remoteHistory is null but there's an in-flight request to get it I believe the status should still be GREEN?

We had multiple offline conversations about this and decided to make two changes:

- We're introducing the notion of a remote history becoming stale (Timing out stale remote master history #86936)
- If the remote master history is unset or stale we'll return the status as GREEN
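
A minimal sketch of the agreed behaviour, assuming the local node is not the master and has already seen too many changes itself. The `RemoteHistory` holder, the `staleAfterMillis` parameter, and the method name are hypothetical and are not taken from this PR or from #86936; the `>=` comparison mirrors the `unacceptableIdentityChanges` check quoted later in this review.

```java
final class RemoteHistoryStalenessSketch {
    enum Status { GREEN, YELLOW }

    // Hypothetical holder: what the remote master reported and when we fetched it.
    static final class RemoteHistory {
        final int identityChanges;
        final long fetchedAtMillis;

        RemoteHistory(int identityChanges, long fetchedAtMillis) {
            this.identityChanges = identityChanges;
            this.fetchedAtMillis = fetchedAtMillis;
        }
    }

    static Status statusAfterLocalInstability(
        RemoteHistory remote,
        long nowMillis,
        long staleAfterMillis,
        int unacceptableIdentityChanges
    ) {
        // Unset or stale remote history cannot confirm anything, so report GREEN.
        if (remote == null || nowMillis - remote.fetchedAtMillis > staleAfterMillis) {
            return Status.GREEN;
        }
        // A fresh remote history confirms the problem only if it also saw too many changes.
        return remote.identityChanges >= unacceptableIdentityChanges ? Status.YELLOW : Status.GREEN;
    }
}
```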

andreidan · 2022-05-18T11:59:35Z

@DaveCTurner what do you think about using the disruption tests infrastructure to add some integration tests for this indicator?

DaveCTurner · 2022-05-18T13:23:15Z

Yes we should have some integ tests for this.

I also suggest trying to add some tests that use the CoordinatorTests framework since that would let you simulate various disruptions plus the passage of time (which is important for this indicator) as well as being single-threaded and repeatable.
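
To illustrate why simulated time matters here, this is a rough sketch of testing a "seen a master recently" check against a hand-advanced clock. It does not use the CoordinatorTests framework; `FakeClock` and `MasterSightings` are hypothetical stand-ins written only for this example.

```java
import java.util.function.LongSupplier;

final class FakeClockSketch {
    // A clock that only moves when the test tells it to, so runs are repeatable.
    static final class FakeClock implements LongSupplier {
        private long nowMillis;

        @Override
        public long getAsLong() {
            return nowMillis;
        }

        void advance(long millis) {
            nowMillis += millis;
        }
    }

    // Hypothetical tracker of the last time this node saw an elected master.
    static final class MasterSightings {
        private final LongSupplier clock;
        private boolean everSeen;
        private long lastSeenMillis;

        MasterSightings(LongSupplier clock) {
            this.clock = clock;
        }

        void onMasterSeen() {
            everSeen = true;
            lastSeenMillis = clock.getAsLong();
        }

        boolean hasSeenMasterWithin(long windowMillis) {
            return everSeen && clock.getAsLong() - lastSeenMillis <= windowMillis;
        }
    }
}
```

A test can then call `onMasterSeen()`, advance the clock by 29 seconds and by a further 2 seconds, and check that the 30-second window flips from true to false without any real waiting.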

andreidan

Thanks for working on this Keith.

I left some comments.

I am wondering if it'd make sense to separate the "master stability diagnosis" part of this into its own service (that'll listen to cluster change events, and perform the needed diagnosis when requested).

I've recommended something similar for the shards_availability indicator and we should revise it there too (// cc @jbaiera )

Currently, the indicator is mixing the diagnosis logic with the representation of the diagnosis.
Modularising it a bit might aid with future requirements where we'd want (possibly) the diagnosing logic to be used in other transports/services.

What do you think?

andreidan · 2022-06-01T17:26:33Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+    private static final int DEFAULT_ACCEPTABLE_NULL_TRANSITIONS = 3;
+    private static final int SMALLEST_ALLOWED_ACCEPTABLE_NULL_TRANSITIONS = 0;


Would it make sense to move away from the acceptable/unacceptable language to having a single "threshold" concept?

eg. DEFAULT_NULL_TRANSITIONS_THRESHOLD = 4 (similar renaming for the corresponding setting and variable that reads it).
DEFAULT_IDENTITY_CHANGES_THRESHOLD = 4 (similar renaming for the corresponding setting and variable that reads it).

I don't think we need constants for the min values

I was trying to keep the wording in line with the diagram we have. But I guess it really doesn't matter because that diagram is not meant to be long-lived.
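
For reference, the two phrasings describe the same check as long as the threshold is one more than the acceptable count (3 acceptable transitions versus a threshold of 4). A small sketch with hypothetical method names:

```java
final class ThresholdNamingSketch {
    // "Acceptable" wording: a problem only when we exceed the acceptable count.
    static boolean unstableByAcceptable(int observed, int acceptable) {
        return observed > acceptable;
    }

    // "Threshold" wording: a problem as soon as we reach the threshold.
    static boolean unstableByThreshold(int observed, int threshold) {
        return observed >= threshold;
    }
}
```

With `acceptable = 3` and `threshold = 4` both methods agree for every observed count, so the rename changes the default constant but not the behaviour.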

andreidan · 2022-06-01T17:28:46Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+
+    // This is the default amount of time we look back to see if we have had a master at all, before moving on with other checks
+    private static final TimeValue DEFAULT_VERY_RECENT_PAST = new TimeValue(30, TimeUnit.SECONDS);
+    private static final TimeValue SMALLEST_ALLOWED_VERY_RECENT_PAST = new TimeValue(1, TimeUnit.SECONDS);


On the naming - would it make sense to have the naming be more specific ?
ie. DEFAULT_VERY_RECENT_PAST -> NODE_HAS_MASTER_LOOKUP_TIMEFRAME or something similar?

Same for the variables and corresponding setting?

andreidan · 2022-06-01T17:32:26Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+     * @return An empty HealthIndicatorDetails if explain is false, otherwise a HealthIndicatorDetails containing only "current_master"
+     * and "recent_masters"
+     */
+    private HealthIndicatorDetails getSimpleDetails(boolean explain, MasterHistory localMasterHistory) {


Would getDetails or getIndicatorDetails be more accurate? Not sure simple is adding information to the caller/method description.

What do you think?

andreidan · 2022-06-01T17:33:24Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+                for (DiscoveryNode recentMaster : recentMasters) {
+                    if (recentMaster != null) {
+                        builder.startObject();
+                        builder.field("node_id", recentMaster.getId());


Would printing the start timestamp be useful here too? (to indicate how quickly they changed)

There's no notion of timestamp here, mostly so that we don't leak out the notion of the machine's relative timestamps that don't mean anything outside of the JVM. You think we ought to add some public notion of time to https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/cluster/coordination/MasterHistory.java? That might be best in a couple of follow-up PRs?

++ Good point. Let's leave it as is for now and decide later (after we use the indicator a bit) if we need more details.

dakrone

I left some comments but this generally looks good!

dakrone · 2022-06-01T21:10:58Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+    private static final TimeValue SMALLEST_ALLOWED_VERY_RECENT_PAST = new TimeValue(1, TimeUnit.SECONDS);
+
+    // This is the default number of times that it is OK to have a master go null. Any more than this will be reported as a problem
+    private static final int DEFAULT_ACCEPTABLE_NULL_TRANSITIONS = 3;


I don't think we need all of these as private static variables, we can put the comment and the actual value (3 in this case) in the setting itself, which is already public and static?

I prefer to have the hard value with the setting itself, so if you look at the Setting you don't have to do another reference jump to see the default value.

dakrone · 2022-06-01T21:12:47Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+    private static final String UNSTABLE_MASTER_BACKUP_IMPACT = "Snapshot and restore will not work.";
+
+    public static final Setting<TimeValue> VERY_RECENT_PAST_SETTING = Setting.timeSetting(
+        "health.master_history.very_recent_past",


I don't think this is a very descriptive name (just "very_recent_past"), I'd prefer something like "stability_window" to (hopefully?) clarify a little bit more what it does from the name alone.

Changed to NODE_HAS_MASTER_LOOKUP_TIMEFRAME_SETTING based on feedback from Andrei.

dakrone · 2022-06-01T21:13:19Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+    );
+
+    public static final Setting<Integer> ACCEPTABLE_NULL_TRANSITIONS_SETTING = Setting.intSetting(
+        "health.master_history.acceptable_null_transitions",


I don't think we should "leak" the word/concept of 'null' here, perhaps "acceptable_no_master_transitions" or "acceptable_none_transitions"?

I changed the setting name and the setting variable name, but left internal variable names and method names as they were b/c using the word "null" really makes it easier to follow internally. Let me know if you disagree.

dakrone · 2022-06-01T21:15:28Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+        HealthStatus stableMasterStatus = HealthStatus.YELLOW;
+        String summary = String.format(
+            Locale.ROOT,
+            "The master has changed %d times in the last %s",


Suggested change

"The master has changed %d times in the last %s",

"The elected master node has changed %d times in the last %s",

dakrone · 2022-06-01T21:18:00Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+                }
+            });
+            return builder.endObject();
+        } : HealthIndicatorDetails.EMPTY;


I think we should limit ternary operators to single-line statements, since it's difficult to parse as a human when it spans tens of lines.

Perhaps we could do something like:

if (explain == false) { return HealthIndicatorDetails.EMPTY; } else { return <this big thing>; }
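
Spelled out a little further, the shape could look like the sketch below. The `Details` interface and `render` helper are hypothetical stand-ins for HealthIndicatorDetails and the XContent builder, used only so the example is self-contained.

```java
final class EarlyReturnSketch {
    // Hypothetical stand-in for HealthIndicatorDetails.
    interface Details {
        void writeTo(StringBuilder out);
    }

    static final Details EMPTY = out -> {};

    static Details getDetails(boolean explain, String currentMasterName) {
        // Early return keeps the short case out of the way of the long one.
        if (explain == false) {
            return EMPTY;
        }
        return out -> {
            out.append("current_master=");
            out.append(currentMasterName);
        };
    }

    // Helper for the example: materialises the details as a string.
    static String render(Details details) {
        StringBuilder out = new StringBuilder();
        details.writeTo(out);
        return out.toString();
    }
}
```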

dakrone · 2022-06-01T21:22:16Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+                    || MasterHistory.getNumberOfMasterIdentityChanges(remoteHistory) >= unacceptableIdentityChanges));
+        if (masterConfirmedUnstable) {
+            logger.trace("The master node {} thinks it is unstable", master);
+            final HealthStatus stableMasterStatus = HealthStatus.YELLOW;


I don't think we need this variable since it's only used a single place below passed in to the createIndicator method?

dakrone · 2022-06-01T21:24:04Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+            final HealthStatus stableMasterStatus = HealthStatus.YELLOW;
+            String summary = String.format(
+                Locale.ROOT,
+                "The cluster's master has alternated between %s and no master multiple times in the last %s",


Suggested change

"The cluster's master has alternated between %s and no master multiple times in the last %s",

"The cluster's elected master node has alternated between %s and no elected master node multiple times in the last %s",

(I clarify here because I don't want to confuse a user between there being no master nodes in the cluster, and no elected master node in the cluster)

dakrone · 2022-06-01T21:25:26Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+        MasterHistory localMasterHistory,
+        @Nullable Exception remoteHistoryException
+    ) {
+        return explain ? (builder, params) -> {


Same comment here about keeping ternary things to single-line statements.

dakrone · 2022-06-01T21:26:00Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+        List<UserAction> userActions = List.of();
+        logger.trace("The cluster has a stable master node");
+        HealthIndicatorDetails details = getSimpleDetails(explain, localMasterHistory);
+        return createIndicator(stableMasterStatus, summary, details, impacts, userActions);


I think you can just pass HealthStatus.GREEN directly in here?

dakrone · 2022-06-01T21:26:15Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+        Collection<HealthIndicatorImpact> impacts = getUnstableMasterImpacts();
+        List<UserAction> userActions = getContactSupportUserActions(explain);
+        return createIndicator(
+            stableMasterStatus,


Same for passing HealthStatus.RED directly in here.

masseyke · 2022-06-02T20:04:27Z

I am wondering if it'd make sense to separate the "master stability diagnosis" part of this into its own service (that'll listen to cluster change events, and perform the needed diagnosis when requested).

I've recommended something similar for the shards_availability indicator and we should revise it there too (// cc @jbaiera )

Currently, the indicator is mixing the diagnosis logic with the representation of the diagnosis. Modularising it a bit might aid with future requirements where we'd want (possibly) the diagnosing logic to be used in other transports/services.

What do you think?

@andreidan it sounds reasonable, but likely a very big change. What do you think about considering that for a follow-up PR (or PRs)?

andreidan

LGTM, thanks for iterating on this Keith

Left a few nits but this looks great. Thanks for adding all the tests too ! 🚀

andreidan · 2022-06-06T12:23:35Z

server/src/internalClusterTest/java/org/elasticsearch/discovery/StableMasterDisruptionIT.java

    }

+    public void testRepeatedMasterIdentityChangesRecognizedAsUnstable() throws Exception {


Could testRepeatedMasterChanges be the test? It's documented as a utility method but only used in this test

andreidan · 2022-06-06T12:35:45Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+    private static final String UNSTABLE_MASTER_INGEST_IMPACT = "The cluster cannot create, delete, or rebalance indices, and cannot "
+        + "insert or update documents.";
+    private static final String UNSTABLE_MASTER_DEPLOYMENT_MANAGEMENT_IMPACT = "Scheduled tasks such as Watcher, ILM, and SLM will not "
+        + "work. The _cat APIs will not work.";
+    private static final String UNSTABLE_MASTER_BACKUP_IMPACT = "Snapshot and restore will not work.";


Impacts have slightly changed in the instance has master indicator since this PR was opened.

andreidan · 2022-06-06T12:36:25Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+
+    // This is the default amount of time we look back to see if we have had a master at all, before moving on with other checks
+    private static final TimeValue NODE_HAS_MASTER_LOOKUP_TIMEFRAME = new TimeValue(30, TimeUnit.SECONDS);
+    private static final TimeValue SMALLEST_ALLOWED_HAS_MASTER_LOOKUP_TIMEFRAME = new TimeValue(1, TimeUnit.SECONDS);


nit: Should this be MIN_MASTER_LOOKUP_TIMEFRAME or hardcode the values in the setting definitions like we do with the others?

andreidan · 2022-06-06T12:38:46Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+                for (DiscoveryNode recentMaster : recentMasters) {
+                    if (recentMaster != null) {
+                        builder.startObject();
+                        builder.field("node_id", recentMaster.getId());


++ Good point. Let's leave it as is for now and decide later (after we use the indicator a bit) if we need more details.

andreidan · 2022-06-06T12:40:54Z

...src/main/java/org/elasticsearch/cluster/coordination/StableMasterHealthIndicatorService.java

+     * This returns true if this node has seen a master node within the last few seconds
+     * @return true if this node has seen a master node within the last few seconds, false otherwise
+     */
+    private boolean hasSeenMasterInVeryRecentPast() {


nit: now that we renamed the setting to be more precise, should we rename this method too to reflect that it's looking in the configured "had master lookup timeframe" ?

… service (#87482) This builds on #86524 by supporting two additional conditions, both of which happen when there has been no elected master for more than 30 seconds (from the queried node's point of view), and both of which return a RED status: (1) There are no master-eligible nodes found in the cluster (2) The node being queried sees a master-eligible node that has been elected the master, but cannot join it

Initial commit for master stability health indicator (when a master h…

5198ff2

…as been seen recently)

masseyke added >feature :Data Management/Health v8.3.0 labels May 6, 2022

Update docs/changelog/86524.yaml

b477969

masseyke added 8 commits May 6, 2022 11:03

reverting accidental change

c97ff8d

Merge branch 'feature/health-api-master-stability-indicator' of github.com:masseyke/elasticsearch into feature/health-api-master-stability-indicator

75f5f78

Replacing InstanceHasMasterHealthIndicatorService

08f0fee

merging master

7156054

Renamed includeDetails to explain, and fixed integration tests

9ee8abe

cleaning up

80c2b00

cleanup

ddb6983

cleanup

bfc6dde

changing version on yaml test

4a59bc8

masseyke marked this pull request as ready for review May 11, 2022 17:11

elasticmachine added the Team:Data Management label May 11, 2022

sethmlarson added the Team:Clients label May 11, 2022

masseyke marked this pull request as draft May 11, 2022 18:35

masseyke marked this pull request as ready for review May 11, 2022 21:16

andreidan reviewed May 18, 2022

masseyke added 4 commits May 18, 2022 10:13

code review feedback

20a9036

code review feedback

e5e647a

Adding tests

63e000e

Returning status of GREEN on null remote master history

f98f18c

masseyke added 5 commits May 25, 2022 13:49

fixing bad merge

be3c0ef

Merge branch 'master' into feature/health-api-master-stability-indicator

7e65738

fixing merge after 8.3.0 release

f0b9f68

fixing merge after 8.3.0 release

738a168

removing accidental commit

3a14aa7

craigtaverner added v8.4.0 and removed v8.3.0 labels May 25, 2022

masseyke requested a review from andreidan May 25, 2022 22:52

Merge branch 'master' into feature/health-api-master-stability-indicator

0102cdd

andreidan reviewed Jun 1, 2022

andreidan requested a review from dakrone June 1, 2022 17:47

code review feedback

07bb192

dakrone reviewed Jun 1, 2022

code review feedback

4611752

masseyke requested review from dakrone and andreidan June 2, 2022 19:55

fixing stability integration test

360af7b

masseyke mentioned this pull request Jun 3, 2022

Making HealthIndicatorResult Writeable so it can be used in a transport action #87388

Closed

fixing stability of integration test

c3f8838

andreidan approved these changes Jun 6, 2022

code review feedback

9d4706a

masseyke merged commit c95230d into elastic:master Jun 6, 2022

masseyke deleted the feature/health-api-master-stability-indicator branch June 6, 2022 21:07

This was referenced Jun 7, 2022

Adding additional capability to the master_is_stable health indicator service #87482

Merged

Cluster coordination indicator - report if the master is stable and an impact/troubleshoot guide otherwise #85624

Closed

masseyke mentioned this pull request Jun 29, 2022

Adding logic to master_is_stable indicator to check for discovery problems #88020

Merged

masseyke mentioned this pull request Aug 9, 2022

Adding a check to the master stability health API when there is no master and the current node is not master eligible #89219

Merged

		private static final int DEFAULT_ACCEPTABLE_NULL_TRANSITIONS = 3;
		private static final int SMALLEST_ALLOWED_ACCEPTABLE_NULL_TRANSITIONS = 0;

	"The master has changed %d times in the last %s",
	"The elected master node has changed %d times in the last %s",

	"The cluster's master has alternated between %s and no master multiple times in the last %s",
	"The cluster's elected master node has alternated between %s and no elected master node multiple times in the last %s",

		}
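The fragments above distinguish two kinds of instability: the elected master changing identity, and the master alternating with "no master". A minimal sketch of how the two counts could be derived from a recorded history follows. It assumes the history is an ordered list of master node ids in which `null` means no master was elected; the class `MasterHistoryCounts` and the exact counting rules are assumptions made for illustration, not taken from the Elasticsearch source.

```java
import java.util.List;

// Hypothetical sketch: derives two instability counts from an ordered master history,
// where each entry is a master node id and null means "no elected master".
final class MasterHistoryCounts {

    // Counts how often the non-null master differs from the previous non-null master.
    // Null entries are skipped, so [a, null, a] counts as zero identity changes.
    static int identityChanges(List<String> history) {
        int changes = 0;
        String lastMaster = null;
        for (String master : history) {
            if (master == null) {
                continue;
            }
            if (lastMaster != null && !lastMaster.equals(master)) {
                changes++;
            }
            lastMaster = master;
        }
        return changes;
    }

    // Counts transitions from some master to no master.
    static int nullTransitions(List<String> history) {
        int transitions = 0;
        String previous = null;
        for (String master : history) {
            if (master == null && previous != null) {
                transitions++;
            }
            previous = master;
        }
        return transitions;
    }
}
```

For the history `a, null, a, null, b` (built with `Arrays.asList`, since `List.of` rejects nulls), this gives one identity change and two null transitions. Each count would then be compared against its own threshold, such as the default of 3 acceptable null transitions shown in the constants above.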

		public void testRepeatedMasterIdentityChangesRecognizedAsUnstable() throws Exception {
