Adding a check to the master stability health API when there is no master and the current node is not master eligible #89219

masseyke · 2022-08-09T16:48:54Z

This PR builds on #86524, #87482, and #87306 by supporting the case where there has been no master node in the last 30 seconds, no node has been elected master, and the current node is not master eligible. This is branch 1.2.2.3 in the diagram at #87482 (comment).
The outline of the logic is that when we see that the master node has gone null, we start polling a random master-eligible node for its CoordinationDiagnosticsResult. Once a diagnoseMasterStability() request comes in we look at the CoordinationDiagnosticsResult from the master eligible node, and the result is one of the following:

(1) We do not have a result yet from a master-eligible node, so we report that with a RED status (not in the diagram).
(2) We have received a non-GREEN status from the master-eligible node, and we return that status (1.2.2.3.1).
(3) We have received a GREEN status from the master-eligible node, and we return RED because something is wrong with discovery (1.2.2.3.2).
(4) We have timed out or received some exception trying to get the remote master-eligible node's result, and we return RED (1.2.2.3.3).
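
The four outcomes above can be sketched as a small decision function. This is only an illustration: `Status`, `RemoteOutcome`, and `NonMasterEligibleDiagnosis` are hypothetical, simplified stand-ins and not the actual classes in CoordinationDiagnosticsService.

```java
// Hypothetical, simplified stand-ins for the real Elasticsearch types.
enum Status { GREEN, YELLOW, RED }

// What the polled master-eligible node gave us: a status, an exception, or neither yet.
final class RemoteOutcome {
    final Status remoteStatus; // null if no status was received
    final Exception failure;   // null if no exception was received

    RemoteOutcome(Status remoteStatus, Exception failure) {
        this.remoteStatus = remoteStatus;
        this.failure = failure;
    }
}

final class NonMasterEligibleDiagnosis {
    // Maps the remote outcome to the status this node reports.
    // A null outcome means no poll has completed yet.
    static Status diagnose(RemoteOutcome outcome) {
        if (outcome == null || (outcome.remoteStatus == null && outcome.failure == null)) {
            return Status.RED; // (1) no result yet
        }
        if (outcome.failure != null) {
            return Status.RED; // (4) timeout or other exception, 1.2.2.3.3
        }
        if (outcome.remoteStatus == Status.GREEN) {
            return Status.RED; // (3) remote node is fine, so discovery is broken here, 1.2.2.3.2
        }
        return outcome.remoteStatus; // (2) pass the non-GREEN status through, 1.2.2.3.1
    }
}
```

Only case (2) can yield anything other than RED, because it forwards whatever non-GREEN status the remote node reported.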

(1) No result yet:

{
    "status": "red",
    "cluster_name": "TEST-TEST_WORKER_VM=[--not-gradle--]-CLUSTER_SEED=[4061946227131368569]-HASH=[3A9A1FBC202B3]-cluster",
    "indicators": {
        "master_is_stable": {
            "status": "red",
            "symptom": "No master node observed in the last 1s, and this node is not master eligible. Reaching out to a master-eligible node for more information, but no result yet.",
            "details": {
                "current_master": {
                    "node_id": null,
                    "name": null
                }
            },
            "impacts": [...],
            "diagnosis": [
                {
                    "cause": "The Elasticsearch cluster does not have a stable master node.",
                    "action": "Get help at https://ela.st/getting-help",
                    "help_url": "https://ela.st/getting-help"
                }
            ]
        },
        ...
    }
}

(2) Non-green status:

{
    "status": "red",
    "cluster_name": "TEST-TEST_WORKER_VM=[--not-gradle--]-CLUSTER_SEED=[4061946227131368569]-HASH=[3A9A1FBC202B3]-cluster",
    "indicators": {
        "master_is_stable": {
            "status": "red",
            "symptom": "No master node observed in the last 1s, and the master eligible nodes are unable to form a quorum",
            "details": {
                "current_master": {
                    "node_id": null,
                    "name": null
                },
                "recent_masters": [
                    {
                        "node_id": "e2ljvaXGRk6ysYlGnbJ5xg",
                        "name": "node_t0"
                    }
                ],
                "cluster_formation": {
                    "br6axD4fT0Osdz2ZafjB7A": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [0dVvMncjRviHN16CoeEOCQ, br6axD4fT0Osdz2ZafjB7A, e2ljvaXGRk6ysYlGnbJ5xg], have only discovered non-quorum [{node_t1}{br6axD4fT0Osdz2ZafjB7A}{QcxUxkP3QjSSZVmdfCmi-A}{node_t1}{127.0.0.1}{127.0.0.1:13303}{m}]; discovery will continue using [127.0.0.1:13302, 127.0.0.1:13301] from hosts providers and [{node_t0}{e2ljvaXGRk6ysYlGnbJ5xg}{cQZ6QvlXQpend79JaIgorw}{node_t0}{127.0.0.1}{127.0.0.1:13302}{m}, {node_t1}{br6axD4fT0Osdz2ZafjB7A}{QcxUxkP3QjSSZVmdfCmi-A}{node_t1}{127.0.0.1}{127.0.0.1:13303}{m}, {node_t2}{0dVvMncjRviHN16CoeEOCQ}{oUeLN3g_Si6MsBxIUgR9Xg}{node_t2}{127.0.0.1}{127.0.0.1:13301}{m}] from last-known cluster state; node term 1, last-accepted version 6 in term 1"
                }
            },
            "impacts": [...],
            "diagnosis": [...]
        },
        ...
    }
}

(3) Green remote status:

{
    "status": "red",
    "cluster_name": "TEST-TEST_WORKER_VM=[--not-gradle--]-CLUSTER_SEED=[4061946227131368569]-HASH=[3A9A1FBC202B3]-cluster",
    "indicators": {
        "master_is_stable": {
            "status": "red",
            "symptom": "No master node observed in the last 1s from this node, but node_t2 reports that the status is GREEN. This indicates that there is a discovery problem on node_t4",
            "details": {
                "current_master": {
                    "node_id": "e2ljvaXGRk6ysYlGnbJ5xg",
                    "name": "node_t0"
                },
                "recent_masters": [
                    {
                        "node_id": "e2ljvaXGRk6ysYlGnbJ5xg",
                        "name": "node_t0"
                    }
                ]
            },
            "impacts": [...],
            "diagnosis": [...]
        },
        ...
    }
}

(4) Timeout exception:

{
    "status": "red",
    "cluster_name": "TEST-TEST_WORKER_VM=[--not-gradle--]-CLUSTER_SEED=[4061946227131368569]-HASH=[3A9A1FBC202B3]-cluster",
    "indicators": {
        "master_is_stable": {
            "status": "red",
            "symptom": "No master node observed in the last 1s from this node, and received an exception while reaching out to node_t2 for diagnosis",
            "details": {
                "current_master": {
                    "node_id": null,
                    "name": null
                },
                "recent_masters": [
                    {
                        "node_id": "e2ljvaXGRk6ysYlGnbJ5xg",
                        "name": "node_t0"
                    }
                ],
                "exception_fetching_history": {
                    "message": "[node_t2][127.0.0.1:13301][internal:cluster/coordination_diagnostics/info] request_id [32] timed out after [13143ms]",
                    "stack_trace": "org.elasticsearch.transport.ReceiveTimeoutTransportException: [node_t2][127.0.0.1:13301][internal:cluster/coordination_diagnostics/info] request_id [32] timed out after [13143ms]\n"
                }
            },
            "impacts": [...],
            "diagnosis": [...]
        },
        ...
    }
}

elasticsearchmachine · 2022-08-09T16:49:18Z

Hi @masseyke, I've created a changelog YAML for you.

elasticsearchmachine · 2022-08-09T19:34:18Z

Pinging @elastic/es-data-management (Team:Data Management)

andreidan · 2022-08-11T12:54:03Z

@masseyke Just a note: the description says "and the current node is master eligible", which is not correct. The branch we're working on now is the one where the current node is NOT master eligible.

andreidan

Thanks for working on this Keith

Left a few minor suggestions

andreidan · 2022-08-11T13:46:22Z

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java

+            summary = String.format(
+                Locale.ROOT,
+                "No master node observed in the last %s, and this node is not master eligible. Reaching out to a master-eligible node"
+                    + " for more information, but no result yet.",


Should we say that we couldn't diagnose further because we couldn't reach a master-eligible node?

The "no result yet" bit implies that we're still working on it, or that a different result is coming somehow.

It's polling random master eligible nodes for information, and we will get a result from one in the future (even if that result is TimedOutException or equivalent). If there were no master eligible nodes at all we would have reported that in another branch. The only way we'd get here and not have a result coming in the future is if we have a bug in the code (like the one fixed by #89014 -- it actually leads to this message being given with no result ever coming from a remote node).
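
A minimal sketch of the polling behaviour described here, assuming a caller-supplied `fetch` function in place of the real transport request. `RemotePoller` and `Outcome` are hypothetical names, and the scheduling of repeated rounds is left out.

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

final class RemotePoller {
    // Either the remote answer or the exception raised while fetching it.
    static final class Outcome {
        final String result;
        final Exception failure;

        Outcome(String result, Exception failure) {
            this.result = result;
            this.failure = failure;
        }
    }

    // Stays null until the first polling round completes ("no result yet").
    private final AtomicReference<Outcome> latest = new AtomicReference<>();
    private final Random random;

    RemotePoller(Random random) {
        this.random = random;
    }

    // One polling round: pick a random master-eligible node and record whatever comes back.
    void pollOnce(List<String> masterEligibleNodes, Function<String, String> fetch) {
        if (masterEligibleNodes.isEmpty()) {
            return; // per the thread, the "no master-eligible nodes" case is reported by another branch
        }
        String node = masterEligibleNodes.get(random.nextInt(masterEligibleNodes.size()));
        try {
            latest.set(new Outcome(fetch.apply(node), null));
        } catch (RuntimeException e) {
            latest.set(new Outcome(null, e)); // a failure still counts as a result
        }
    }

    Outcome latest() {
        return latest.get();
    }
}
```

Because a failed fetch is stored as an outcome too, `latest()` stops returning null after the first completed round, which matches the point that a result always arrives eventually.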

Ah sure. I'm not sure we want to propagate this message in the health API though, as the API is request/response. With that in mind I think it'd make sense to tell the API user "we couldn't gather more information" as opposed to an ominous "not yet" :)
I understand that the diagnostics service runs continuously; however, we seem to pass this summary on in the health API, so we should either:

make sure the message makes sense in the API too (which is what I was proposing)

keep the message as is (it makes sense in the context of the diagnostics service) but parse and rewrite it in the API so that it makes sense there.

Ha I thought the "not yet" made it more optimistic (check back later and we might know more) than more ominous, but I can drop it.

andreidan · 2022-08-11T13:46:54Z

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java

+            if (explain) {
+                details = CoordinationDiagnosticsDetails.EMPTY;
+            } else {
+                details = CoordinationDiagnosticsDetails.EMPTY;
+            }


details always stays empty here. Do we want to return the "local details" if explain is true, i.e. the details section with the local master history and such?

Yeah might as well -- based on the if/else doing the same thing I can only assume I must have intended to do that and then... I'm not sure why I didn't.
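
One way the two branches could differ, as a sketch only: `DetailsSketch` and its `Details` record are hypothetical, trimmed-down stand-ins for CoordinationDiagnosticsDetails, and the fields shown are assumptions rather than the real ones.

```java
import java.util.List;

final class DetailsSketch {
    // Hypothetical stand-in for CoordinationDiagnosticsDetails with only two fields.
    record Details(String currentMaster, List<String> recentMasters) {
        static final Details EMPTY = new Details(null, List.of());
    }

    // Returns the locally known details only when the caller asked for an explanation.
    static Details detailsFor(boolean explain, String currentMaster, List<String> localMasterHistory) {
        if (explain) {
            return new Details(currentMaster, List.copyOf(localMasterHistory));
        }
        return Details.EMPTY;
    }
}
```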

andreidan · 2022-08-11T13:49:09Z

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java

+                    details = CoordinationDiagnosticsDetails.EMPTY;
+                }
+            } else {
+                // It should not be possible to get here


Should this branch even exist?

I could change it to throw an AssertionError instead I guess. I just figured that eventually somehow we'd get into the impossible situation, and better to return something.

Actually how about I assert that we're not there so that it'll fail if we have assertions enabled, but I'll return what I'm currently returning so that it won't blow up on the user if we've somehow introduced a bug and missed it in testing.
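
A sketch of that "assert, but still return something" approach, using hypothetical names (`ImpossibleBranchSketch`, `summarize`) and made-up message strings. Java only evaluates `assert` when the JVM runs with assertions enabled (`-ea`), so production users would get the fallback message instead of an error.

```java
final class ImpossibleBranchSketch {
    // Builds a summary from the remote result or the remote failure.
    static String summarize(String remoteResult, Exception remoteFailure) {
        if (remoteResult != null) {
            return "remote result: " + remoteResult;
        } else if (remoteFailure != null) {
            return "remote failure: " + remoteFailure.getMessage();
        } else {
            // Fails fast in tests run with -ea; skipped when assertions are disabled.
            assert false : "expected either a remote result or a remote failure";
            return "no further information is available";
        }
    }
}
```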

But what is the else?

RemoteMasterHealthResult must have a result or an exception. If this is not enforced, let's enforce it in the constructor there?
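
Enforcing that invariant at construction time might look like the sketch below. `RemoteResultSketch.RemoteResult` is a hypothetical stand-in for RemoteMasterHealthResult, and its field names and types are assumptions.

```java
final class RemoteResultSketch {
    // Hypothetical stand-in for RemoteMasterHealthResult.
    record RemoteResult(String node, String result, Exception failure) {
        // Compact constructor: reject instances that have neither or both of result and failure.
        RemoteResult {
            if (node == null) {
                throw new IllegalArgumentException("node must not be null");
            }
            if ((result == null) == (failure == null)) {
                throw new IllegalArgumentException("exactly one of result or failure must be set");
            }
        }
    }
}
```

With the check in the constructor, code that consumes the record can treat the "neither result nor exception" branch as unreachable.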

andreidan

LGTM thanks Keith

Adding a check to the master stability health API when there is no master and the current node is not master eligible

5be323f

masseyke added >enhancement :Data Management/Health v8.5.0 labels Aug 9, 2022

Update docs/changelog/89219.yaml

d5b8f09

masseyke marked this pull request as ready for review August 9, 2022 19:33

masseyke requested a review from andreidan August 9, 2022 19:34

elasticsearchmachine added the Team:Data Management label Aug 9, 2022

cleaning up tests

a28b6c0

andreidan reviewed Aug 11, 2022

masseyke added 3 commits August 11, 2022 09:27

code review feedback

1c55c2c

code review feedback

37e291d

merging main

2b3826a

masseyke requested a review from andreidan August 11, 2022 15:05

masseyke added 3 commits August 11, 2022 12:15

fixing unit tests

7a4c877

code review feedback

a3e8156

code review feedback

994c85f

andreidan approved these changes Aug 12, 2022

masseyke merged commit 5a26455 into elastic:main Aug 12, 2022

masseyke deleted the feature/health-api-master-stability-not-master branch August 12, 2022 13:28
