Adding a check to the master stability health API when there is no master and the current node is not master eligible #89219

Merged

Member

@masseyke masseyke commented Aug 9, 2022

This PR builds on #86524, #87482, and #87306 by supporting the case where there has been no master node in the last 30 seconds, no node has been elected master, and the current node is not master eligible. This is branch 1.2.2.3 in the diagram at #87482 (comment).
The outline of the logic is that when we see that the master node has gone null, we start polling a random master-eligible node for its CoordinationDiagnosticsResult. Once a diagnoseMasterStability() request comes in, we look at the CoordinationDiagnosticsResult from the master-eligible node, and the result is one of the following:

  1. We do not have a result yet from a master-eligible node, so we report that with a RED status (not in the diagram).
  2. We have received a non-GREEN status from the master-eligible node, and we return that status (1.2.2.3.1).
  3. We have received a GREEN status from the master-eligible node, and we return RED because something is wrong with discovery on this node (1.2.2.3.2).
  4. We have timed out or received some exception trying to get the remote master-eligible node's result, and we return RED (1.2.2.3.3).
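
The four outcomes above can be sketched as a small decision function. This is only an illustration, not the code from this PR: `Status`, `RemoteOutcome`, and `NonMasterEligibleDiagnosis` are hypothetical stand-ins for the real Elasticsearch types, and a `null` outcome is assumed to mean "no result yet".

```java
/** Hypothetical stand-in for the health status; not the real Elasticsearch enum. */
enum Status { GREEN, YELLOW, RED }

/** Hypothetical holder for what came back from the polled master-eligible node. */
final class RemoteOutcome {
    final Status remoteStatus; // null if no status was received
    final Exception failure;   // null if no timeout or exception occurred

    RemoteOutcome(Status remoteStatus, Exception failure) {
        this.remoteStatus = remoteStatus;
        this.failure = failure;
    }
}

final class NonMasterEligibleDiagnosis {
    /** Maps the remote node's outcome onto the status this node reports. */
    static Status diagnose(RemoteOutcome outcome) {
        if (outcome == null || (outcome.remoteStatus == null && outcome.failure == null)) {
            return Status.RED; // (1) no result yet
        }
        if (outcome.failure != null) {
            return Status.RED; // (4) timeout or other exception
        }
        if (outcome.remoteStatus == Status.GREEN) {
            return Status.RED; // (3) remote is healthy, so discovery is broken locally
        }
        return outcome.remoteStatus; // (2) pass the non-GREEN status through
    }
}
```

Note that every branch except (2) is RED by construction; only the symptom text and details differ between them.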

(1) No result yet:

{
    "status": "red",
    "cluster_name": "TEST-TEST_WORKER_VM=[--not-gradle--]-CLUSTER_SEED=[4061946227131368569]-HASH=[3A9A1FBC202B3]-cluster",
    "indicators": {
        "master_is_stable": {
            "status": "red",
            "symptom": "No master node observed in the last 1s, and this node is not master eligible. Reaching out to a master-eligible node for more information, but no result yet.",
            "details": {
                "current_master": {
                    "node_id": null,
                    "name": null
                }
            },
            "impacts": [...],
            "diagnosis": [
                {
                    "cause": "The Elasticsearch cluster does not have a stable master node.",
                    "action": "Get help at https://ela.st/getting-help",
                    "help_url": "https://ela.st/getting-help"
                }
            ]
        },
        ...
    }
}

(2) Non-green status:

{
    "status": "red",
    "cluster_name": "TEST-TEST_WORKER_VM=[--not-gradle--]-CLUSTER_SEED=[4061946227131368569]-HASH=[3A9A1FBC202B3]-cluster",
    "indicators": {
        "master_is_stable": {
            "status": "red",
            "symptom": "No master node observed in the last 1s, and the master eligible nodes are unable to form a quorum",
            "details": {
                "current_master": {
                    "node_id": null,
                    "name": null
                },
                "recent_masters": [
                    {
                        "node_id": "e2ljvaXGRk6ysYlGnbJ5xg",
                        "name": "node_t0"
                    }
                ],
                "cluster_formation": {
                    "br6axD4fT0Osdz2ZafjB7A": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [0dVvMncjRviHN16CoeEOCQ, br6axD4fT0Osdz2ZafjB7A, e2ljvaXGRk6ysYlGnbJ5xg], have only discovered non-quorum [{node_t1}{br6axD4fT0Osdz2ZafjB7A}{QcxUxkP3QjSSZVmdfCmi-A}{node_t1}{127.0.0.1}{127.0.0.1:13303}{m}]; discovery will continue using [127.0.0.1:13302, 127.0.0.1:13301] from hosts providers and [{node_t0}{e2ljvaXGRk6ysYlGnbJ5xg}{cQZ6QvlXQpend79JaIgorw}{node_t0}{127.0.0.1}{127.0.0.1:13302}{m}, {node_t1}{br6axD4fT0Osdz2ZafjB7A}{QcxUxkP3QjSSZVmdfCmi-A}{node_t1}{127.0.0.1}{127.0.0.1:13303}{m}, {node_t2}{0dVvMncjRviHN16CoeEOCQ}{oUeLN3g_Si6MsBxIUgR9Xg}{node_t2}{127.0.0.1}{127.0.0.1:13301}{m}] from last-known cluster state; node term 1, last-accepted version 6 in term 1"
                }
            },
            "impacts": [...],
            "diagnosis": [...]
        },
        ...
    }
}

(3) Green remote status:

{
    "status": "red",
    "cluster_name": "TEST-TEST_WORKER_VM=[--not-gradle--]-CLUSTER_SEED=[4061946227131368569]-HASH=[3A9A1FBC202B3]-cluster",
    "indicators": {
        "master_is_stable": {
            "status": "red",
            "symptom": "No master node observed in the last 1s from this node, but node_t2 reports that the status is GREEN. This indicates that there is a discovery problem on node_t4",
            "details": {
                "current_master": {
                    "node_id": "e2ljvaXGRk6ysYlGnbJ5xg",
                    "name": "node_t0"
                },
                "recent_masters": [
                    {
                        "node_id": "e2ljvaXGRk6ysYlGnbJ5xg",
                        "name": "node_t0"
                    }
                ]
            },
            "impacts": [...],
            "diagnosis": [...]
        },
        ...
    }
}

(4) Timeout exception:

{
    "status": "red",
    "cluster_name": "TEST-TEST_WORKER_VM=[--not-gradle--]-CLUSTER_SEED=[4061946227131368569]-HASH=[3A9A1FBC202B3]-cluster",
    "indicators": {
        "master_is_stable": {
            "status": "red",
            "symptom": "No master node observed in the last 1s from this node, and received an exception while reaching out to node_t2 for diagnosis",
            "details": {
                "current_master": {
                    "node_id": null,
                    "name": null
                },
                "recent_masters": [
                    {
                        "node_id": "e2ljvaXGRk6ysYlGnbJ5xg",
                        "name": "node_t0"
                    }
                ],
                "exception_fetching_history": {
                    "message": "[node_t2][127.0.0.1:13301][internal:cluster/coordination_diagnostics/info] request_id [32] timed out after [13143ms]",
                    "stack_trace": "org.elasticsearch.transport.ReceiveTimeoutTransportException: [node_t2][127.0.0.1:13301][internal:cluster/coordination_diagnostics/info] request_id [32] timed out after [13143ms]\n"
                }
            },
            "impacts": [...],
            "diagnosis": [...]
        },
        ...
    }
}

…ster and the current node is not master eligible
@elasticsearchmachine
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@masseyke masseyke marked this pull request as ready for review August 9, 2022 19:33
@masseyke masseyke requested a review from andreidan August 9, 2022 19:34
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Aug 9, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@andreidan
Contributor

@masseyke Just a note: the description says "and the current node is master eligible", which is not correct. The branch we're working on now is the one where the current node is NOT master eligible.

Contributor

@andreidan andreidan left a comment

Thanks for working on this Keith

Left a few minor suggestions

summary = String.format(
    Locale.ROOT,
    "No master node observed in the last %s, and this node is not master eligible. Reaching out to a master-eligible node"
        + " for more information, but no result yet.",

Contributor

Should we say that we couldn't diagnose further because we couldn't reach a master-eligible node?

The "no result yet" bit implies we're still working on it, or that a different result is coming somehow.

Member Author

It's polling random master eligible nodes for information, and we will get a result from one in the future (even if that result is TimedOutException or equivalent). If there were no master eligible nodes at all we would have reported that in another branch. The only way we'd get here and not have a result coming in the future is if we have a bug in the code (like the one fixed by #89014 -- it actually leads to this message being given with no result ever coming from a remote node).
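
A rough sketch of one polling round as described in this comment, under stated assumptions: `RemotePoller`, `Outcome`, and the `fetch` function are hypothetical names, the real service polls asynchronously over the transport layer, and the list of master-eligible nodes is assumed to be non-empty (the empty case is reported by another branch).

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

final class RemotePoller {
    /** Holds either a result or the exception raised while fetching it (hypothetical type). */
    static final class Outcome {
        final String node;
        final String result;     // null when the fetch failed
        final Exception failure; // null when the fetch succeeded

        Outcome(String node, String result, Exception failure) {
            this.node = node;
            this.result = result;
            this.failure = failure;
        }
    }

    // null means "no result yet"
    private final AtomicReference<Outcome> latest = new AtomicReference<>();
    private final Random random;

    RemotePoller(Random random) {
        this.random = random;
    }

    /** One polling round: pick a random master-eligible node and record what came back. */
    void pollOnce(List<String> masterEligibleNodes, Function<String, String> fetch) {
        String node = masterEligibleNodes.get(random.nextInt(masterEligibleNodes.size()));
        try {
            latest.set(new Outcome(node, fetch.apply(node), null));
        } catch (RuntimeException e) {
            // A timeout or transport error still counts as a result for diagnosis purposes.
            latest.set(new Outcome(node, null, e));
        }
    }

    Outcome latest() {
        return latest.get();
    }
}
```

In this sketch a failed fetch overwrites the "no result yet" state just as a successful one does, which matches the point that some result always arrives eventually.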

Contributor

Ah sure. I'm not sure we want to propagate this message in the health API though as the API is request/response. With that in mind I think it'd make sense to tell the API user "we couldn't gather more information" as opposed to an ominous "not yet" :)
I understand that the diagnostics service is a continuously running service. However, we seem to pass this summary on in the health API, so we either:

  1. make sure the message makes sense in the API too (which is what I was proposing), or
  2. keep the message as is (it makes sense in the context of the diagnostics service) but parse and rewrite it in the API to make sense there.

Member Author

Ha I thought the "not yet" made it more optimistic (check back later and we might know more) than more ominous, but I can drop it.

Comment on lines 402 to 406
if (explain) {
    details = CoordinationDiagnosticsDetails.EMPTY;
} else {
    details = CoordinationDiagnosticsDetails.EMPTY;
}

Contributor

details always stays empty here. Do we want to return the "local details" if explain is true, i.e. the details section with the local master history and such?

Member Author

Yeah might as well -- based on the if/else doing the same thing I can only assume I must have intended to do that and then... I'm not sure why I didn't.
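
A minimal sketch of the suggested behaviour, assuming a simplified `Details` class in place of `CoordinationDiagnosticsDetails` (the class, its `recentMasters` field, and `forRequest` are hypothetical): local details are returned only when explain is true.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical, simplified stand-in for the details object discussed above. */
final class Details {
    static final Details EMPTY = new Details(Collections.emptyList());

    final List<String> recentMasters;

    Details(List<String> recentMasters) {
        this.recentMasters = recentMasters;
    }

    /** Returns the local master history when explain is requested, otherwise the shared EMPTY instance. */
    static Details forRequest(boolean explain, List<String> localMasterHistory) {
        return explain ? new Details(localMasterHistory) : EMPTY;
    }
}
```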

        details = CoordinationDiagnosticsDetails.EMPTY;
    }
} else {
    // It should not be possible to get here

Contributor

Should this branch even exist?

Member Author

I could change it to throw an AssertionError instead I guess. I just figured that eventually somehow we'd get into the impossible situation, and better to return something.

Member Author

Actually how about I assert that we're not there so that it'll fail if we have assertions enabled, but I'll return what I'm currently returning so that it won't blow up on the user if we've somehow introduced a bug and missed it in testing.
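
The assert-then-fall-back idea could look roughly like the following. `ImpossibleBranch` and `summarize` are hypothetical names for illustration, and the strings are placeholders rather than the real symptom messages.

```java
final class ImpossibleBranch {
    /** Summarizes a remote outcome; the final branch is believed to be unreachable. */
    static String summarize(String remoteResult, Exception remoteFailure) {
        if (remoteResult != null) {
            return "remote result: " + remoteResult;
        }
        if (remoteFailure != null) {
            return "remote failure: " + remoteFailure.getMessage();
        }
        // Fails fast when the JVM runs with assertions enabled (-ea), e.g. in tests.
        assert false : "should not be possible to get here";
        // With assertions disabled, return something usable instead of failing the request.
        return "unknown";
    }
}
```

The trade-off is that the defect surfaces in test runs while users of a production node still get a response.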

Contributor

But what is the else?

RemoteMasterHealthResult must have a result or an exception. If this is not enforced, let's enforce it in the constructor there?
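
One way to enforce that invariant in a constructor, shown on a hypothetical generic holder rather than on the actual RemoteMasterHealthResult (whose fields are not shown in this conversation):

```java
/** Hypothetical holder that always carries exactly one of a result or an exception. */
final class ResultOrException<T> {
    final T result;
    final Exception exception;

    ResultOrException(T result, Exception exception) {
        // Reject both-null and both-non-null, so callers never need a third "else" branch.
        if ((result == null) == (exception == null)) {
            throw new IllegalArgumentException("exactly one of result and exception must be non-null");
        }
        this.result = result;
        this.exception = exception;
    }

    static <T> ResultOrException<T> ofResult(T result) {
        return new ResultOrException<>(result, null);
    }

    static <T> ResultOrException<T> ofException(Exception exception) {
        return new ResultOrException<>(null, exception);
    }
}
```

With the check in place, code that consumes the holder can branch on `result != null` and treat the other case as the exception case.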

@masseyke masseyke requested a review from andreidan August 11, 2022 15:05
Contributor

@andreidan andreidan left a comment

LGTM thanks Keith

@masseyke masseyke merged commit 5a26455 into elastic:main Aug 12, 2022
@masseyke masseyke deleted the feature/health-api-master-stability-not-master branch August 12, 2022 13:28
3 participants