Prometheus metric to show whether a server is a leader or follower #13169

Closed
dpw opened this issue May 20, 2022 · 1 comment · Fixed by #13304

dpw commented May 20, 2022

Feature Description

There should be a consul server prometheus metric showing whether a server considers itself to be a leader, follower (or neither, in the case of a candidate). This should be possible for the current time (i.e. as of the last prometheus scrape) or for an arbitrary point in the past. Currently there is no straightforward way to determine this.

Furthermore, there should be a metric dedicated to this purpose and stable in future versions of consul (rather than being a metric for some other purpose that indicates the leader as a side effect, and so is liable to change).

Use Case(s)

For normal operation of a consul server cluster, there should be exactly one leader server, and all other servers should be followers. It should be possible to monitor that these conditions are satisfied, and alert if not, by means of simple prometheus query expressions.
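As a rough illustration of the check described above, here is a minimal Python sketch that takes one role sample per server from a single scrape and reports any violation. The function name, the server names, and the role labels ("leader", "follower", "candidate") are assumptions made for this example; they are not existing consul metrics or APIs.

```python
from collections import Counter


def check_cluster_roles(roles):
    """Return a list of problems found in a {server_name: role} snapshot.

    `roles` holds one sample per server from a single scrape. The role
    labels are hypothetical: "leader", "follower", or "candidate".
    """
    counts = Counter(roles.values())
    problems = []
    # Condition 1: exactly one server reports itself as leader.
    if counts["leader"] != 1:
        problems.append(f"expected exactly 1 leader, found {counts['leader']}")
    # Condition 2: every other server is a follower.
    others = sorted(
        name for name, role in roles.items() if role not in ("leader", "follower")
    )
    if others:
        problems.append("neither leader nor follower: " + ", ".join(others))
    return problems


healthy = {"server-1": "leader", "server-2": "follower", "server-3": "follower"}
mid_election = {"server-1": "candidate", "server-2": "follower", "server-3": "follower"}

print(check_cluster_roles(healthy))
print(check_cluster_roles(mid_election))
```

With a dedicated metric, the same two conditions could presumably be written as short prometheus expressions instead of external code.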

Non-solutions

At first glance, it looks like the consul_raft_state_* metrics offer this. But those are counters that increment upon entry to the relevant state. So their values at a point in time do not show the leader and followers. For example, if a server reports a non-zero value of consul_raft_state_leader that means it became leader at some point, but it does not tell you that it is the leader now. (These counters do not even reliably tell the outcome of an election, as multiple elections may occur within a single prometheus scrape interval.)
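To make the counter problem concrete, here is a toy Python model (not consul's implementation) of state-entry counters next to a current-state gauge. The class, the `is_leader` gauge, and the state sequences are all hypothetical and exist only for this sketch.

```python
class ServerMetrics:
    """Toy model of one server's metrics; not Consul's implementation."""

    def __init__(self):
        # Counters: incremented each time the server enters a state.
        self.state_entries = {"leader": 0, "follower": 0, "candidate": 0}
        # Hypothetical gauge: 1 while this server is leader, else 0.
        self.is_leader = 0

    def enter(self, state):
        self.state_entries[state] += 1
        self.is_leader = 1 if state == "leader" else 0


# One server that won an election and later stepped down.
server = ServerMetrics()
for state in ["follower", "candidate", "leader", "follower"]:
    server.enter(state)
print(server.state_entries["leader"], server.is_leader)

# Two elections inside one scrape interval: a wins, steps down, then b wins.
a, b = ServerMetrics(), ServerMetrics()
a.enter("follower")
b.enter("follower")
before = (a.state_entries["leader"], b.state_entries["leader"])  # first scrape
for state in ["candidate", "leader", "follower"]:
    a.enter(state)
for state in ["candidate", "leader"]:
    b.enter(state)
after = (a.state_entries["leader"], b.state_entries["leader"])  # second scrape
increase = tuple(new - old for old, new in zip(before, after))
print(increase, (a.is_leader, b.is_leader))
```

In the first case the leader counter stays non-zero after the server has stepped down. In the second, both servers show the same counter increase between scrapes, so only the gauge distinguishes the current leader.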

In the past, there were gauge metrics that suggested the leader by their presence, for instance consul_raft_apply and consul_autopilot_healthy. But because those were only updated on the leader, when a server ceased to be leader they would contain stale values for a time controlled by the telemetry.prometheus_retention_period config setting. Furthermore, subsequent commits mean that those metrics no longer indicate the leader (#9198 exposed consul_raft_apply on every node; #12617 exposed consul_autopilot_healthy on every server).
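The staleness effect can be sketched with a simplified Python model of a gauge that stays visible until a retention period has elapsed since its last update. This is an assumption-laden illustration of the behaviour described above, not the actual metrics sink; the class name, the 60-second retention, and the timestamps are made up for the example.

```python
class ExpiringGauge:
    """Simplified gauge that remains exported until retention expires."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.value = None
        self.updated_at = None

    def set(self, value, now):
        self.value = value
        self.updated_at = now

    def scrape(self, now):
        """Return the last value while within retention, else None (absent)."""
        if self.updated_at is None:
            return None
        if now - self.updated_at > self.retention_seconds:
            return None
        return self.value


gauge = ExpiringGauge(retention_seconds=60)
gauge.set(1, now=0)
gauge.set(1, now=10)  # last update; the server loses leadership right after

stale = gauge.scrape(now=30)     # still exported although no longer leader
expired = gauge.scrape(now=100)  # absent only once retention has passed
print(stale, expired)
```

Between the loss of leadership and the end of the retention window, the presence of the gauge wrongly suggests the server is still leader.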

While there are counter metrics that only increase on the leader, using them to reliably determine the leader requires a very cumbersome prometheus query expression (especially if the expression must also handle a standalone consul server).

huikang self-assigned this May 27, 2022

huikang commented May 27, 2022

Hi @dpw, thanks for reporting and investigating this issue. Your analysis totally makes sense; I will work on the improvement.
