Add metric definitions for all metrics known at Consul start #9198

mkcp · 2020-11-14T00:38:07Z

I went through every metric written in Consul that I could find and created metric definitions for them. I also declare each raft metric that we consider a "key metric" for Consul - though we'll want to migrate those out to the raft lib in a future patch.

I then went back over the telemetry.mdx list to grab the help text and ensure I got all of our metrics. I don't guarantee that this list is complete: the search list is long and a few metrics that weren't in telemetry.mdx may have eluded me, but it's pretty dang close. Everything that's critical for monitoring Consul is present. Any discrepancies I found between telemetry.mdx and Consul's codebase are documented in #9197. To keep the scope of this manageable, I did not give help text to any metrics missing from telemetry.mdx, nor did I delete any "stale" metrics in telemetry.mdx that no longer appear in Consul's codebase.

One important note for review: besides the definitions themselves, I also add functionality in the agent/setup.go file. There I reference all of our metrics in Consul and add the service namespace to them. Iterating over each metric to add the namespace in the agent saves us from adding the namespace to every definition by hand, which would create a difference between defining and using the metrics (see #9182 for the bug caused by putting the namespace in when writing the metric).

Here's the result! Beautiful metrics with beautiful helptext, guaranteed to be present throughout Consul's lifecycle, even if we haven't measured them within the prometheus_retention_time.

(Finally 😉) Resolves #5140

dnephin

Nice work!

I mostly looked at the plumbing that reads in all the definitions, I haven't looked at the definitions themselves.

Left a couple concerns/suggestions

dnephin · 2020-11-16T18:29:32Z

lib/telemetry.go

+// EmptyPrometheusDefs returns a PrometheusDefs struct where each of the slices have zero elements, but not nil.
+func EmptyPrometheusDefs() PrometheusDefs {
+	return PrometheusDefs{
+		Gauges:    []prometheus.GaugeDefinition{},
+		Counters:  []prometheus.CounterDefinition{},
+		Summaries: []prometheus.SummaryDefinition{},
+	}
+}


The zero value for structs and slices is usable without being initialized, so I don't think this function is necessary. The one place that uses it can use lib.PrometheusDefs{} which is equivalent, and avoids allocating memory for the slice headers.

Or, with my suggestion to remove that type, the other caller can pass in a TelemetryConfig with the PrometheusOpts field that uses the zero value.

That is good to know - I was a bit concerned to pass a nil into go-metrics where I have less ability to make changes. Thankfully go does what you'd want here for once and doesn't panic w/ ranges over nil slices 😂. Seems a bit inconsistent compared to nil elsewhere, but I prefer this behavior so I'll gladly take it.
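The zero-value behavior discussed in this thread can be checked with a small sketch. The `defs` type and `countAll` helper below are hypothetical stand-ins for illustration, not the types from this PR; the point is only that ranging over, measuring, and appending to a nil slice are all safe in Go.

```go
package main

import "fmt"

// defs is a hypothetical stand-in for a struct of definition slices,
// similar in shape to the PrometheusDefs type discussed above.
type defs struct {
	Gauges   []string
	Counters []string
}

// countAll ranges over every slice in d. Ranging over a nil slice runs
// zero iterations, so the zero value needs no special handling.
func countAll(d defs) int {
	n := 0
	for range d.Gauges {
		n++
	}
	for range d.Counters {
		n++
	}
	return n
}

func main() {
	var zero defs // every slice field is nil
	fmt.Println(zero.Gauges == nil, len(zero.Gauges), countAll(zero))

	// append also accepts a nil slice and allocates as needed.
	zero.Gauges = append(zero.Gauges, "example.gauge")
	fmt.Println(len(zero.Gauges), countAll(zero))
}
```

Whether a third-party consumer treats nil and empty slices identically is still worth checking case by case, since code that compares against nil explicitly can tell them apart.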

dnephin · 2020-11-16T18:36:47Z

lib/telemetry.go

 // InitTelemetry configures go-metrics based on map of telemetry config
 // values as returned by Runtimecfg.Config().
-func InitTelemetry(cfg TelemetryConfig) (*metrics.InmemSink, error) {
+func InitTelemetry(cfg TelemetryConfig, defs PrometheusDefs) (*metrics.InmemSink, error) {


Generally if a function accepts a config or options struct all the config would be part of that struct. This convention seems appropriate here as well.

Instead of a new PrometheusDefs struct, TelemetryConfig can use a PrometheusOpts field with type prometheus.PrometheusOpts. This would remove the need for PrometheusDefs, which closely mirrors an existing struct, and would also allow us to remove the PrometheusRetentionTime field, since that field is already in PrometheusOpts.

Yeah I agree w/ that - the prometheus defs were a workaround to avoid adding an extra arg for each of the definition slices. I was a bit concerned about RuntimeConfig not mirroring the config we get in from disk like it does in many places. But adding the PrometheusOpts on there and translating it in builder.go does cut out a bunch of boilerplate here.
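A rough sketch of the options-struct convention agreed on here, using local stand-in types. `GaugeDefinition`, `PrometheusOpts`, their field names, and `initTelemetry` are assumptions made for this illustration and are not copied from go-metrics or from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// GaugeDefinition and PrometheusOpts are local stand-ins that loosely
// mirror the go-metrics prometheus types named in the review. The field
// names are assumptions for this sketch.
type GaugeDefinition struct {
	Name []string
	Help string
}

type PrometheusOpts struct {
	Expiration       time.Duration
	GaugeDefinitions []GaugeDefinition
}

// TelemetryConfig carries everything the initializer needs, so no
// second argument is required for the definitions.
type TelemetryConfig struct {
	MetricsPrefix  string
	PrometheusOpts PrometheusOpts
}

// initTelemetry is a hypothetical initializer. It only reports how many
// gauges it would register, to keep the sketch self-contained.
func initTelemetry(cfg TelemetryConfig) int {
	return len(cfg.PrometheusOpts.GaugeDefinitions)
}

func main() {
	cfg := TelemetryConfig{
		MetricsPrefix: "consul",
		PrometheusOpts: PrometheusOpts{
			Expiration: time.Minute,
			GaugeDefinitions: []GaugeDefinition{
				{Name: []string{"example", "gauge"}, Help: "An example gauge."},
			},
		},
	}
	fmt.Println(initTelemetry(cfg))

	// A caller with no definitions passes the zero value for the field.
	fmt.Println(initTelemetry(TelemetryConfig{MetricsPrefix: "consul"}))
}
```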

dnephin · 2020-11-16T18:52:20Z

agent/setup.go

+		// Set Consul to each definition's namespace
+		var withService []prometheus.GaugeDefinition
+		for _, gauge := range g {
+			gauge.Name = append(serviceName, gauge.Name...)


I believe there are two problems here:

1. This assumes that the TelemetryConfig.MetricsPrefix will always be consul. That is the default value, but it can be changed by the user. I believe this will initialize the wrong metrics if a user specifies a different metrics prefix.

2. append does not modify the slice passed in as the first argument, but it can modify the underlying array used by that slice if the array is not at capacity. This is a subtle thing that can lead to very strange bugs. In this case the array backing serviceName could be used for every gauge.Name. So when the second iteration runs it will modify the value for the first iteration. Every gauge could end up with the same name, so only the very last one will get initialized. I believe it works currently because the array is being created with a capacity of 1, but to fix the first issue, that likely will not work anymore.

This is an example of the problem: https://play.golang.org/p/QFxtyGXrlqw

I believe the convention when using append is to assign the result of append to the variable passed as the first argument. There are only a few rare cases when it can be assigned to another variable.

Since go-metrics is the code that adds the prefix to the name it seems like it should also handle prepending this name. Otherwise every caller will have to guess at how go-metrics is building up the final metric name.
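The aliasing hazard described in this comment can be reproduced in a few lines. This is a standalone sketch, not the code behind the playground link; `prependShared` is a hypothetical helper, and the prefix slice is deliberately created with spare capacity to trigger the problem.

```go
package main

import "fmt"

// prependShared builds every result by appending to the same prefix
// slice. When prefix has spare capacity, append writes into the shared
// backing array instead of allocating a new one.
func prependShared(prefix []string, names [][]string) [][]string {
	var out [][]string
	for _, name := range names {
		out = append(out, append(prefix, name...))
	}
	return out
}

func main() {
	// Length 1, capacity 4: room for append to write in place.
	prefix := make([]string, 1, 4)
	prefix[0] = "consul"

	out := prependShared(prefix, [][]string{
		{"raft", "apply"},
		{"rpc", "query"},
	})

	// Both entries share one backing array, so the second append
	// overwrote the elements the first entry was pointing at.
	fmt.Println(out[0]) // [consul rpc query]
	fmt.Println(out[1]) // [consul rpc query]
}
```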

Yes, this should absolutely be using TelemetryConfig.MetricsPrefix that is a bug.

With the slice prepends, I agree that these should be handled by go-metrics -- I think for this release it's ok if we handle them here so long as we write an issue for it. (I've added some todos as well) It would be impossible to assign the var back to the first var and still be performing a prepend operation, we'd be appending in that case. It looks like if we allocate a new literal for each prepend, we avoid any potential bugs. https://play.golang.org/p/F_zPCegmka7
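A minimal sketch of the "new literal per prepend" approach described above. `withPrefix` is a hypothetical helper that takes the configured prefix as a plain string; it is not the code from the follow-up commit or from the playground link.

```go
package main

import "fmt"

// withPrefix returns a new slice holding prefix followed by name.
// The one-element literal is freshly allocated on every call, so no two
// results can share a backing array.
func withPrefix(prefix string, name []string) []string {
	return append([]string{prefix}, name...)
}

func main() {
	// The prefix would come from configuration rather than being
	// hard-coded; "custom" stands in for a user-supplied value.
	prefix := "custom"

	names := [][]string{
		{"raft", "apply"},
		{"rpc", "query"},
	}

	var out [][]string
	for _, name := range names {
		out = append(out, withPrefix(prefix, name))
	}

	fmt.Println(out[0]) // [custom raft apply]
	fmt.Println(out[1]) // [custom rpc query]
}
```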

Here's the prometheus page with a different metrics_prefix supplied and using the literals in the commit following this comment.

mkcp · 2020-11-16T22:13:58Z

CI's failing for real, gotta fix the runtime_config test...

mkcp · 2020-11-16T23:50:21Z

Done wrestling with the runtime and lib config tests. I worked around TelemetryConfig.MergeDefaults not supporting deep-merges in structs by checking for the correct struct type and leaving it alone. We don't have any prometheus config so this is fine™️ for now.
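For readers unfamiliar with the pattern, here is a simplified sketch of a reflection-based defaults merge that skips one nested struct type rather than merging it field by field. The `promOpts` and `telemetryConfig` types and the `mergeDefaults` method are hypothetical and much smaller than the real TelemetryConfig.MergeDefaults; only string fields are merged here.

```go
package main

import (
	"fmt"
	"reflect"
)

// promOpts and telemetryConfig are hypothetical stand-ins for this sketch.
type promOpts struct {
	Name string
}

type telemetryConfig struct {
	MetricsPrefix string
	StatsdAddr    string
	Prometheus    promOpts
}

// mergeDefaults copies each empty string field from defaults into c.
// Fields of type promOpts are skipped, since this shallow merge has no
// notion of merging nested structs.
func (c *telemetryConfig) mergeDefaults(defaults *telemetryConfig) {
	cv := reflect.ValueOf(c).Elem()
	dv := reflect.ValueOf(defaults).Elem()
	for i := 0; i < cv.NumField(); i++ {
		f := cv.Field(i)
		switch f.Kind() {
		case reflect.Struct:
			if f.Type() == reflect.TypeOf(promOpts{}) {
				continue // leave the nested options untouched
			}
		case reflect.String:
			if f.IsZero() {
				f.Set(dv.Field(i))
			}
		}
	}
}

func main() {
	cfg := telemetryConfig{StatsdAddr: "127.0.0.1:8125"}
	cfg.mergeDefaults(&telemetryConfig{
		MetricsPrefix: "consul",
		StatsdAddr:    "ignored",
		Prometheus:    promOpts{Name: "default"},
	})
	fmt.Printf("%+v\n", cfg)
}
```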

dnephin

LGTM!

Couple very minor suggestions for follow ups, but nothing blocking

dnephin · 2020-11-16T23:48:06Z

lib/telemetry.go

+		case reflect.Struct:
+			if f.Type() == reflect.TypeOf(prometheus.PrometheusOpts{}) {
+				continue
+			}


I couldn't find any callers of TelemetryConfig.MergeDefaults in OSS or Ent. I suspect this MergeDefaults function is leftover from a while ago. The only place I see it called is a test, which is I guess what failed and prompted this change. It looks like the only non-test call to it was removed in 65be587

Not urgent right now, but in a follow up we could delete this method I think.

Adding a note comment

dnephin · 2020-11-16T23:50:20Z

.changelog/9198.txt

@@ -0,0 +1,3 @@
+```release-note:improvement
+server: All metrics should be present and available to prometheus scrapers when Consul starts. If any non-deprecated metrics are missing please submit an issue with its name.


very minor: I know we aren't always consistent with our categories, but I wonder if agent: would be more appropriate than server:.

Makes sense - client agents are affected too

hashicorp-ci · 2020-11-16T23:55:16Z

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/283165.

hashicorp-ci · 2020-11-17T00:13:50Z

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/283196.

hashicorp-ci · 2020-11-17T00:13:55Z

🍒✅ Cherry pick of commit d15b6fd onto release/1.9.x succeeded!

m1keil · 2020-12-14T23:01:16Z

Follow-up question - does this change render prometheus_retention_time meaningless beyond the impact of enabling Prometheus metrics?

pierresouchay · 2020-12-28T23:23:22Z

No, prometheus_retention_time is still valid, and its value still matters, because computed values are cleaned up when the retention time is reached.

Found a few warnings in Consul when gathering the metrics however: #9471

pierresouchay · 2020-12-29T15:16:46Z

@mkcp This patch did create a few issues in Consul 1.9.x as described in #9471 , but I think I have a patch in hashicorp/go-metrics#122

mkcp added 3 commits November 12, 2020 18:12

first pass on agent-configured prometheusDefs and adding defs for every consul metric

24a2471

add the service name in the agent rather than in the definitions themselves

06d59c0

finish adding static server metrics

5da2f1e

mkcp added the type/bug, type/enhancement, and theme/telemetry labels Nov 14, 2020

mkcp added this to the 1.9.0 milestone Nov 14, 2020

mkcp requested a review from a team November 14, 2020 00:38

hashibot-web deployed to Netlify Deploy Preview November 14, 2020 00:45 View deployment

merge master

3966ecb

dnephin reviewed Nov 16, 2020

hashibot-web deployed to Netlify Deploy Preview November 16, 2020 18:55 View deployment

mkcp added 4 commits November 16, 2020 11:02

trim help strings to save a few bytes

15af5ea

push prometheus sink definiitons into prometheus.PrometheusOpts

5e0e409

use the MetricsPrefix to set the service name and provide as slice literal to avoid bugs from append modifying its first arg

b81edac

prometheussink has the same number of params again

49f017b

mkcp added 5 commits November 16, 2020 14:16

update runtime_test to handle PrometheusOpts expiry field change

2fe021f

linting: sort and group import

fc30f07

fix some tests that were broken from the TelemetryConfig change

ad4cebc

Merge branch 'mkcp/telemetry/add-all-metric-definitions' of ssh://github.com/hashicorp/consul into mkcp/telemetry/add-all-metric-definitions

8e554ee

add changelog entry

eda553e

dnephin approved these changes Nov 16, 2020

mkcp added 2 commits November 16, 2020 15:53

add note about deleting TelemetryConfig.MergeDefaults in the future

bd0c7c2

changelog component should mention agent not just server

52c53b2

mkcp merged commit d15b6fd into master Nov 16, 2020

mkcp deleted the mkcp/telemetry/add-all-metric-definitions branch November 16, 2020 23:54

mkcp added the backport/1.9 label Nov 17, 2020

hashicorp-ci pushed a commit that referenced this pull request Nov 17, 2020

Merge pull request #9198 from hashicorp/mkcp/telemetry/add-all-metric-definitions

Add metric definitions for all metrics known at Consul start

82e7363

mkcp restored the mkcp/telemetry/add-all-metric-definitions branch November 17, 2020 00:24

mkcp added a commit that referenced this pull request Nov 17, 2020

Merge pull request #9198 from hashicorp/mkcp/telemetry/add-all-metric-definitions

Add metric definitions for all metrics known at Consul start

88b013b

mkcp deleted the mkcp/telemetry/add-all-metric-definitions branch November 17, 2020 00:29

nahratzah mentioned this pull request Nov 17, 2020

inability to remember metrics #9208

Open

lawliet89 mentioned this pull request Nov 25, 2020

consul_autopilot_healthy metric is NaN #9274

Closed

pierresouchay mentioned this pull request Jan 5, 2021

Use the same Help for metrics with and without labels hashicorp/go-metrics#122

Merged

jkirschner-hashicorp mentioned this pull request Aug 20, 2021

Best way to monitor cluster leader presence/absence #10733

Open

Amier3 mentioned this pull request Apr 11, 2022

Consul Raft State metrics disappear after prometheus retention period has passed #12704

Closed

dpw mentioned this pull request May 20, 2022

Prometheus metric to show whether a server is a leader or follower #13169

Closed
