Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metric definitions for all metrics known at Consul start #9198

Merged
merged 15 commits into from
Nov 16, 2020

Conversation

mkcp
Copy link
Contributor

@mkcp mkcp commented Nov 14, 2020

I went through to every metric written in Consul that I could find and created metric definitions for them. I also declare each raft metric that we consider a "key metric" for Consul - though we'll want to migrate those out to the raft lib in a future patch.

I then went back over the telemetry.mdx list to grab the help text and ensure I got all of our metrics. I don't guarantee that this list is complete, the search list is long and a few may have eluded me that weren't in telemetry.mdx, but it's pretty dang close. Everything that's critical for monitoring Consul is present. Any discrepancies I found between telemetry.mdx and Consul's codebase are documented here #9197. To keep the scope of this manageable: any metrics missing from telemetry.mdx I did not give a help text, nor did I delete any "stale" metrics in telemetry.mdx that no longer appear in Consul's codebase.

One important note for review: beside all of the definitions themselves, I do add functionality in the agent/setup.go file. There I reference all of our metrics in Consul and add the service namespace to them. Iterating over each metric to add the namespace in the agent prevents us from having to add the namespace to every definition by hand, leading to a difference between defining and using the metrics (see #9182 for the bug caused by putting the namespace in when writing the metric).

Here's the result! Beautiful metrics with beautiful helptext, guaranteed to be present throughout Consul's lifecycle, even if we haven't measured them within the prometheus_retention_time.
Screen Shot 2020-11-13 at 4 31 17 PM

(Finally 😉) Resolves #5140

@mkcp mkcp added type/bug Feature does not function as expected type/enhancement Proposed improvement or new feature theme/telemetry Anything related to telemetry or observability labels Nov 14, 2020
@mkcp mkcp added this to the 1.9.0 milestone Nov 14, 2020
@mkcp mkcp requested a review from a team November 14, 2020 00:38
Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

I mostly looked at the plumbing that reads in all the definitions, I haven't looked at the definitions themselves.

Left a couple concerns/suggestions

lib/telemetry.go Outdated
Comment on lines 347 to 354
// EmptyPrometheusDefs returns a PrometheusDefs struct where each of the slices have zero elements, but not nil.
func EmptyPrometheusDefs() PrometheusDefs {
return PrometheusDefs{
Gauges: []prometheus.GaugeDefinition{},
Counters: []prometheus.CounterDefinition{},
Summaries: []prometheus.SummaryDefinition{},
}
}
Copy link
Contributor

@dnephin dnephin Nov 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The zero value for structs and slices is usable without being initialized, so I don't think this function is necessary. The one place that uses it can use lib.PrometheusDefs{} which is equivalent, and avoids allocating memory for the slice headers.

Or, with my suggestion to remove that type, the other caller can pass in a TelemetryConfig with the PrometheusOpts field that uses the zero value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is good to know - I was a bit concerned to pass a nil into go-metrics where I have less ability to make changes. Thankfully go does what you'd want here for once and doesn't panic w/ ranges over nil slices 😂. Seems a bit inconsistent compared to nil elsewhere, but I prefer this behavior so I'll gladly take it.

lib/telemetry.go Outdated
// InitTelemetry configures go-metrics based on map of telemetry config
// values as returned by Runtimecfg.Config().
func InitTelemetry(cfg TelemetryConfig) (*metrics.InmemSink, error) {
func InitTelemetry(cfg TelemetryConfig, defs PrometheusDefs) (*metrics.InmemSink, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally if a function accepts a config or options struct all the config would be part of that struct. This convention seems appropriate here as well.

Instead of a new PrometheusDefs struct, TelemetryConfig can use a PrometheusOpts field with type prometheus.PrometheusOpts. This would remove the need for the PrometheusDefs which closely mirrors an existing struct, and would also also use to remove the PrometheusRetentionTime field, since that field is already in PrometheusOpts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree w/ that - the prometheus defs were a workaround to avoid adding an extra arg for each of the definition slices. I was a bit concerned about RuntimeConfig not mirroring the config we get in from disk like it does in many places. But adding the PrometheusOpts on there and translating it in builder.go does cut out of a bunch of boilerplate here.

agent/setup.go Outdated
// Set Consul to each definition's namespace
var withService []prometheus.GaugeDefinition
for _, gauge := range g {
gauge.Name = append(serviceName, gauge.Name...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there are two problems here:

  1. this assumes that the TelemetryConfig.MetricsPrefix will always be consul. That is the default value, but it can be changed by the user. I believe this will initialize the wrong metrics if a user specifies a different metrics prefix.
  2. append does not modify the slice passed in as the first argument, but it can modify the underlying array used by that slice if the array is not at capacity. This is a subtle thing that can lead to very strange bugs. In this case the array backing serviceName could be used for every gauge.Name. So when the second iteration runs it will modify the value for the first iteration. Every gauge could end up with the same name, so only the very last one will get initialized. I believe it works currently because the array is being created with a capacity of 1, but to fix the first issue, that likely will not work anymore.

This is an example of the problem: https://play.golang.org/p/QFxtyGXrlqw

I believe the convention when using append is assign the result of append to the variable passed as the first argument. There are only a few rare cases when it can be assigned to another variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since go-metrics is the code that adds the prefix to the name it seems like it should also handle prepending this name. Otherwise every caller will have to guess at how go-metrics is building up the final metric name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this should absolutely be using TelemetryConfig.MetricsPrefix that is a bug.

With the slice prepends, I agree that these should be handled by go-metrics -- I think for this release it's ok if we handle them here so long as we write an issue for it. (I've added some todos as well) It would be impossible to assign the var back to the first var and still be performing a prepend operation, we'd be appending in that case. It looks like if we allocate a new literal for each prepend, we avoid any potential bugs. https://play.golang.org/p/F_zPCegmka7

Here's the prometheus page with a different metrics_prefix supplied and using the literals in the commit following this comment.

Screen Shot 2020-11-16 at 1 54 28 PM

@mkcp
Copy link
Contributor Author

mkcp commented Nov 16, 2020

CI's failing for real, gotta fix the runtime_config test...

@mkcp
Copy link
Contributor Author

mkcp commented Nov 16, 2020

Done wrestling with the runtime and lib config tests. I worked around TelemetryConfig.MergeDefaults not supporting deep-merges in structs by checking for the correct struct type and leaving it alone. We don't have any prometheus config so this is fine™️ for now.

Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Couple very minor suggestions for follow ups, but nothing blocking

Comment on lines +223 to +226
case reflect.Struct:
if f.Type() == reflect.TypeOf(prometheus.PrometheusOpts{}) {
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find any callers of TelemetryConfig.MergeDefaults in OSS or Ent. I suspect this MergeDefaults function is leftover from a while ago. The only place I see it called is a test, which is I guess what failed and prompted this change. It looks like the only non-test call to it was removed in 65be587

Not urgent right now, but in a follow up we could delete this method I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a note comment

@@ -0,0 +1,3 @@
```release-note:improvement
server: All metrics should be present and available to prometheus scrapers when Consul starts. If any non-deprecated metrics are missing please submit an issue with its name.
Copy link
Contributor

@dnephin dnephin Nov 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very minor: I know we aren't always consistent with our categories, but I wonder if agent: would be more appropriate than server:.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense - client agents are affected too

@mkcp mkcp merged commit d15b6fd into master Nov 16, 2020
@mkcp mkcp deleted the mkcp/telemetry/add-all-metric-definitions branch November 16, 2020 23:54
@hashicorp-ci
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/283165.

@hashicorp-ci
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/283196.

@hashicorp-ci
Copy link
Contributor

🍒✅ Cherry pick of commit d15b6fd onto release/1.9.x succeeded!

hashicorp-ci pushed a commit that referenced this pull request Nov 17, 2020
…-definitions

Add metric definitions for all metrics known at Consul start
@mkcp mkcp restored the mkcp/telemetry/add-all-metric-definitions branch November 17, 2020 00:24
mkcp added a commit that referenced this pull request Nov 17, 2020
…-definitions

Add metric definitions for all metrics known at Consul start
@mkcp mkcp deleted the mkcp/telemetry/add-all-metric-definitions branch November 17, 2020 00:29
@m1keil
Copy link

m1keil commented Dec 14, 2020

Followup question - does this changes renders prometheus_retention_time meaningless beyond the impact of enabling Prometheus metrics?

@pierresouchay
Copy link
Contributor

No, prometheus_retention_time is still licit with his value as computed values are cleaned when retention time is reached.

Found a few warning in Consul when gathering the metrics however: #9471

@pierresouchay
Copy link
Contributor

@mkcp This patch did create a few issues in Consul 1.9.x as described in #9471 , but I think I have a patch in hashicorp/go-metrics#122

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/telemetry Anything related to telemetry or observability type/bug Feature does not function as expected type/enhancement Proposed improvement or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

A lot of Prometheus metrics are not initialized
6 participants