
Change error handling when scraping metrics #551

Merged · 1 commit merged into main on Dec 6, 2021
Conversation

@lkysow (Member) commented on Jul 5, 2021:

Fixes some issues with metrics merging. Namely: if there were any errors getting metrics from Envoy, we'd return a 200 with empty metrics instead of a 500; and if the service returned an error, e.g. a 400, we'd stick the body of its response into our metrics output, which would break Prometheus. A sketch of the new handler behavior follows the list below.

  • If Envoy returns an error then also respond with a 500 in our merged
    metrics response so that Prometheus will know that we had an error, not
    that there are no metrics.
  • If the service metrics return with a non-2xx status code then don't
    include the response body in the merged metrics. This will stop issues
    where users accidentally turn on metrics merging but they don't have an
    exporter and so their metrics endpoint returns 404. I could have
    responded with a 500 in this case to indicate that there is an
    error; however, I think it's more likely that users have accidentally
    turned on metrics merging, so the error indication is accomplished via
    a new metric instead (see below).
  • Append a new metric that indicates the success of the service
    scraping. This can be used for alerting by users since the response code
    of the service metrics response is discarded:
    • success: consul_metrics_merging_service_metrics_success 1
    • fail: consul_metrics_merging_service_metrics_success 0
  • Modify logging to use key/value pairs.
  • Fixes #546: Metrics Merging should ignore non-2xx service metrics.
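
To make the new behavior concrete, here is a minimal Go sketch of the handler logic described above. This is an illustration, not the actual consul-k8s code: the service exporter address, listen address, and handler path are made up; only the Envoy address and the new metric name match this PR.

package main

import (
	"fmt"
	"io"
	"net/http"
)

const (
	envoyMetricsAddr                   = "http://127.0.0.1:19000/stats/prometheus"
	prometheusServiceMetricsSuccessKey = "consul_metrics_merging_service_metrics_success"
	serviceMetricsAddr                 = "http://127.0.0.1:8080/metrics" // hypothetical service exporter
)

func mergedMetricsHandler(rw http.ResponseWriter, _ *http.Request) {
	// If the Envoy scrape fails, respond with a 500 so Prometheus records a
	// failed scrape rather than a successful scrape with empty metrics.
	envoyResp, err := http.Get(envoyMetricsAddr)
	if err != nil {
		http.Error(rw, "error scraping Envoy metrics", http.StatusInternalServerError)
		return
	}
	defer envoyResp.Body.Close()
	envoyBody, err := io.ReadAll(envoyResp.Body)
	if err != nil || envoyResp.StatusCode/100 != 2 {
		http.Error(rw, "error scraping Envoy metrics", http.StatusInternalServerError)
		return
	}
	rw.Write(envoyBody)

	// Only include the service's response body if it returned a 2xx status;
	// a 404 from a missing exporter must not be pasted into the output.
	serviceSuccess := 0
	if serviceResp, err := http.Get(serviceMetricsAddr); err == nil {
		defer serviceResp.Body.Close()
		if serviceResp.StatusCode/100 == 2 {
			if serviceBody, err := io.ReadAll(serviceResp.Body); err == nil {
				rw.Write(serviceBody)
				serviceSuccess = 1
			}
		}
	}

	// Append a gauge recording whether the service scrape succeeded, since
	// the service's status code is otherwise discarded.
	fmt.Fprintf(rw, "%s %d\n", prometheusServiceMetricsSuccessKey, serviceSuccess)
}

func main() {
	// Hypothetical listen address and path for the merged metrics endpoint.
	http.HandleFunc("/metrics/merged", mergedMetricsHandler)
	http.ListenAndServe("127.0.0.1:20100", nil)
}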

How I've tested this PR:

  • manually and unit tests

How I expect reviewers to test this PR:

  • code

Checklist:

  • Tests added
  • CHANGELOG entry added (HashiCorp engineers only, community PRs should not add a changelog entry)

@lkysow (Member, Author) commented on Jul 5, 2021:

@ndhanushkodi, interested in your thoughts on this. (see also exporter guidelines: https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes)

@ndhanushkodi (Contributor) left a comment:

The behavior looks quite nice to me. Just thinking out loud: could it make sense to have a metric similar to consul_metrics_merging_service_metrics_success for Envoy? But I think what you're doing here makes more sense, so that the user sees the Prometheus scrape get a 500/404 from Envoy rather than a successful metrics scrape when something is definitely configured wrong.

I think all of the changes you've listed make sense.

envoyMetricsAddr = "http://127.0.0.1:19000/stats/prometheus"
// prometheusServiceMetricsSuccessKey is the key of the prometheus metrics used to
// indicate if service metrics were scraped successfully.
prometheusServiceMetricsSuccessKey = "consul_metrics_merging_service_metrics_success"
A contributor left a review comment on the snippet above:
Suggested change:
- prometheusServiceMetricsSuccessKey = "consul_metrics_merging_service_metrics_success"
+ prometheusServiceMetricsSuccessKey = "consul_merged_service_metrics_success"

This was just the only idea I could come up with to shorten the name of this metric and still be specific, feel free to take it or not.
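
Whichever name is picked, the key is only ever emitted as a single gauge line appended to the merged output (the 1/0 values listed in the PR description). A rough, hypothetical sketch of that append step; only the key name comes from the diff above, the helper and buffer are made up:

package main

import (
	"bytes"
	"fmt"
)

const prometheusServiceMetricsSuccessKey = "consul_metrics_merging_service_metrics_success"

// appendServiceMetricsSuccess writes the success gauge in Prometheus text
// exposition format, e.g. "consul_metrics_merging_service_metrics_success 1".
func appendServiceMetricsSuccess(buf *bytes.Buffer, success bool) {
	val := 0
	if success {
		val = 1
	}
	fmt.Fprintf(buf, "%s %d\n", prometheusServiceMetricsSuccessKey, val)
}

func main() {
	var merged bytes.Buffer
	merged.WriteString("# ...merged Envoy and service metrics precede this line...\n")
	appendServiceMetricsSuccess(&merged, false)
	fmt.Print(merged.String())
}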

@lkysow force-pushed the lkysow/metrics-merging branch from a109f30 to 943cdca on July 10, 2021
@lkysow force-pushed the lkysow/metrics-merging branch from 943cdca to b30129b on November 23, 2021
@lkysow (Member, Author) commented on Nov 23, 2021:

> could it make sense to have a metric similar to consul_metrics_merging_service_metrics_success for the envoy

I think that signal now comes from us returning a 500 or a 200 depending on what happened. I'm sure there's a way to track that in Prometheus?
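
For reference, Prometheus records a failed scrape (up 0 for the target) whenever an endpoint responds with an error status such as this 500, so it should be trackable out of the box. A tiny illustrative check along those lines, with a made-up URL for the merged endpoint:

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Mimic how a scraper judges the merged endpoint: any non-2xx status
	// (or connection error) counts as a failed scrape in this sketch.
	up := 0
	if resp, err := http.Get("http://127.0.0.1:20100/metrics/merged"); err == nil {
		resp.Body.Close()
		if resp.StatusCode/100 == 2 {
			up = 1
		}
	}
	fmt.Printf("up %d\n", up)
}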

@lkysow marked this pull request as ready for review on November 23, 2021
@lkysow requested reviews from a team, kschoche, and ishustava, and removed the request for a team and kschoche on November 23, 2021
@ishustava (Contributor) left a comment:

Looks good!

@lkysow requested a review from ndhanushkodi on November 30, 2021
@lkysow (Member, Author) commented on Dec 2, 2021:

Will rebase changelog after approval

@ndhanushkodi (Contributor) left a comment:

Looks great! Thanks for fixing this!!

@lkysow force-pushed the lkysow/metrics-merging branch from b7858ce to 4b9963e on December 6, 2021
@lkysow merged commit d08821b into main on Dec 6, 2021
@lkysow deleted the lkysow/metrics-merging branch on December 6, 2021