Extend metrics and rebuild prometheus exporting logic #2567
Conversation
Great addition... I'll add the DataDog and StatsD counters as soon as this PR gets merged!
Will update influx metrics as well once this PR is merged.
@marco-jantke could you rebase?
Force-pushed from df7beab to 91de540.
@ldez done.
Force-pushed from 91de540 to 95cbeab.
LGTM
Can we get this moving?
👏
LGTM
LGTM
Force-pushed from a73f3cb to db7d6dd.
I know that this PR is large, but it has grown over a long time in our fork and the changes depend on each other, which makes it very hard to split them up into several commits.
I am happy to assist in any way that helps you review it; a peer review via video chat, for example, is completely fine for me.
Just let me know how I can help get this merged.
To make the review a bit easier, I added an extensive commit description, quoted below, that explains the different measures.
To whet the appetite, here is a short list of the metrics Traefik now supports for Prometheus, and it is very easy for the other providers to follow!
To get the other providers to the same feature level (not required for this PR to work, see the commit description), a shoutout to @aantono and @adityacs, who integrated those providers, to extend their implementations.
Which metrics exist now:

- `traefik_config_reloads_total`: useful to see how often configuration changes happen
- `traefik_config_reloads_failure_total`: useful to see failing config reloads and to alert on those
- `traefik_config_last_reload_success`: useful to see when the last successful config reload happened
- `traefik_config_last_reload_failure`: useful to see when the last failed config reload happened
- `traefik_entrypoint_requests_total`: same as before, but now on the entrypoint level
- `traefik_entrypoint_request_duration_seconds`: same as before, but now on the entrypoint level
- `traefik_entrypoint_open_connections`: useful to see how many connections are currently open. It is even split up by protocol, so you could see that there are e.g. 1k open websocket connections, 300 open SSE connections and 1.5k HTTP requests currently in flight
- `traefik_backend_requests_total`: same as before, but on the backend level
- `traefik_backend_request_duration_seconds`: same as before, but on the backend level
- `traefik_backend_retries_total`: same as before, but on the backend level
- `traefik_backend_open_connections`: same as in the entrypoint description, but split up per backend
- `traefik_backend_server_up`: shows how many backend servers of a specific backend exist and are healthy. Useful to alert when e.g. 80% of your backend servers become unhealthy (see the sketch after this list).

Additionally, this PR tackles #2349 and #1759.
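As an illustration of how `traefik_backend_server_up` could be fed and then used for such an alert, here is a minimal sketch with the plain Prometheus Go client. The `OnServerHealthChanged` hook and the label names are assumptions made for this example, not the exact code of this PR:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// One gauge per backend server URL: 1 while the health check passes,
// 0 once it fails. Label names are assumptions for this example.
var backendServerUp = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "traefik_backend_server_up",
		Help: "Backend server is up, described by a gauge value of 0 or 1.",
	},
	[]string{"backend", "url"},
)

func init() {
	prometheus.MustRegister(backendServerUp)
}

// OnServerHealthChanged is a hypothetical hook the health-check loop would
// call whenever the state of a backend server changes.
func OnServerHealthChanged(backend, url string, healthy bool) {
	value := 0.0
	if healthy {
		value = 1.0
	}
	backendServerUp.WithLabelValues(backend, url).Set(value)
}
```

An alert rule could then compare, per backend, the sum of `traefik_backend_server_up` with the number of its series to detect when, say, less than 80% of a backend's servers are healthy.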
Commit message with detailed explanations
This commit extends the metrics Registry interface and its usage in order to get more insightful metrics. It also renames the Prometheus metrics and the service label (to either backend or entrypoint) in order to fix #1759.
In order to extend the Registry interface, the MultiRegistry is used, so that individual metric providers do not have to implement all of the metrics, but only a subset.
At the moment only the Prometheus implementation supports all of the new metrics, but extending the other providers should be straightforward and can be done afterwards.
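For illustration, here is a rough, self-contained sketch of that fan-out idea with a trimmed-down interface; this is not the code from this PR, and the names are simplified for the example:

```go
package metrics

// Counter is a minimal counter abstraction in the style of a go-kit
// metrics.Counter.
type Counter interface {
	With(labelValues ...string) Counter
	Add(delta float64)
}

// Registry is an assumed, trimmed-down version of the extended Registry
// interface; the real one in this PR carries many more metrics.
type Registry interface {
	ConfigReloadsCounter() Counter
	BackendReqsCounter() Counter
}

// multiRegistry fans each metric out to every provider that supports it, so
// a single provider only has to implement a subset of the metrics (returning
// nil, or a no-op counter, for the rest).
type multiRegistry struct {
	configReloads Counter
	backendReqs   Counter
}

// NewMultiRegistry combines several provider registries into one.
func NewMultiRegistry(registries []Registry) Registry {
	reloads := make(multiCounter, 0, len(registries))
	reqs := make(multiCounter, 0, len(registries))
	for _, r := range registries {
		if c := r.ConfigReloadsCounter(); c != nil {
			reloads = append(reloads, c)
		}
		if c := r.BackendReqsCounter(); c != nil {
			reqs = append(reqs, c)
		}
	}
	return &multiRegistry{configReloads: reloads, backendReqs: reqs}
}

func (m *multiRegistry) ConfigReloadsCounter() Counter { return m.configReloads }
func (m *multiRegistry) BackendReqsCounter() Counter   { return m.backendReqs }

// multiCounter forwards With and Add to all underlying provider counters.
type multiCounter []Counter

func (mc multiCounter) With(labelValues ...string) Counter {
	next := make(multiCounter, len(mc))
	for i, c := range mc {
		next[i] = c.With(labelValues...)
	}
	return next
}

func (mc multiCounter) Add(delta float64) {
	for _, c := range mc {
		c.Add(delta)
	}
}
```

The real Registry covers many more metrics (request duration histograms, open connection gauges, config reload counters, ...), but the idea is the same: each provider only wires up what it supports.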
Additionally, the Prometheus exporting logic was adapted.
The problem before was that the Prometheus client library is stateful and samples all tracked metrics on the client.
This is a problem when the things you are tracking are dynamic.
We have a metric `traefik_backend_server_up` which can be used to alert when too many servers of a backend become unhealthy.
When Traefik learns about a new backend, it creates the concrete metric with the labels for this backend.
The problem is that once the backend is undeployed, Traefik still exports the metric of the old backend, because the client holds the Prometheus metric state, and the metric will only be gone after a restart of Traefik.
This leads to wrong results when you calculate e.g. the ratio of healthy backends to all backends, and thus breaks alerting.
To fix this, one has to implement custom logic for the metric sampling.
I got confirmation for this approach from Prometheus maintainers on the Prometheus users Google group.
To fix the problem, I implemented a configuration generation logic so that metrics that belong to dynamic configuration (e.g. backends, entrypoints) are eventually removed again once they belong to an old configuration and have not been tracked recently.
https://groups.google.com/forum/#!topic/prometheus-users/EbumSJYsJyc
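To make the idea more concrete, here is a simplified sketch (assumed names, not the code from this PR) of a custom prometheus.Collector that tracks each dynamic series together with the configuration generation that last touched it and sweeps stale series on the next configuration reload:

```go
package metrics

import (
	"strings"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// trackedGauge remembers the value and label values of one concrete time
// series, together with the configuration generation that last touched it.
type trackedGauge struct {
	labelValues []string
	value       float64
	generation  int64
}

// dynamicGaugeVec exports gauges for dynamic objects (backends, entrypoints,
// backend servers, ...) and drops series that are no longer refreshed by the
// current configuration.
type dynamicGaugeVec struct {
	mu         sync.Mutex
	desc       *prometheus.Desc
	generation int64
	tracked    map[string]*trackedGauge
}

func newDynamicGaugeVec(name, help string, labelNames []string) *dynamicGaugeVec {
	return &dynamicGaugeVec{
		desc:    prometheus.NewDesc(name, help, labelNames, nil),
		tracked: map[string]*trackedGauge{},
	}
}

// Set records a value for the given label values and marks the series as
// belonging to the current configuration generation.
func (v *dynamicGaugeVec) Set(value float64, labelValues ...string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	key := strings.Join(labelValues, "|")
	v.tracked[key] = &trackedGauge{labelValues: labelValues, value: value, generation: v.generation}
}

// ConfigReload removes every series that was not refreshed since the previous
// configuration was applied and then bumps the generation, so undeployed
// backends eventually stop being exported without restarting Traefik.
func (v *dynamicGaugeVec) ConfigReload() {
	v.mu.Lock()
	defer v.mu.Unlock()
	for key, g := range v.tracked {
		if g.generation < v.generation {
			delete(v.tracked, key)
		}
	}
	v.generation++
}

// Describe and Collect implement prometheus.Collector, so the exporter only
// samples the series that are still tracked.
func (v *dynamicGaugeVec) Describe(ch chan<- *prometheus.Desc) { ch <- v.desc }

func (v *dynamicGaugeVec) Collect(ch chan<- prometheus.Metric) {
	v.mu.Lock()
	defer v.mu.Unlock()
	for _, g := range v.tracked {
		ch <- prometheus.MustNewConstMetric(v.desc, prometheus.GaugeValue, g.value, g.labelValues...)
	}
}
```

On every applied configuration, the caller would invoke `ConfigReload` once and then `Set` the gauge for each object of the new configuration; series from an old configuration that are no longer refreshed disappear at the following reload.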
This PR also splits up the metric collection of requests, request durations and retries into separate metrics for entrypoints and backends.
This will fix #1759. Before, there was only one metric, `traefik_requests_total`, which was tracked at the entrypoint level with the label `service="$entrypointname"` and at the backend level with `service="$backendname"`.
So for each request, two `traefik_requests_total` metrics were tracked, and a calculation such as `sum(traefik_requests_total)` would yield twice the number of requests that actually happened.
This is fixed by splitting the metric up into two.
For the other metric providers the problem was the same, and as I did not adapt those yet, for now they will only count requests against the backends.
Extending them should happen afterwards.
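As a sketch of what the split into entrypoint and backend request counters looks like (using the plain Prometheus Go client directly for brevity; in the PR this goes through the Registry abstraction, and the exact label names here are assumptions):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// The two request counters from this PR, sketched with the plain Prometheus
// client; the label names are assumptions for this example.
var (
	entrypointReqs = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "traefik_entrypoint_requests_total",
			Help: "How many HTTP requests were processed on an entrypoint.",
		},
		[]string{"entrypoint", "code", "protocol", "method"},
	)
	backendReqs = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "traefik_backend_requests_total",
			Help: "How many HTTP requests were processed on a backend.",
		},
		[]string{"backend", "code", "protocol", "method"},
	)
)

func init() {
	prometheus.MustRegister(entrypointReqs, backendReqs)
}

// recordRequest (a hypothetical hook) increments both counters for one
// proxied request. Since the two series have distinct metric names, summing
// either of them counts each request exactly once, instead of twice as with
// the old shared traefik_requests_total metric.
func recordRequest(entrypoint, backend, code, protocol, method string) {
	entrypointReqs.WithLabelValues(entrypoint, code, protocol, method).Inc()
	backendReqs.WithLabelValues(backend, code, protocol, method).Inc()
}
```

Because the two series now have distinct names, `sum(traefik_entrypoint_requests_total)` and `sum(traefik_backend_requests_total)` can each be interpreted on their own and neither double-counts requests.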
Fixes #2349 #1759