Extend metrics and rebuild prometheus exporting logic #2567

Merged

Conversation

@m3co-code (Contributor) commented Dec 14, 2017

I know that this PR is large, but it grew over a long time in our fork, and the changes depend on each other, which makes it very hard to split them up into several commits.
I am happy to assist in any way that helps you review it; a peer review via video chat, for example, is also completely fine for me.
Just let me know how I can help get this merged.
To make the review a bit easier, I added an extensive commit description, quoted below, that explains the different measures.
To whet the appetite for why this is useful, here is a short list of the metrics Traefik now supports for Prometheus. It is also very easy for the other providers to follow along!
To get the other providers to the same feature level (not required to work right now, see the commit description), a shout-out to @aantono and @adityacs, who integrated those providers, to extend their implementations.

The metrics that exist now, with a short description of each:

  • traefik_config_reloads_total: useful to see how often config changes happen
  • traefik_config_reloads_failure_total: useful to see failing config reloads and alert on those
  • traefik_config_last_reload_success: useful to see when the last successful config reload happened
  • traefik_config_last_reload_failure: useful to see when the last failed config reload happened
  • traefik_entrypoint_requests_total: same as before but on entrypoint level now
  • traefik_entrypoint_request_duration_seconds: same as before but on entrypoint level now
  • traefik_entrypoint_open_connections: useful to see how many open connections currently exist. It is even split up by protocol, so you can see, e.g., that there are 1k open websocket connections, 300 open SSE connections, and 1.5k HTTP requests currently in flight
  • traefik_backend_requests_total: same as before but on backend level
  • traefik_backend_request_duration_seconds: same as before but on backend level
  • traefik_backend_retries_total: same as before but on backend level
  • traefik_backend_open_connections: same as in entrypoint description but split up for each backend
  • traefik_backend_server_up: shows how many servers of a specific backend exist and are healthy. Useful to alert when, e.g., 80% of your backend servers become unhealthy (see the gauge sketch below the list).
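
As a rough illustration, here is how such a gauge might be registered and updated with the Prometheus Go client. This is a minimal sketch, not Traefik's actual code; the helper name markServerHealth and the exact label set are assumptions for the example.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// backendServerUp mirrors the shape of traefik_backend_server_up:
// a gauge vector labeled by backend and server URL, set to 1 when a
// server is considered healthy and to 0 when it is not.
var backendServerUp = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "traefik_backend_server_up",
		Help: "Backend server is up, described by gauge value of 0 or 1.",
	},
	[]string{"backend", "url"},
)

func init() {
	prometheus.MustRegister(backendServerUp)
}

// markServerHealth is a hypothetical helper that a health-check loop
// could call whenever a server's status changes.
func markServerHealth(backend, url string, healthy bool) {
	value := 0.0
	if healthy {
		value = 1.0
	}
	backendServerUp.WithLabelValues(backend, url).Set(value)
}
```

An unhealthy-ratio alert can then be computed per backend by comparing the sum of this gauge per backend against the total number of tracked servers.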

Additionally, this PR tackles #2349 and #1759.

Commit message with detailed explanations

This commit extends the metrics Registry interface and its usage in order to get more insightful metrics. It also renames the Prometheus metrics and the service label (to either backend or entrypoint) in order to fix #1759.

In order to extend the Registry interface, the MultiRegistry is used so that single metric providers do not necessarily have to implement all of the metrics, but only a subset.
At the moment only the Prometheus implementation supports all new metrics, but extending the other providers should be straightforward and can be done afterwards.
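
To illustrate the MultiRegistry idea, here is a minimal sketch of the fan-out pattern; the interface and type names (Counter, Registry, multiRegistry) are illustrative, not Traefik's actual definitions:

```go
package metrics

// Counter is a minimal metric abstraction for this sketch.
type Counter interface {
	Add(delta float64)
}

// Registry is an illustrative one-method subset of the metrics
// registry interface described above.
type Registry interface {
	ConfigReloadsCounter() Counter
}

// multiCounter fans a single Add call out to several counters.
type multiCounter []Counter

func (m multiCounter) Add(delta float64) {
	for _, c := range m {
		c.Add(delta)
	}
}

// multiRegistry combines several provider registries; a provider that
// does not support a given metric can simply return a no-op Counter,
// so no provider is forced to implement everything.
type multiRegistry struct {
	registries []Registry
}

func (r *multiRegistry) ConfigReloadsCounter() Counter {
	counters := make(multiCounter, 0, len(r.registries))
	for _, reg := range r.registries {
		counters = append(counters, reg.ConfigReloadsCounter())
	}
	return counters
}
```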

Additionally, the Prometheus exporting logic was adapted.
The problem before was that the Prometheus client library is stateful and samples all tracked metrics on the client.
This is a problem when the things you are tracking are dynamic.
We have a metric traefik_backend_server_up which can be used to alert when too many servers of a backend become unhealthy.
When Traefik learns about a new backend, it creates the concrete metric with the labels for this backend.
The problem is that when the backend is undeployed, Traefik will still export the metric of the old backend, because the client holds the Prometheus metric state, and it will only be gone after a restart of Traefik.
This leads to wrong results when you calculate, e.g., the ratio of healthy backend servers to all backend servers, and thus it breaks alerting.
To fix this, one has to implement custom logic for the metric sampling.
I got confirmation for this approach from the Prometheus maintainers on the Prometheus users Google group:
https://groups.google.com/forum/#!topic/prometheus-users/EbumSJYsJyc
To fix the problem I implemented configuration generation logic, so that metrics that belong to a dynamic configuration (e.g. backends, entrypoints) are eventually removed once they belong to an old configuration and have not been tracked recently.
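
The following sketch shows the generation idea in isolation: every dynamic label set records the configuration generation in which it was last tracked, and series that were not re-tracked by the newest configuration are dropped. Type and method names here are assumptions for the example, not Traefik's implementation.

```go
package metrics

import "sync"

// trackedSeries remembers when a dynamic label set (e.g. a backend)
// was last tracked, so stale series can eventually be dropped instead
// of being exported forever.
type trackedSeries struct {
	labels     map[string]string
	generation int64
}

type generationTracker struct {
	mu         sync.Mutex
	generation int64 // incremented on every configuration reload
	series     map[string]*trackedSeries
}

// Track marks a label set as belonging to the current configuration.
func (t *generationTracker) Track(key string, labels map[string]string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.series[key] = &trackedSeries{labels: labels, generation: t.generation}
}

// OnConfigReload bumps the generation and removes series that were not
// re-tracked under the previous configuration; the exporter then stops
// reporting them.
func (t *generationTracker) OnConfigReload() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.generation++
	for key, s := range t.series {
		if s.generation < t.generation-1 {
			delete(t.series, key)
		}
	}
}
```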

This PR also splits the metric collection of requests, request durations, and retries into separate metrics for entrypoints and backends.
This fixes #1759. Before, there was only one metric, traefik_requests_total, which was tracked at the entrypoint level with the label service="$entrypointname" and at the backend level with service="$backendname".
So for each request, two traefik_requests_total samples were tracked, and a calculation such as sum(traefik_requests_total) would yield twice the number of requests that actually happened.
This is fixed by splitting the metric into two.
The other metric providers had the same problem; as I have not adapted them yet, for now they will only count requests against the backends.
Extending them should happen afterwards.
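
To make the fix concrete, here is a sketch of the two separate counters with the Prometheus Go client; the exact label sets are assumptions for the example:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Splitting the former shared traefik_requests_total into one metric
// per level keeps sums accurate: each request increments each metric
// at most once.
var (
	entrypointReqs = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "traefik_entrypoint_requests_total",
			Help: "Total request count, partitioned by entrypoint.",
		},
		[]string{"entrypoint", "code", "method", "protocol"},
	)
	backendReqs = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "traefik_backend_requests_total",
			Help: "Total request count, partitioned by backend.",
		},
		[]string{"backend", "code", "method", "protocol"},
	)
)

func init() {
	prometheus.MustRegister(entrypointReqs, backendReqs)
}
```

With this split, sum(traefik_entrypoint_requests_total) and sum(traefik_backend_requests_total) each count every request exactly once.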

Fixes #2349 #1759

aantono commented Dec 18, 2017

Great addition... I'll add the DataDog and StatsD counters as soon as this PR gets merged!

@mmatur added the kind/enhancement label Dec 18, 2017
@mmatur mmatur added this to the 1.6 milestone Dec 18, 2017
@adityacs

Will update influx metrics as well once this PR is merged.

ldez commented Jan 18, 2018

@marco-jantke could you rebase?

@mmatur mmatur self-assigned this Jan 18, 2018
@m3co-code force-pushed the extend-metrics-and-fix-prometheus-exporting branch from df7beab to 91de540 on January 19, 2018
@m3co-code

@ldez done.

@m3co-code force-pushed the extend-metrics-and-fix-prometheus-exporting branch from 91de540 to 95cbeab on January 19, 2018
@mmatur (Member) left a comment

LGTM

@m3co-code

Can we get this moving?

@juliens (Member) left a comment

👏

server/server.go: review comment (outdated, resolved)
@juliens (Member) left a comment

LGTM

@ldez (Contributor) left a comment

LGTM

ldez commented Jan 26, 2018

@aantono and @adityacs now it's up to you 😃

@traefiker force-pushed the extend-metrics-and-fix-prometheus-exporting branch from a73f3cb to db7d6dd on January 26, 2018
@traefiker merged commit cc5ee00 into traefik:master on January 26, 2018
@ldez changed the title from "extend metrics and rebuild prometheus exporting logic" to "Extend metrics and rebuild prometheus exporting logic" on March 24, 2018
Development

Successfully merging this pull request may close these issues:

  • Websocket metrics (#2349)
  • Prometheus request metrics tracked twice (#1759)