Extend metrics and rebuild prometheus exporting logic #2567
Conversation
Great addition... I'll add the DataDog and StatsD counters as soon as this PR gets merged!
Will update influx metrics as well once this PR is merged.
@marco-jantke could you rebase?
Force-pushed from df7beab to 91de540.
@ldez done.
Force-pushed from 91de540 to 95cbeab.
LGTM
Can we get this moving?
👏
LGTM
LGTM
Force-pushed from a73f3cb to db7d6dd.
I know that this PR is large, but it has grown over a long time in our fork and the changes depend on each other, which makes it very hard to split them up into several commits.
I am happy to assist in any way that helps you review it; a peer review via video chat, for example, is completely fine for me.
Just let me know how I can help get this merged.
To make the review a bit easier, I added an extensive commit description, quoted below, that explains the different measures.
To whet the appetite, here is a short list of the metrics Traefik now supports for Prometheus, and it is very easy for the other providers to follow!
To get the other providers to the same feature level (not required for this PR to work, see the commit description), a shoutout to @aantono and @adityacs, who integrated those providers, to extend their implementations.
Which metrics exist now:

- `traefik_config_reloads_total`: useful to see how often configuration changes happen
- `traefik_config_reloads_failure_total`: useful to see failing config reloads and to alert on those
- `traefik_config_last_reload_success`: useful to see when the last successful config reload happened
- `traefik_config_last_reload_failure`: useful to see when the last failed config reload happened
- `traefik_entrypoint_requests_total`: same as before, but now on the entrypoint level
- `traefik_entrypoint_request_duration_seconds`: same as before, but now on the entrypoint level
- `traefik_entrypoint_open_connections`: useful to see how many connections are currently open. It is even split up by protocol, so you could see that there are e.g. 1k open websocket connections, 300 open SSE connections and 1.5k HTTP requests currently in flight
- `traefik_backend_requests_total`: same as before, but on the backend level
- `traefik_backend_request_duration_seconds`: same as before, but on the backend level
- `traefik_backend_retries_total`: same as before, but on the backend level
- `traefik_backend_open_connections`: same as in the entrypoint description, but split up per backend
- `traefik_backend_server_up`: shows how many backend servers of a specific backend exist and are healthy. Useful to alert when e.g. 80% of your backend servers become unhealthy (see the sketch after this list).

Additionally, this PR tackles #2349 and #1759.
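As an illustration of how `traefik_backend_server_up` could be fed and then used for such an alert, here is a minimal sketch with the plain Prometheus Go client. The `OnServerHealthChanged` hook and the label names are assumptions made for this example, not the exact code of this PR:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// One gauge per backend server URL: 1 while the health check passes,
// 0 once it fails. Label names are assumptions for this example.
var backendServerUp = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "traefik_backend_server_up",
		Help: "Backend server is up, described by a gauge value of 0 or 1.",
	},
	[]string{"backend", "url"},
)

func init() {
	prometheus.MustRegister(backendServerUp)
}

// OnServerHealthChanged is a hypothetical hook the health-check loop would
// call whenever the state of a backend server changes.
func OnServerHealthChanged(backend, url string, healthy bool) {
	value := 0.0
	if healthy {
		value = 1.0
	}
	backendServerUp.WithLabelValues(backend, url).Set(value)
}
```

An alert rule could then compare, per backend, the sum of `traefik_backend_server_up` with the number of its series to detect when, say, less than 80% of a backend's servers are healthy.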
Commit message with detailed explanations
This commit extends the metrics Registry interface and its usage in order to get more insightful metrics. It also renames the Prometheus metrics and the service label (to either backend or entrypoint) in order to fix #1759.
In order to extend the Registry interface, the MultiRegistry is used, so that individual metric providers do not have to implement all of the metrics, but only a subset.
At the moment only the Prometheus implementation supports all of the new metrics, but extending the other providers should be straightforward and can be done afterwards.
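For illustration, here is a rough, self-contained sketch of that fan-out idea with a trimmed-down interface; this is not the code from this PR, and the names are simplified for the example:

```go
package metrics

// Counter is a minimal counter abstraction in the style of a go-kit
// metrics.Counter.
type Counter interface {
	With(labelValues ...string) Counter
	Add(delta float64)
}

// Registry is an assumed, trimmed-down version of the extended Registry
// interface; the real one in this PR carries many more metrics.
type Registry interface {
	ConfigReloadsCounter() Counter
	BackendReqsCounter() Counter
}

// multiRegistry fans each metric out to every provider that supports it, so
// a single provider only has to implement a subset of the metrics (returning
// nil, or a no-op counter, for the rest).
type multiRegistry struct {
	configReloads Counter
	backendReqs   Counter
}

// NewMultiRegistry combines several provider registries into one.
func NewMultiRegistry(registries []Registry) Registry {
	reloads := make(multiCounter, 0, len(registries))
	reqs := make(multiCounter, 0, len(registries))
	for _, r := range registries {
		if c := r.ConfigReloadsCounter(); c != nil {
			reloads = append(reloads, c)
		}
		if c := r.BackendReqsCounter(); c != nil {
			reqs = append(reqs, c)
		}
	}
	return &multiRegistry{configReloads: reloads, backendReqs: reqs}
}

func (m *multiRegistry) ConfigReloadsCounter() Counter { return m.configReloads }
func (m *multiRegistry) BackendReqsCounter() Counter   { return m.backendReqs }

// multiCounter forwards With and Add to all underlying provider counters.
type multiCounter []Counter

func (mc multiCounter) With(labelValues ...string) Counter {
	next := make(multiCounter, len(mc))
	for i, c := range mc {
		next[i] = c.With(labelValues...)
	}
	return next
}

func (mc multiCounter) Add(delta float64) {
	for _, c := range mc {
		c.Add(delta)
	}
}
```

The real Registry covers many more metrics (request duration histograms, open connection gauges, config reload counters, ...), but the idea is the same: each provider only wires up what it supports.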
Additionally, the Prometheus exporting logic was adapted.
The problem before was that the Prometheus client library is stateful and samples all tracked metrics on the client.
This is a problem when the things you are tracking are dynamic.
We have a metric `traefik_backend_server_up` which can be used to alert when too many servers of a backend become unhealthy.
When Traefik learns about a new backend, it creates the concrete metric with the labels for this backend.
The problem is that once the backend is undeployed, Traefik still exports the metric of the old backend, because the client holds the Prometheus metric state, and the metric will only be gone after a restart of Traefik.
This leads to wrong results when you calculate e.g. the ratio of healthy backends to all backends, and thus breaks alerting.
To fix this, one has to implement custom logic for the metric sampling.
I got confirmation for this approach from Prometheus maintainers on the Prometheus users Google group.
To fix the problem, I implemented a configuration generation logic so that metrics that belong to dynamic configuration (e.g. backends, entrypoints) are eventually removed again once they belong to an old configuration and have not been tracked recently.
https://groups.google.com/forum/#!topic/prometheus-users/EbumSJYsJyc
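To make the idea more concrete, here is a simplified sketch (assumed names, not the code from this PR) of a custom prometheus.Collector that tracks each dynamic series together with the configuration generation that last touched it and sweeps stale series on the next configuration reload:

```go
package metrics

import (
	"strings"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// trackedGauge remembers the value and label values of one concrete time
// series, together with the configuration generation that last touched it.
type trackedGauge struct {
	labelValues []string
	value       float64
	generation  int64
}

// dynamicGaugeVec exports gauges for dynamic objects (backends, entrypoints,
// backend servers, ...) and drops series that are no longer refreshed by the
// current configuration.
type dynamicGaugeVec struct {
	mu         sync.Mutex
	desc       *prometheus.Desc
	generation int64
	tracked    map[string]*trackedGauge
}

func newDynamicGaugeVec(name, help string, labelNames []string) *dynamicGaugeVec {
	return &dynamicGaugeVec{
		desc:    prometheus.NewDesc(name, help, labelNames, nil),
		tracked: map[string]*trackedGauge{},
	}
}

// Set records a value for the given label values and marks the series as
// belonging to the current configuration generation.
func (v *dynamicGaugeVec) Set(value float64, labelValues ...string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	key := strings.Join(labelValues, "|")
	v.tracked[key] = &trackedGauge{labelValues: labelValues, value: value, generation: v.generation}
}

// ConfigReload removes every series that was not refreshed since the previous
// configuration was applied and then bumps the generation, so undeployed
// backends eventually stop being exported without restarting Traefik.
func (v *dynamicGaugeVec) ConfigReload() {
	v.mu.Lock()
	defer v.mu.Unlock()
	for key, g := range v.tracked {
		if g.generation < v.generation {
			delete(v.tracked, key)
		}
	}
	v.generation++
}

// Describe and Collect implement prometheus.Collector, so the exporter only
// samples the series that are still tracked.
func (v *dynamicGaugeVec) Describe(ch chan<- *prometheus.Desc) { ch <- v.desc }

func (v *dynamicGaugeVec) Collect(ch chan<- prometheus.Metric) {
	v.mu.Lock()
	defer v.mu.Unlock()
	for _, g := range v.tracked {
		ch <- prometheus.MustNewConstMetric(v.desc, prometheus.GaugeValue, g.value, g.labelValues...)
	}
}
```

On every applied configuration, the caller would invoke `ConfigReload` once and then `Set` the gauge for each object of the new configuration; series from an old configuration that are no longer refreshed disappear at the following reload.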
This PR also splits up the metric collection of requests, request durations and retries into separate metrics for entrypoints and backends.
This will fix #1759. Before, there was only one metric, `traefik_requests_total`, which was tracked at the entrypoint level with the label `service="$entrypointname"` and at the backend level with `service="$backendname"`.
So for each request, two `traefik_requests_total` metrics were tracked, and a calculation such as `sum(traefik_requests_total)` would yield twice the number of requests that actually happened.
This is fixed by splitting the metric up into two.
For the other metric providers the problem was the same, and as I did not adapt those yet, for now they will only count requests against the backends.
Extending them should happen afterwards.
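As a sketch of what the split into entrypoint and backend request counters looks like (using the plain Prometheus Go client directly for brevity; in the PR this goes through the Registry abstraction, and the exact label names here are assumptions):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// The two request counters from this PR, sketched with the plain Prometheus
// client; the label names are assumptions for this example.
var (
	entrypointReqs = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "traefik_entrypoint_requests_total",
			Help: "How many HTTP requests were processed on an entrypoint.",
		},
		[]string{"entrypoint", "code", "protocol", "method"},
	)
	backendReqs = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "traefik_backend_requests_total",
			Help: "How many HTTP requests were processed on a backend.",
		},
		[]string{"backend", "code", "protocol", "method"},
	)
)

func init() {
	prometheus.MustRegister(entrypointReqs, backendReqs)
}

// recordRequest (a hypothetical hook) increments both counters for one
// proxied request. Since the two series have distinct metric names, summing
// either of them counts each request exactly once, instead of twice as with
// the old shared traefik_requests_total metric.
func recordRequest(entrypoint, backend, code, protocol, method string) {
	entrypointReqs.WithLabelValues(entrypoint, code, protocol, method).Inc()
	backendReqs.WithLabelValues(backend, code, protocol, method).Inc()
}
```

Because the two series now have distinct names, `sum(traefik_entrypoint_requests_total)` and `sum(traefik_backend_requests_total)` can each be interpreted on their own and neither double-counts requests.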
Fixes #2349 #1759