
Missing backend metrics #3365

Closed
ncgee opened this issue May 22, 2018 · 11 comments
Labels
area/middleware/metrics · kind/bug/confirmed (a confirmed bug, reproducible) · priority/P1 (needs to be fixed in next release) · status/5-frozen-due-to-age

Comments

ncgee commented May 22, 2018

Do you want to request a feature or report a bug?

Bug

What did you do?

Roll out traefik 1.6.1 in a Kubernetes cluster with Prometheus metrics scraping, using ingress resources for dynamic configuration.

What did you expect to see?

The value of traefik_entrypoint_requests_total{code="500"} equal to the value of sum(traefik_backend_requests_total{code="500"}).

What did you see instead?

traefik_entrypoint_requests_total{code="500"} had a value of around 8000, whereas the total for traefik_backend_requests_total{code="500"} was around 8. The value of the entrypoint metric was consistent with the access log. Additionally, running traefik bug as suggested completely removed these stats, which prevented me from providing more exact values. Oddly, stats for other status codes (200, 302) are still present.
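
For reference, the comparison boils down to two PromQL sums over the metrics quoted above (a sketch only; any labels beyond code are those exposed by Traefik 1.6's Prometheus exporter):

    # total 500s counted at the entrypoints (around 8000 here)
    sum(traefik_entrypoint_requests_total{code="500"})

    # total 500s attributed to any backend (around 8 here); expected to be roughly equal
    sum(traefik_backend_requests_total{code="500"})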

Output of traefik version: (What version of Traefik are you using?)

Version:      v1.6.1
Codename:     tetedemoine
Go version:   go1.10.2
Built:        2018-05-14_07:15:23PM
OS/Arch:      linux/amd64

What is your environment & configuration (arguments, toml, provider, platform, ...)?

    defaultEntryPoints = ["http","https"]
    logLevel = "INFO"
    [respondingTimeouts]
      readTimeout = "60s"
      writeTimeout = "60s"
    [traefikLog]
      format = "json"
    [accessLog]
      format = "json"
      [accessLog.fields]
        [accessLog.fields.names]
          "ClientAddr" = "drop"
          "ClientPort" = "drop"
          "RequestPath" = "drop"
          "RequestProtocol" = "drop"
          "StartLocal" = "drop"
        [accessLog.fields.headers]
          [accessLog.fields.headers.names]
            "Etag" = "drop"
    [kubernetes]
      namespaces = ["redacted"]
    [entryPoints]
      [entryPoints.http]
        address = ":80"
        compress = true
        [entryPoints.http.proxyProtocol]
          trustedIPs = ["redacted"]
        [entryPoints.http.redirect]
          entryPoint = "https"
      [entryPoints.https]
        address = ":443"
        compress = true
        [entryPoints.https.proxyProtocol]
          trustedIPs = ["redacted"]
        [entryPoints.https.tls]
          MinVersion = "VersionTLS11"
          CipherSuites = ["TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA","TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA","TLS_RSA_WITH_AES_128_GCM_SHA256","TLS_RSA_WITH_AES_256_GCM_SHA384","TLS_RSA_WITH_AES_128_CBC_SHA","TLS_RSA_WITH_AES_256_CBC_SHA"]
          [[entryPoints.https.tls.certificates]]
            CertFile = "/etc/tls/tls.crt"
            KeyFile = "/etc/tls/tls.key"
    [api]
      dashboard = true
      [api.statistics]
        recentErrors = 10
    [ping]
    [metrics]
      [metrics.prometheus]
        buckets = [0.1,0.3,1.2,5.0]
      [metrics.statistics]
        recentErrors = 10
    [acme]
      email = "redacted"
      storage = "/acme/acme.json"
      entryPoint = "https"
      onDemand = false
      onHostRule = true
      acmeLogging = true
      [acme.httpChallenge]
        entryPoint = "http"
@emilevauge added the kind/bug/possible label and removed the status/0-needs-triage label on May 22, 2018
emilevauge (Member) commented:

Ping @marco-jantke, any idea on this possible wrong-sum behavior?

m3co-code (Contributor) commented May 22, 2018

The metrics can diverge: any request that is not "routable" to one of the backends (no matching frontend) increases the counter for the entrypoint metric but, naturally, not for the backend metric. Those should basically only be 404s, though. Are you sure the requests match a backend, i.e. that a matching frontend is present?
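
One way to check is to compare the per-status-code totals on both sides; the gap should then be concentrated on codes with no matching frontend (mostly 404s). A PromQL sketch, using the metric names above:

    # requests counted at the entrypoints, per status code
    sum by (code) (traefik_entrypoint_requests_total)

    # requests attributed to a backend, per status code
    sum by (code) (traefik_backend_requests_total)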

ncgee (Author) commented May 22, 2018

Yes, quite sure; coincidentally, I noticed this discrepancy after looking into the different metric values for 404s. One specific backend was responsible for the bulk of the 500s, with well over 8000 log lines but a metric value of about 5, if I recall correctly, and without restarts or redeploys of the pod. The logs show the requests as having both a FrontendName and a BackendName value.

Additionally, but perhaps this should be a different issue, I noticed that metrics do not exist for backends that haven't handled any requests, where I would expect the metrics to exist with a value of 0.

m3co-code (Contributor) commented:

Can you paste me one of the log lines for a request that wasn't counted in the metrics?

Regarding pre-population of metrics: we pre-populate the backend_server_up metric, but not all the request / retry metrics. That would add up to a very large number of pre-populated series, since we would have to iterate over every combination of HTTP status code and HTTP method for each backend, and so on. Combined with the dynamic nature of Traefik, I don't think this is something we want to do.
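
If the missing zero-valued series get in the way of queries, they can usually be worked around on the PromQL side. A sketch, assuming the traefik_backend_server_up metric mentioned above shares its backend label with traefik_backend_requests_total:

    # backends that are known (a server_up series exists) but have no request series yet
    traefik_backend_server_up
      unless on (backend)
    sum by (backend) (traefik_backend_requests_total)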

ncgee (Author) commented May 23, 2018

Here is one, I had to redact some things though:

{"BackendAddr":"redacted","BackendName":"redacted","BackendURL":{"Scheme":"http","Opaque":"","User":null,"Host":"redacted","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":""},"ClientHost":"redacted","ClientUsername":"-","DownstreamContentSize":108,"DownstreamStatus":500,"DownstreamStatusLine":"500 Internal Server Error","Duration":12766684,"FrontendName":"redacted","OriginContentSize":108,"OriginDuration":12478122,"OriginStatus":500,"OriginStatusLine":"500 Internal Server Error","Overhead":288562,"RequestAddr":"redacted","RequestContentSize":0,"RequestCount":33668,"RequestHost":"redacted","RequestLine":"GET /redacted HTTP/1.1","RequestMethod":"GET","RequestPort":"-","RetryAttempts":0,"StartUTC":"2018-05-22T08:42:04.086828695Z","downstream_Content-Type":"application/json;charset=UTF-8","downstream_Date":"Tue, 22 May 2018 08:42:04 GMT","downstream_Vary":"Accept-Encoding","level":"info","msg":"","origin_Content-Type":"application/json;charset=UTF-8","origin_Date":"Tue, 22 May 2018 08:42:04 GMT","origin_Vary":"Accept-Encoding","request_Accept":"*/*","request_Accept-Encoding":"gzip, deflate","request_Accept-Language":"en-us","request_Cache-Control":"no-cache","request_Connection":"keep-alive","request_Pragma":"no-cache","request_User-Agent":"redacted","request_X-Forwarded-For":"redacted","request_X-Forwarded-Host":"redacted","request_X-Forwarded-Port":"443","request_X-Forwarded-Proto":"https","request_X-Forwarded-Server":"redacted","request_X-Real-Ip":"redacted","time":"2018-05-22T08:42:04Z"}

I've also added some screenshots; the entrypoint metric has magically reappeared, but the backend metrics are still missing since running traefik bug:

[screenshot, 2018-05-23: traefik_entrypoint_requests_total]
[screenshot, 2018-05-23: traefik_backend_requests_total]

m3co-code (Contributor) commented:

Thanks for the report. I'm going to take a deeper look into this and check whether we have similar problems.

As for the re-appearing metric, you may be affected by something that will be fixed with #3287.

m3co-code (Contributor) commented:

After digging into the issue I figured out that resetting metrics was not properly implemented. I pushed another commit to my PR, acce4d9e533f2d7931a5b3738196a835f20b34d8, which will fix that. I'll also try to change the target of my PR to the 1.6 branch, in order to get this officially released sooner.

A bit of background on what the issue is here: as the configuration of Traefik is dynamic, we have to make sure that metrics belonging to dynamic configuration (e.g. traefik_backend_requests_total{backend="my-backend"}) are removed from the /metrics output once that part of the dynamic configuration no longer exists; otherwise you can get wrong measurements. Note that we ensure every metric has been scraped at least once before we remove it. In Traefik 1.6 this metric removal was based on a rather poor heuristic: if a metric had not been tracked within the time frame of the last 3 configuration reloads, it was removed. That, combined with the fact that the metric value wasn't reset properly, is responsible for the re-appearing entrypoint metric, at least that's my guess. My PR #3287 will fix this by a) fixing the bug so that removed metrics are also reset, and b) removing metrics based on the actual configuration of Traefik instead of a heuristic.

ncgee (Author) commented May 25, 2018

Thanks for looking into this, it would be nice to have the fix in the 1.6 branch. Coincidentally, the missing backend metric (re)appeared today with the expected value, while others went missing:

[screenshot, 2018-05-25: traefik_backend_requests_total reappearing]

I suppose that makes sense given your explanation.

@ldez added the kind/bug/confirmed and priority/P1 labels and removed the kind/bug/possible label on May 25, 2018
m3co-code (Contributor) commented:

The bug should be fixed in the latest release: https://github.com/containous/traefik/releases/tag/v1.6.3

Can you try it out and see whether it fixed your problems as well?

ncgee (Author) commented Jun 11, 2018

Thanks for your work on this! The metrics no longer disappear. One minor note: Prometheus appears to have reset the counters, where I would have expected it to take the 'restart' into account. Have the metrics themselves changed a bit as well, or is this expected behaviour?

m3co-code (Contributor) commented:

No, the counter reset is totally expected; this is how Prometheus works. Functions like rate, irate or increase take counter resets into account and calculate the proper result regardless of restarts. The raw counter value on its own is hardly of any use with Prometheus anyway.
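
For example, a query along these lines keeps giving sensible results across restarts and counter resets (a sketch; the 5m window is arbitrary):

    # per-second rate of 500s per backend over the last 5 minutes;
    # rate() detects counter resets and compensates for them
    sum by (backend) (rate(traefik_backend_requests_total{code="500"}[5m]))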

I'm therefore closing the issue. Thanks for reporting, and in case you run into more problems, let us know :-)

@traefik locked and limited the conversation to collaborators on Sep 1, 2019