
Missing backend metrics #3365

Closed
ncgee opened this issue May 22, 2018 · 11 comments
Labels
area/middleware/metrics · kind/bug/confirmed (a confirmed bug, reproducible) · priority/P1 (needs to be fixed in next release) · status/5-frozen-due-to-age

Comments

ncgee commented May 22, 2018

Do you want to request a feature or report a bug?

Bug

What did you do?

Roll out traefik 1.6.1 in a Kubernetes cluster with Prometheus metrics scraping, using ingress resources for dynamic configuration.

What did you expect to see?

The value of traefik_entrypoint_requests_total{code="500"} equal to the value of sum(traefik_backend_requests_total{code="500"}).

What did you see instead?

traefik_entrypoint_requests_total{code="500"} had a value of around 8000, whereas the total for traefik_backend_requests_total{code="500"} was around 8. The value of the entrypoint metric was consistent with the access log. Additionally, running traefik bug as suggested completely removed these stats, which prevented me from providing more exact values. Oddly, stats for other status codes (200, 302) are still present.
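
For reference, the comparison boils down to two PromQL sums over the metrics quoted above (a sketch only; any labels beyond code are those exposed by Traefik 1.6's Prometheus exporter):

    # total 500s counted at the entrypoints (around 8000 here)
    sum(traefik_entrypoint_requests_total{code="500"})

    # total 500s attributed to any backend (around 8 here); expected to be roughly equal
    sum(traefik_backend_requests_total{code="500"})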

Output of traefik version: (What version of Traefik are you using?)

Version:      v1.6.1
Codename:     tetedemoine
Go version:   go1.10.2
Built:        2018-05-14_07:15:23PM
OS/Arch:      linux/amd64

What is your environment & configuration (arguments, toml, provider, platform, ...)?

    defaultEntryPoints = ["http","https"]
    logLevel = "INFO"
    [respondingTimeouts]
      readTimeout = "60s"
      writeTimeout = "60s"
    [traefikLog]
      format = "json"
    [accessLog]
      format = "json"
      [accessLog.fields]
        [accessLog.fields.names]
          "ClientAddr" = "drop"
          "ClientPort" = "drop"
          "RequestPath" = "drop"
          "RequestProtocol" = "drop"
          "StartLocal" = "drop"
        [accessLog.fields.headers]
          [accessLog.fields.headers.names]
            "Etag" = "drop"
    [kubernetes]
      namespaces = ["redacted"]
    [entryPoints]
      [entryPoints.http]
        address = ":80"
        compress = true
        [entryPoints.http.proxyProtocol]
          trustedIPs = ["redacted"]
        [entryPoints.http.redirect]
          entryPoint = "https"
      [entryPoints.https]
        address = ":443"
        compress = true
        [entryPoints.https.proxyProtocol]
          trustedIPs = ["redacted"]
        [entryPoints.https.tls]
          MinVersion = "VersionTLS11"
          CipherSuites = ["TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA","TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA","TLS_RSA_WITH_AES_128_GCM_SHA256","TLS_RSA_WITH_AES_256_GCM_SHA384","TLS_RSA_WITH_AES_128_CBC_SHA","TLS_RSA_WITH_AES_256_CBC_SHA"]
          [[entryPoints.https.tls.certificates]]
            CertFile = "/etc/tls/tls.crt"
            KeyFile = "/etc/tls/tls.key"
    [api]
      dashboard = true
      [api.statistics]
        recentErrors = 10
    [ping]
    [metrics]
      [metrics.prometheus]
        buckets = [0.1,0.3,1.2,5.0]
      [metrics.statistics]
        recentErrors = 10
    [acme]
      email = "redacted"
      storage = "/acme/acme.json"
      entryPoint = "https"
      onDemand = false
      onHostRule = true
      acmeLogging = true
      [acme.httpChallenge]
        entryPoint = "http"
@emilevauge added the kind/bug/possible label and removed the status/0-needs-triage label on May 22, 2018
emilevauge (Member) commented:

Ping @marco-jantke, any idea on this possible wrong-sum behavior?

m3co-code (Contributor) commented May 22, 2018

The metrics can diverge: any request that is not "routable" to one of the backends (no matching frontend) increases the counter for the entrypoint metric but, naturally, not for the backend metric. Those should basically only be 404s, though. Are you sure the requests match a backend, i.e. that a matching frontend is present?
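
One way to check is to compare the per-status-code totals on both sides; the gap should then be concentrated on codes with no matching frontend (mostly 404s). A PromQL sketch, using the metric names above:

    # requests counted at the entrypoints, per status code
    sum by (code) (traefik_entrypoint_requests_total)

    # requests attributed to a backend, per status code
    sum by (code) (traefik_backend_requests_total)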

ncgee (Author) commented May 22, 2018

Yes, quite sure; coincidentally, I noticed this discrepancy after looking into the different metric values for 404s. One specific backend was responsible for the bulk of the 500s, with well over 8000 log lines but a metric value of about 5, if I recall correctly, and without restarts or redeploys of the pod. The logs show the requests as having both a FrontendName and a BackendName value.

Additionally, but perhaps this should be a different issue, I noticed that metrics do not exist for backends that haven't handled any requests, where I would expect the metrics to exist with a value of 0.

m3co-code (Contributor) commented:

Can you paste me one of the log lines for a request that wasn't counted in the metrics?

Regarding pre-population of metrics: we pre-populate the backend_server_up metric, but not all the request / retry metrics. That would add up to a very large number of pre-populated series, since we would have to iterate over every combination of HTTP status code and HTTP method for each backend, and so on. Combined with the dynamic nature of Traefik, I don't think this is something we want to do.
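
If the missing zero-valued series get in the way of queries, they can usually be worked around on the PromQL side. A sketch, assuming the traefik_backend_server_up metric mentioned above shares its backend label with traefik_backend_requests_total:

    # backends that are known (a server_up series exists) but have no request series yet
    traefik_backend_server_up
      unless on (backend)
    sum by (backend) (traefik_backend_requests_total)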

ncgee (Author) commented May 23, 2018

Here is one, I had to redact some things though:

{"BackendAddr":"redacted","BackendName":"redacted","BackendURL":{"Scheme":"http","Opaque":"","User":null,"Host":"redacted","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":""},"ClientHost":"redacted","ClientUsername":"-","DownstreamContentSize":108,"DownstreamStatus":500,"DownstreamStatusLine":"500 Internal Server Error","Duration":12766684,"FrontendName":"redacted","OriginContentSize":108,"OriginDuration":12478122,"OriginStatus":500,"OriginStatusLine":"500 Internal Server Error","Overhead":288562,"RequestAddr":"redacted","RequestContentSize":0,"RequestCount":33668,"RequestHost":"redacted","RequestLine":"GET /redacted HTTP/1.1","RequestMethod":"GET","RequestPort":"-","RetryAttempts":0,"StartUTC":"2018-05-22T08:42:04.086828695Z","downstream_Content-Type":"application/json;charset=UTF-8","downstream_Date":"Tue, 22 May 2018 08:42:04 GMT","downstream_Vary":"Accept-Encoding","level":"info","msg":"","origin_Content-Type":"application/json;charset=UTF-8","origin_Date":"Tue, 22 May 2018 08:42:04 GMT","origin_Vary":"Accept-Encoding","request_Accept":"*/*","request_Accept-Encoding":"gzip, deflate","request_Accept-Language":"en-us","request_Cache-Control":"no-cache","request_Connection":"keep-alive","request_Pragma":"no-cache","request_User-Agent":"redacted","request_X-Forwarded-For":"redacted","request_X-Forwarded-Host":"redacted","request_X-Forwarded-Port":"443","request_X-Forwarded-Proto":"https","request_X-Forwarded-Server":"redacted","request_X-Real-Ip":"redacted","time":"2018-05-22T08:42:04Z"}

I've also added some screenshots; the entrypoint metric has magically reappeared, but the backend metrics are still missing since running traefik bug:

[screenshot, 2018-05-23: traefik_entrypoint_requests_total]
[screenshot, 2018-05-23: traefik_backend_requests_total]

m3co-code (Contributor) commented:

Thanks for the report. I'm going to take a deeper look into this and check whether we have similar problems.

As for the re-appearing metric, you may be affected by something that will be fixed with #3287.

m3co-code (Contributor) commented:

After digging into the issue I figured out that resetting metrics was not properly implemented. I pushed another commit to my PR, acce4d9e533f2d7931a5b3738196a835f20b34d8, which will fix that. I'll also try to change the target of my PR to the 1.6 branch, in order to get this officially released sooner.

A bit of background on what the issue is here: as the configuration of Traefik is dynamic, we have to make sure that metrics belonging to dynamic configuration (e.g. traefik_backend_requests_total{backend="my-backend"}) are removed from the /metrics output once that part of the dynamic configuration no longer exists; otherwise you can get wrong measurements. Note that we ensure every metric has been scraped at least once before we remove it. In Traefik 1.6 this metric removal was based on a rather poor heuristic: if a metric had not been tracked within the time frame of the last 3 configuration reloads, it was removed. That, combined with the fact that the metric value wasn't reset properly, is responsible for the re-appearing entrypoint metric, at least that's my guess. My PR #3287 will fix this by a) fixing the bug so that removed metrics are also reset, and b) removing metrics based on the actual configuration of Traefik instead of a heuristic.

ncgee (Author) commented May 25, 2018

Thanks for looking into this, it would be nice to have the fix in the 1.6 branch. Coincidentally, the missing backend metric (re)appeared today with the expected value, while others went missing:

[screenshot, 2018-05-25: traefik_backend_requests_total reappearing]

I suppose that makes sense given your explanation.

@ldez added the kind/bug/confirmed and priority/P1 labels and removed the kind/bug/possible label on May 25, 2018
m3co-code (Contributor) commented:

The bug should be fixed in the latest release: https://github.com/containous/traefik/releases/tag/v1.6.3

Can you try it out and see whether it fixed your problems as well?

ncgee (Author) commented Jun 11, 2018

Thanks for your work on this! The metrics no longer disappear. One minor note: Prometheus appears to have reset the counters, where I would have expected it to take the 'restart' into account. Have the metrics themselves changed a bit as well, or is this expected behaviour?

m3co-code (Contributor) commented:

No, the counter reset is totally expected; this is how Prometheus works. Functions like rate, irate or increase take counter resets into account and calculate the proper result regardless of restarts. The raw counter value on its own is hardly of any use with Prometheus anyway.
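
For example, a query along these lines keeps giving sensible results across restarts and counter resets (a sketch; the 5m window is arbitrary):

    # per-second rate of 500s per backend over the last 5 minutes;
    # rate() detects counter resets and compensates for them
    sum by (backend) (rate(traefik_backend_requests_total{code="500"}[5m]))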

I'm therefore closing the issue. Thanks for reporting, and in case you run into more problems, let us know :-)

@traefik locked and limited the conversation to collaborators on Sep 1, 2019