
Operation level prometheus histogram is 100% in first bucket #1636

Closed
ghost opened this issue Aug 29, 2022 · 2 comments · Fixed by #1705


ghost commented Aug 29, 2022

Describe the bug
When looking at the Prometheus histogram for operation-level metrics, essentially every observation lands in the first bucket (le="0.001"), regardless of the actual request latency.

To Reproduce
Steps to reproduce the behavior:

  1. Enable prometheus in router
  2. Drive traffic, making sure requests use named operations (see the sketch below)
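
A minimal traffic-driving sketch, assuming the router listens at http://localhost:8080 (matching the listen setting in the config further down), that the graph exposes an illustrative field such as me { id }, and that the Python requests package is available; the query, operation name, and header values are assumptions, not part of the original report:

# Hypothetical load generator for reproducing the issue.
import requests

ROUTER_URL = "http://localhost:8080"  # router config uses `listen: 0.0.0.0:8080`

# A named operation, so the router can attach operation_name="someName"
# to the histogram series.
QUERY = """
query someName {
  me {
    id
  }
}
"""

def drive_traffic(n: int = 1000) -> None:
    for _ in range(n):
        resp = requests.post(
            ROUTER_URL,
            json={"query": QUERY, "operationName": "someName"},
            headers={"clientid": "load-test"},  # surfaced as api_client via from_headers
            timeout=10,
        )
        resp.raise_for_status()

if __name__ == "__main__":
    drive_traffic()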

Expected behavior
The histogram should distribute observations across buckets according to the actual total operation latency, e.g.

http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.001"} 10
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.005"} 100
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.015"} 200
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.05"} 300
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.1"} 400
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.2"} 400
.....
http_request_duration_seconds_sum{operation_name="someName",status="200"} ...
http_request_duration_seconds_count{operation_name="someName",status="200"} 1010

Output
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.001"} 421216
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.005"} 421217
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.015"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.05"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.1"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.2"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.3"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.4"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.5"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="1"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="5"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="10"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="+Inf"} 421218
http_request_duration_seconds_sum{operation_name="someName",status="200"} 3.7173161799999983
http_request_duration_seconds_count{operation_name="someName",status="200"} 421218
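
For reference, the figures above imply an average latency of roughly 3.717 s / 421218 ≈ 8.8 µs per request, far below any realistic GraphQL round trip, which matches the observation that virtually everything lands in the le="0.001" bucket. A small verification sketch, assuming the metrics endpoint path noted in the config below and the Python requests and prometheus_client packages (the endpoint path and metric name come from this report, everything else is illustrative):

# Scrape the router's Prometheus endpoint and print the implied average
# latency (sum / count) per operation_name; values far below ~1 ms point
# at the skew described in this issue.
import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8080/plugins/apollo.telemetry/prometheus"  # assumed host/port

def implied_averages() -> None:
    text = requests.get(METRICS_URL, timeout=10).text
    sums: dict[str, float] = {}
    counts: dict[str, float] = {}
    for family in text_string_to_metric_families(text):
        if family.name != "http_request_duration_seconds":
            continue
        for sample in family.samples:
            op = sample.labels.get("operation_name", "<none>")
            if sample.name.endswith("_sum"):
                sums[op] = sums.get(op, 0.0) + sample.value
            elif sample.name.endswith("_count"):
                counts[op] = counts.get(op, 0.0) + sample.value
    for op, total in sorted(sums.items()):
        count = counts.get(op, 0.0)
        if count:
            print(f"{op}: {total / count * 1e6:.1f} µs average over {int(count)} requests")

if __name__ == "__main__":
    implied_averages()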

Our router (0.9.4) has this config:

# Configuration of the router's HTTP server
server:
  # The socket address and port to listen on
  # (Defaults to 127.0.0.1:4000)
  listen: 0.0.0.0:8080
  # Default is /.well-known/apollo/server-health
  health_check_path: /healthCheck
  introspection: false

headers:
  all:
  - propagate:
      named: "x-encoded-user"
      rename: "x-graphql-gateway-user"

telemetry:
  metrics:
    common:
      attributes:
        from_headers:
          - named: "clientid"
            rename: "api_client"
            default: "UNKNOWN"
    prometheus:
      # Metrics accessible at URL path `/plugins/apollo.telemetry/prometheus`
      enabled: true
  tracing:
    trace_config:
      service_name: "${GRAPHQL_ROUTER_SERVICE_NAME:graphql-router}"
      service_namespace: "apollo"
      # Optional. Either a float between 0 and 1 or 'always_on' or 'always_off'
      sampler: 0.1
    propagation:
      # https://www.jaegertracing.io/ (compliant with opentracing)
      jaeger: true
    jaeger:
      collector:
        endpoint: "${JAEGER_COLLECTOR_ENDPOINT:http://jaeger-collector.observability:14268/api/traces}"

abernix modified the milestones: v1.0.0-alpha.1, v1.0.0-alpha.2 Aug 30, 2022

bnjjj commented Sep 5, 2022

Thanks @tripatti for creating this issue. I tried to reproduce it, but without success. What environment are you running the router on? Docker? I suspect the bug might be caused by your OS's clock or something related.


ghost commented Sep 5, 2022

We are running on Docker. One detail that is very clear: the sub-graph-specific sum is much, much higher than the operation-specific sum, i.e. the total number of seconds recorded for the target sub-graph is orders of magnitude higher than for the operation. In the example above the operation sum is 3.7 seconds; the sub-graph sum is truncated.

I just sampled production again and there the difference is 1000-fold. I do not think that would be caused by the OS clock?

abernix modified the milestones: v1.0.0-alpha.2, v1.0.0-alpha.3 Sep 6, 2022
bnjjj added a commit that referenced this issue Sep 7, 2022
…he stream each time (#1705)

close #1636

Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
abernix removed the triage label Sep 12, 2022