
Operation level prometheus histogram is 100% in first bucket #1636

Closed
ghost opened this issue Aug 29, 2022 · 2 comments · Fixed by #1705


ghost commented Aug 29, 2022

Describe the bug
When looking at the Prometheus histogram for operation-level metrics, essentially every observation lands in the first bucket (le="0.001"), regardless of the actual request latency.

To Reproduce
Steps to reproduce the behavior:

  1. Enable prometheus in router
  2. Drive traffic, making sure requests use named operations (see the sketch below)
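
A minimal traffic-driving sketch, assuming the router listens at http://localhost:8080 (matching the listen setting in the config further down), that the graph exposes an illustrative field such as me { id }, and that the Python requests package is available; the query, operation name, and header values are assumptions, not part of the original report:

# Hypothetical load generator for reproducing the issue.
import requests

ROUTER_URL = "http://localhost:8080"  # router config uses `listen: 0.0.0.0:8080`

# A named operation, so the router can attach operation_name="someName"
# to the histogram series.
QUERY = """
query someName {
  me {
    id
  }
}
"""

def drive_traffic(n: int = 1000) -> None:
    for _ in range(n):
        resp = requests.post(
            ROUTER_URL,
            json={"query": QUERY, "operationName": "someName"},
            headers={"clientid": "load-test"},  # surfaced as api_client via from_headers
            timeout=10,
        )
        resp.raise_for_status()

if __name__ == "__main__":
    drive_traffic()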

Expected behavior
The histogram should distribute observations across buckets according to the actual total operation latency, e.g.

http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.001"} 10
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.005"} 100
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.015"} 200
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.05"} 300
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.1"} 400
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.2"} 400
.....
http_request_duration_seconds_sum{operation_name="someName",status="200"} ...
http_request_duration_seconds_count{operation_name="someName",status="200"} 1010

Output
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.001"} 421216
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.005"} 421217
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.015"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.05"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.1"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.2"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.3"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.4"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="0.5"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="1"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="5"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="10"} 421218
http_request_duration_seconds_bucket{operation_name="someName",status="200",le="+Inf"} 421218
http_request_duration_seconds_sum{operation_name="someName",status="200"} 3.7173161799999983
http_request_duration_seconds_count{operation_name="someName",status="200"} 421218
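
For reference, the figures above imply an average latency of roughly 3.717 s / 421218 ≈ 8.8 µs per request, far below any realistic GraphQL round trip, which matches the observation that virtually everything lands in the le="0.001" bucket. A small verification sketch, assuming the metrics endpoint path noted in the config below and the Python requests and prometheus_client packages (the endpoint path and metric name come from this report, everything else is illustrative):

# Scrape the router's Prometheus endpoint and print the implied average
# latency (sum / count) per operation_name; values far below ~1 ms point
# at the skew described in this issue.
import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8080/plugins/apollo.telemetry/prometheus"  # assumed host/port

def implied_averages() -> None:
    text = requests.get(METRICS_URL, timeout=10).text
    sums: dict[str, float] = {}
    counts: dict[str, float] = {}
    for family in text_string_to_metric_families(text):
        if family.name != "http_request_duration_seconds":
            continue
        for sample in family.samples:
            op = sample.labels.get("operation_name", "<none>")
            if sample.name.endswith("_sum"):
                sums[op] = sums.get(op, 0.0) + sample.value
            elif sample.name.endswith("_count"):
                counts[op] = counts.get(op, 0.0) + sample.value
    for op, total in sorted(sums.items()):
        count = counts.get(op, 0.0)
        if count:
            print(f"{op}: {total / count * 1e6:.1f} µs average over {int(count)} requests")

if __name__ == "__main__":
    implied_averages()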

Our router (0.9.4) has this config:

# Configuration of the router's HTTP server
server:
  # The socket address and port to listen on
  # (Defaults to 127.0.0.1:4000)
  listen: 0.0.0.0:8080
  # Default is /.well-known/apollo/server-health
  health_check_path: /healthCheck
  introspection: false

headers:
  all:
  - propagate:
      named: "x-encoded-user"
      rename: "x-graphql-gateway-user"

telemetry:
  metrics:
    common:
      attributes:
        from_headers:
          - named: "clientid"
            rename: "api_client"
            default: "UNKNOWN"
    prometheus:
      # Metrics accessible at URL path `/plugins/apollo.telemetry/prometheus`
      enabled: true
  tracing:
    trace_config:
      service_name: "${GRAPHQL_ROUTER_SERVICE_NAME:graphql-router}"
      service_namespace: "apollo"
      # Optional. Either a float between 0 and 1 or 'always_on' or 'always_off'
      sampler: 0.1
    propagation:
      # https://www.jaegertracing.io/ (compliant with opentracing)
      jaeger: true
    jaeger:
      collector:
        endpoint: "${JAEGER_COLLECTOR_ENDPOINT:http://jaeger-collector.observability:14268/api/traces}"

abernix modified the milestones: v1.0.0-alpha.1, v1.0.0-alpha.2 Aug 30, 2022

bnjjj commented Sep 5, 2022

Thanks @tripatti for creating this issue. I tried to reproduce it, but without success. What environment are you running the router on? Docker? I suspect the bug might be caused by your OS's clock or something related.


ghost commented Sep 5, 2022

We are running on Docker. One detail that is very clear: the sub-graph-specific sum is much, much higher than the operation-specific sum, i.e. the total number of seconds recorded for the target sub-graph is orders of magnitude higher than for the operation. In the example above the operation sum is 3.7 seconds; the sub-graph sum is truncated.

I just sampled production again and there the difference is 1000-fold. I do not think that would be caused by the OS clock?

abernix modified the milestones: v1.0.0-alpha.2, v1.0.0-alpha.3 Sep 6, 2022
bnjjj added a commit that referenced this issue Sep 7, 2022
…he stream each time (#1705)

close #1636

Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
abernix removed the triage label Sep 12, 2022