feat(outbound): Add response metrics to policy router #3086

Merged: olix0r merged 1 commit into main from ver/http-prom on Jul 23, 2024

@olix0r olix0r commented Jul 23, 2024

The outbound policy router includes a requests counter that measures the number of requests dispatched to each route-backend, but this does not provide visibility into success rate or response time. Before introducing timeouts and retries on outbound routes, this change introduces visibility into per-route response metrics.

The route_request_statuses counters measure responses from the application's point of view. Once retries are introduced, this will provide visibility into the effective success rate of each route.

outbound_http_route_request_statuses_total{parent...,route...,http_status="200",error="TIMEOUT"} 0
outbound_grpc_route_request_statuses_total{parent...,route...,grpc_status="NOT_FOUND",error="TIMEOUT"} 0
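As a rough illustration of how an effective success rate could be derived from these counters, the sketch below sums samples by label set. The classification rule (a sample counts as a failure if `error` is non-empty, `http_status` is missing, or the status is 5xx), the use of an empty `error` value for successful responses, and the sample values are all assumptions made for the example, not behavior confirmed by this change.

```python
def success_rate(samples):
    """Compute a success ratio from (labels, value) counter samples.

    Assumed rule (hypothetical): a sample is a failure when its `error`
    label is non-empty, its `http_status` is missing, or the status is 5xx.
    """
    total = 0.0
    ok = 0.0
    for labels, value in samples:
        total += value
        status = labels.get("http_status", "")
        failed = bool(labels.get("error")) or not status or int(status) >= 500
        if not failed:
            ok += value
    # With no observations there is no meaningful rate.
    return ok / total if total else None


# Hypothetical counter values for a single route.
samples = [
    ({"http_status": "200", "error": ""}, 90),
    ({"http_status": "503", "error": ""}, 6),
    ({"http_status": "", "error": "TIMEOUT"}, 4),
]
rate = success_rate(samples)
```

In practice this aggregation would more likely be expressed as a query in the metrics backend; the function only makes the arithmetic explicit.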

A coarse histogram is introduced at this scope to track the total duration of requests dispatched to each route, covering all retries and all response stream processing:

outbound_http_route_request_duration_seconds_sum{parent...,route...} 0
outbound_http_route_request_duration_seconds_count{parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="0.05",parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="0.5",parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="1.0",parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="10.0",parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="+Inf",parent...,route...} 0
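To make the bucket layout above concrete, here is a minimal sketch of how observations land in cumulative `le` buckets. The bounds are copied from the listing above; the function name and the sample durations are hypothetical, and this is not the proxy's implementation.

```python
import math

# Upper bounds taken from the route-level histogram listing.
ROUTE_BUCKETS = [0.05, 0.5, 1.0, 10.0, math.inf]


def observe_all(durations, bounds=ROUTE_BUCKETS):
    """Build cumulative bucket counts, plus sum and count, for durations in seconds."""
    counts = {le: 0 for le in bounds}
    for d in durations:
        for le in bounds:
            # Buckets are cumulative: an observation increments every
            # bucket whose upper bound is at least the observed value.
            if d <= le:
                counts[le] += 1
    return {"buckets": counts, "sum": sum(durations), "count": len(durations)}


# Hypothetical request durations: 20ms, 300ms, 700ms, and 12s.
hist = observe_all([0.02, 0.3, 0.7, 12.0])
```

With these inputs the 12s request appears only in the `+Inf` bucket, which shows how little resolution a coarse histogram offers beyond its largest finite bound.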

The route_backend_response_statuses counters measure the responses from individual backends. This reflects the actual success rate of each route as served by the backend services.

outbound_http_route_backend_response_statuses_total{parent...,route...,backend...,http_status="...",error="..."} 0
outbound_grpc_route_backend_response_statuses_total{parent...,route...,backend...,grpc_status="...",error="..."} 0

A slightly more detailed histogram is introduced at this scope to track the time spent processing responses from each backend (i.e. after the request has been fully dispatched):

outbound_http_route_backend_response_duration_seconds_sum{parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_count{parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.025",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.05",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.1",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.25",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.5",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="1.0",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="10.0",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="+Inf",parent...,route...,backend...} 0
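The finer bounds matter when estimating latency quantiles from bucket counts. The sketch below uses linear interpolation within a bucket, similar in spirit to Prometheus's `histogram_quantile`; the function name and the cumulative counts are hypothetical and chosen only to illustrate the calculation.

```python
import math


def estimate_quantile(q, cumulative):
    """Estimate quantile q from (le, cumulative_count) pairs sorted by le.

    The last pair is expected to be the +Inf bucket.
    """
    total = cumulative[-1][1]
    if total == 0:
        return None
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in cumulative:
        if count >= rank:
            if math.isinf(le) or count == prev_count:
                # No finite upper bound (or an empty bucket) to interpolate in.
                return prev_le
            # Assume observations are spread evenly within the bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le


# Hypothetical cumulative counts using the backend-level bounds listed above.
backend_buckets = [
    (0.025, 10), (0.05, 40), (0.1, 80), (0.25, 95),
    (0.5, 99), (1.0, 100), (10.0, 100), (math.inf, 100),
]
p50 = estimate_quantile(0.5, backend_buckets)
p95 = estimate_quantile(0.95, backend_buckets)
```

Any estimate is only as precise as the bucket it falls in, which is the trade-off made by keeping the bucket set small.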

Note that the duration histograms omit status code labels, as these would needlessly inflate metrics cardinality. The histograms introduced here are generally much more constrained, as we must choose broadly applicable buckets and want to avoid cardinality explosion when many routes are used.
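The cardinality concern can be made concrete with some back-of-the-envelope arithmetic. In the sketch below, the route count (100) and the number of distinct status values (10) are hypothetical; the bucket count (5, including `+Inf`) comes from the route-level histogram above.

```python
def histogram_series(routes, bucket_bounds, status_values=1):
    """Count time series exported for one histogram metric family.

    Each label set exports one `_bucket` series per bound, plus `_sum`
    and `_count`.
    """
    per_label_set = bucket_bounds + 2
    return routes * status_values * per_label_set


# Hypothetical deployment: 100 routes, route-level histogram with 5 bounds.
without_status = histogram_series(100, 5)
# The same histogram if it also carried a status label with 10 observed values.
with_status = histogram_series(100, 5, status_values=10)
```

Under these assumptions the status label would grow the histogram from 700 to 7,000 series, while the status counters already capture the per-status breakdown.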

@olix0r olix0r requested a review from a team as a code owner July 23, 2024 17:49
@olix0r olix0r merged commit 7c99d15 into main Jul 23, 2024
16 checks passed
@olix0r olix0r deleted the ver/http-prom branch July 23, 2024 18:16