feat(outbound): Add response metrics to policy router #3086

olix0r · 2024-07-23T17:49:55Z

The outbound policy router includes a requests counter that measures the number of requests dispatched to each route-backend, but this does not provide visibility into success rate or response time. Before introducing timeouts and retries on outbound routes, this change introduces visibility into per-route response metrics.

The route_request_statuses counters measure responses from the application's point of view. Once retries are introduced, this will provide visibility into the effective success rate of each route.

outbound_http_route_request_statuses_total{parent...,route...,http_status="200",error="TIMEOUT"} 0
outbound_grpc_route_request_statuses_total{parent...,route...,grpc_status="NOT_FOUND",error="TIMEOUT"} 0

A coarse histogram is introduced at this scope to track the total duration of requests dispatched to each route, covering all retries and all response stream processing:

outbound_http_route_request_duration_seconds_sum{parent...,route...} 0
outbound_http_route_request_duration_seconds_count{parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="0.05",parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="0.5",parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="1.0",parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="10.0",parent...,route...} 0
outbound_http_route_request_duration_seconds_bucket{le="+Inf",parent...,route...} 0
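
To illustrate what a coarse histogram like this can and cannot resolve, here is a minimal sketch of a quantile estimate over cumulative `le` buckets, using linear interpolation within a bucket in the style of Prometheus's `histogram_quantile`. The function name and the bucket counts are hypothetical, chosen only for illustration; only the bucket boundaries come from the metrics listed above.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    Returns None when nothing has been observed.
    """
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: the best available
                # answer is the highest finite boundary.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for the route-level bucket boundaries.
example = [(0.05, 50), (0.5, 90), (1.0, 98), (10.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, example)
p99 = estimate_quantile(0.99, example)
```

With only one bucket between 1 and 10 seconds, any tail quantile that lands there is interpolated across a nine-second range, so estimates from this histogram are best read as rough indications rather than precise latencies.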

The route_backend_response_statuses counters measure the responses from individual backends. This reflects the actual success rate of each route as served by the backend services.

outbound_http_route_backend_response_statuses_total{parent...,route...,backend...,http_status="...",error="..."} 0
outbound_grpc_route_backend_response_statuses_total{parent...,route...,backend...,grpc_status="...",error="..."} 0
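
The difference between the two counter families can be sketched with a small simulation. This is an illustration under stated assumptions, not the proxy's implementation: the `dispatch` helper, the retry sequences, and the rule that any status below 500 counts as a success are all hypothetical.

```python
from collections import Counter


def dispatch(attempt_statuses, route_requests, backend_responses):
    """Record one application request served by one or more backend attempts.

    `attempt_statuses` lists the HTTP status of each attempt in order; the
    last entry is the response the application finally observes.
    """
    for status in attempt_statuses:
        backend_responses[status] += 1  # one increment per backend response
    route_requests[attempt_statuses[-1]] += 1  # one increment per request


def success_rate(counter):
    """Fraction of counted responses with a status below 500 (assumed rule)."""
    total = sum(counter.values())
    if total == 0:
        return None
    ok = sum(n for status, n in counter.items() if status < 500)
    return ok / total


route_requests = Counter()     # stands in for route_request_statuses
backend_responses = Counter()  # stands in for route_backend_response_statuses

# Hypothetical traffic: two requests succeed immediately, two need retries.
for attempts in ([200], [503, 200], [503, 503, 200], [200]):
    dispatch(attempts, route_requests, backend_responses)

effective = success_rate(route_requests)
actual = success_rate(backend_responses)
```

In this sketch every request eventually succeeds, so the route-level rate is perfect, while the backend-level rate still records the three failed attempts that retries masked.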

A slightly more detailed histogram is introduced at this scope to track the time spent processing responses from each backend (i.e. after the request has been fully dispatched):

outbound_http_route_backend_response_duration_seconds_sum{parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_count{parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.025",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.05",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.1",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.25",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="0.5",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="1.0",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="10.0",parent...,route...,backend...} 0
outbound_http_route_backend_response_duration_seconds_bucket{le="+Inf",parent...,route...,backend...} 0
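
As a rough sketch of how a single observation maps onto these series, the following records durations against the backend-level bucket boundaries and renders the cumulative `le` counts that a Prometheus scrape would show. The `Histogram` class and the observed durations are hypothetical; only the boundaries are taken from the list above.

```python
import bisect
import math

BACKEND_BOUNDS = [0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 10.0, math.inf]


class Histogram:
    def __init__(self, bounds):
        self.bounds = list(bounds)
        self.counts = [0] * len(self.bounds)  # per-bucket, not cumulative
        self.sum = 0.0
        self.count = 0

    def observe(self, seconds):
        # bisect_left finds the first boundary >= seconds, which is the
        # smallest bucket whose `le` bound includes the observation.
        index = bisect.bisect_left(self.bounds, seconds)
        self.counts[index] += 1
        self.sum += seconds
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, as exposed via `le`."""
        running = 0
        pairs = []
        for bound, n in zip(self.bounds, self.counts):
            running += n
            pairs.append((bound, running))
        return pairs


hist = Histogram(BACKEND_BOUNDS)
for duration in (0.03, 0.2, 0.2, 3.0):  # hypothetical response durations
    hist.observe(duration)
```

Each observation touches exactly one bucket plus the `_sum` and `_count` series; the cumulative view is derived when the metrics are rendered.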

Note that duration histograms omit status code labels, as they would needlessly inflate metrics cardinality. The histograms that we have introduced here are generally much more constrained, as we must choose broadly applicable buckets and want to avoid cardinality explosion when many routes are used.
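
The cardinality concern can be made concrete with some back-of-the-envelope arithmetic. In this sketch each histogram label set contributes one series per bucket plus `_sum` and `_count`; the route count and the number of distinct status values are hypothetical figures picked for illustration.

```python
ROUTE_BUCKETS = [0.05, 0.5, 1.0, 10.0, float("inf")]
BACKEND_BUCKETS = [0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 10.0, float("inf")]


def histogram_series(label_sets, buckets, status_values=1):
    """Count the time series one histogram family would expose.

    label_sets:    distinct parent/route(/backend) label combinations
    status_values: distinct status label values, 1 if the label is omitted
    """
    return label_sets * status_values * (len(buckets) + 2)


# Hypothetical deployment: 100 route label sets, 5 distinct status codes.
without_status = histogram_series(100, ROUTE_BUCKETS)
with_status = histogram_series(100, ROUTE_BUCKETS, status_values=5)
backend_without_status = histogram_series(100, BACKEND_BUCKETS)
```

Under these assumptions a status label multiplies the series count by the number of distinct statuses, which is why that dimension is kept on the counters and left off the histograms.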


olix0r requested a review from a team as a code owner July 23, 2024 17:49

olix0r merged commit 7c99d15 into main Jul 23, 2024
16 checks passed

olix0r deleted the ver/http-prom branch July 23, 2024 18:16

adleong mentioned this pull request Dec 13, 2024

linkerd viz stat-outbound reports incorrect latencies linkerd/linkerd2#13483

Open
