feat(outbound): Add response metrics to policy router #3086
The outbound policy router includes a requests counter that measures the number of requests dispatched to each route-backend, but this does not provide visibility into success rate or response time. Before introducing timeouts and retries on outbound routes, this change introduces visibility into per-route response metrics.
The route_request_statuses counters measure responses from the application's point of view. Once retries are introduced, this will provide visibility into the effective success rate of each route.
A coarse histogram is introduced at this scope to track the total duration of requests dispatched to each route, covering all retries and all response stream processing:
The route_backend_response_statuses counters measure the responses from individual backends. This reflects the actual success rate of each route as served by the backend services.
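The difference between the two counter scopes can be illustrated with a minimal sketch. This is not the proxy's implementation: the `RouteMetrics` type, the tuple-shaped label keys, and the `record` helper are hypothetical, and the retry behavior is assumed for illustration only, since retries are not part of this change.

```rust
use std::collections::HashMap;

/// Hypothetical container for the two counter families described above.
#[derive(Default)]
struct RouteMetrics {
    /// Keyed by (route, status): one increment per request, as the application sees it.
    route_request_statuses: HashMap<(String, u16), u64>,
    /// Keyed by (route, backend, status): one increment per backend response.
    route_backend_response_statuses: HashMap<(String, String, u16), u64>,
}

impl RouteMetrics {
    /// Records one logical request. `attempts` lists each (backend, status)
    /// in order; the last entry is assumed to be what the application receives.
    fn record(&mut self, route: &str, attempts: &[(&str, u16)]) {
        // Every backend response counts at the route-backend scope.
        for &(backend, status) in attempts {
            *self
                .route_backend_response_statuses
                .entry((route.to_string(), backend.to_string(), status))
                .or_insert(0) += 1;
        }
        // Only the final status counts at the route scope.
        if let Some(&(_, status)) = attempts.last() {
            *self
                .route_request_statuses
                .entry((route.to_string(), status))
                .or_insert(0) += 1;
        }
    }
}
```

Under these assumptions, a request that fails on one backend and succeeds on a retry adds a failure and a success at the backend scope but only a success at the route scope, which is the "effective success rate" distinction drawn above.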
A slightly more detailed histogram is introduced at this scope to track the time spent processing responses from each backend (i.e. after the request has been fully dispatched):
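To make the shape of such a histogram concrete, here is a minimal sketch of a fixed-bucket duration histogram. The `DurationHistogram` type and the bucket bounds used with it are hypothetical and are not taken from this change; the sketch stores per-bucket counts rather than the cumulative counts a Prometheus exposition would report.

```rust
use std::time::Duration;

/// A fixed-bucket histogram over durations (illustrative only).
struct DurationHistogram {
    /// Inclusive upper bounds in ascending order; an implicit +Inf bucket follows.
    bounds: Vec<Duration>,
    /// `counts[i]` pairs with `bounds[i]`; the last entry is the +Inf bucket.
    counts: Vec<u64>,
    /// Sum of all observed durations.
    sum: Duration,
}

impl DurationHistogram {
    fn new(bounds: Vec<Duration>) -> Self {
        let counts = vec![0; bounds.len() + 1];
        Self { bounds, counts, sum: Duration::ZERO }
    }

    /// Places the observation in the first bucket whose bound is >= `d`,
    /// falling through to the +Inf bucket when no bound matches.
    fn observe(&mut self, d: Duration) {
        let idx = self
            .bounds
            .iter()
            .position(|b| d <= *b)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
        self.sum += d;
    }

    /// Total number of observations across all buckets.
    fn count(&self) -> u64 {
        self.counts.iter().sum()
    }
}
```

A coarser histogram simply uses fewer bounds; a "slightly more detailed" one adds a few more, at the cost of one extra time series per bound for every label combination.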
Note that the duration histograms omit status code labels, since those labels would needlessly inflate metrics cardinality. The histograms introduced here are generally much more constrained, as we must choose broadly applicable buckets and want to avoid cardinality explosion when many routes are used.
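The cardinality concern can be made concrete with some rough arithmetic. The sketch below assumes a Prometheus-style exposition, where each histogram emits one series per finite bucket, one for the `+Inf` bucket, and one each for `_sum` and `_count`. The `histogram_series` function and all of the numbers are hypothetical.

```rust
/// Estimates the number of time series produced by a per-route-backend
/// histogram, assuming Prometheus-style exposition.
fn histogram_series(
    routes: u64,
    backends_per_route: u64,
    finite_buckets: u64,
    status_label_values: u64,
) -> u64 {
    // Finite buckets, plus the +Inf bucket, plus _sum and _count.
    let series_per_label_set = finite_buckets + 1 + 2;
    routes * backends_per_route * status_label_values * series_per_label_set
}
```

For example, with 100 routes, 3 backends per route, and 5 finite buckets, `histogram_series(100, 3, 5, 1)` gives 2,400 series without a status label, while `histogram_series(100, 3, 5, 5)` gives 12,000 if five distinct status codes were attached as a label.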