98 changes: 98 additions & 0 deletions A96-retry-otel-stats.md
## A96: OTel Metrics for Retries

* Author: Yash Tibrewal (@yashykt)
* Approver(s): Mark Roth (@markdroth), Eric Anderson (@ejona86), Doug Fawley
(@dfawley)
* Status: Ready for Implementation
* Implemented in:
* Last updated: 2025-07-01
* Discussion at: https://groups.google.com/g/grpc-io/c/bFUHkcBA9cw

## Abstract

Propose OpenTelemetry metrics for gRPC retries.

## Background

Since OpenCensus has been sunset in favor of OpenTelemetry, gRPC has been
developing its own OpenTelemetry plugin that is meant to replace the OpenCensus
plugin ([A66]). This document proposes the OpenTelemetry version of the retry
metrics originally proposed in [A45].

### Related Proposals:

* [A45]: Exposing OpenCensus Metrics and Tracing for gRPC retry
* [A66]: OpenTelemetry Metrics
* [A79]: Non-per-call Metrics Architecture

[A45]: A45-retry-stats.md
[A66]: A66-otel-stats.md
[A79]: A79-non-per-call-metrics-architecture.md

## Proposal

Metric Name | Type | Unit | Labels | Description
------------------------------------ | --------- | ------------------- | ---------------------------------------------- | -----------
grpc.client.call.retries | Histogram | {retry} | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5].
grpc.client.call.transparent_retries | Histogram | {transparent_retry} | grpc.method (required), grpc.target (required) | Number of transparent retries during the client call. If there were no transparent retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5,10].
grpc.client.call.hedges | Histogram | {hedge} | grpc.method (required), grpc.target (required) | Number of hedges during the client call. If there were no hedges, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5].
grpc.client.call.retry_delay | Histogram | s | grpc.method (required), grpc.target (required) | Total time of delay while there is no active attempt during the client call. Recommended to use the latency bucket boundaries defined in [A66].

Review comments on the `grpc.client.call.retries` row:

> **Member:** We might need more than 5 as the upper bound here. When we implement estubs retries, there are modes where there is no limit on the number of retry attempts.
>
> **Member Author:** I think if we end up changing the limit on retries, we can update the advice here.
>
> Otherwise, if we've already made enough progress on the new changes, please let me know what the better boundaries would be.

Review comments on the `grpc.client.call.transparent_retries` row:

> **Member:** In principle, there is no upper bound on the number of transparent retry attempts if each attempt never actually goes out on the wire.
>
> **Member Author:** I added a 10 taking that into consideration. I think this should be enough. I'm not sure it makes sense to distinguish between cases where the number is larger than 10. (I'm willing to change the boundaries here if there are strong opinions. These are just recommendations anyway.)

Review comments on the `grpc.client.call.hedges` row:

> **Member:** In estubs, the upper bound for the number of hedged requests can be quite large, so I think we'll want to support larger numbers here too.
>
> **Member Author:** I think if we end up changing the limit on retries, we can update the advice here.
>
> Otherwise, if we've already made enough progress on the new changes, please let me know what the better boundaries would be.

The labels `grpc.method` and `grpc.target` have been defined in [A66].
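To make the recommended bucket boundaries concrete, here is a minimal sketch of how observations would be distributed across buckets. It assumes upper-inclusive boundaries, as in OpenTelemetry's explicit-bucket histogram aggregation. `ExplicitBucketHistogram` is a hypothetical helper written for this illustration, not a gRPC or OpenTelemetry API.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class ExplicitBucketHistogram:
    """Toy histogram with upper-inclusive bucket boundaries (illustration only)."""

    boundaries: list
    bucket_counts: list = field(default_factory=list)
    count: int = 0
    total: float = 0.0

    def __post_init__(self):
        # One bucket per boundary, plus a final overflow bucket.
        self.bucket_counts = [0] * (len(self.boundaries) + 1)

    def record(self, value):
        # bisect_left places a value equal to a boundary in that boundary's
        # bucket, which gives the upper-inclusive behavior assumed here.
        self.bucket_counts[bisect_left(self.boundaries, value)] += 1
        self.count += 1
        self.total += value


# Boundaries recommended for grpc.client.call.retries in the table above.
retries = ExplicitBucketHistogram([1, 2, 3, 4, 5])
for observed in (1, 1, 3, 5, 8):  # made-up per-call retry counts
    retries.record(observed)

print(retries.bucket_counts)
```

With these boundaries, every call that retried more than 5 times falls into the single overflow bucket, which is the concern the review comments raise for policies without an attempt limit.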

These metrics are recorded at the end of the call utilizing the `CallTracer`
approach (also defined in [A66]).

Note that even if a client's configured policy doesn't specify a delay between
call attempts (e.g., a non-fatal status code on a hedged request), the
`CallTracer` may still record a non-zero `grpc.client.call.retry_delay` due to
the internal overhead of the retry mechanism.
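The end-of-call recording described above can be sketched as follows. This is not the `CallTracer` API from [A66]; `Attempt` and `summarize_call` are hypothetical names, and the sketch assumes that the delay is the sum of gaps between attempts (time before the first attempt is not counted) and that the delay value is always returned, neither of which this document specifies.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    start: float  # seconds on a monotonic clock
    end: float
    kind: str  # "initial", "retry", "transparent_retry", or "hedge"


METRIC_NAMES = {
    "retry": "grpc.client.call.retries",
    "transparent_retry": "grpc.client.call.transparent_retries",
    "hedge": "grpc.client.call.hedges",
}


def summarize_call(attempts):
    """Return the values a tracer might record once the call has ended."""
    counts = {kind: 0 for kind in METRIC_NAMES}
    for attempt in attempts:
        if attempt.kind in counts:
            counts[attempt.kind] += 1

    # Sum the gaps during which no attempt was active. Overlapping attempts
    # (as with hedging) extend the covered interval and add no delay.
    delay = 0.0
    covered_until = None
    for attempt in sorted(attempts, key=lambda a: a.start):
        if covered_until is not None and attempt.start > covered_until:
            delay += attempt.start - covered_until
        if covered_until is None or attempt.end > covered_until:
            covered_until = attempt.end

    # Zero counts are omitted, matching "0 is not reported" in the table.
    recorded = {METRIC_NAMES[k]: n for k, n in counts.items() if n > 0}
    # Assumption: the delay is returned even when it is zero.
    recorded["grpc.client.call.retry_delay"] = delay
    return recorded


example = summarize_call([
    Attempt(0.0, 1.0, "initial"),
    Attempt(1.5, 2.0, "retry"),
    Attempt(2.0, 3.0, "transparent_retry"),
])
print(example)
```

Each returned value would be recorded once per call with the `grpc.method` and `grpc.target` labels attached.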

### Stability

As recommended by [A79], these metrics will start off as experimental, and hence
off-by-default. Whether they become on-by-default or remain off-by-default will
be decided when they are de-experimentalized.

## Rationale

OpenCensus Metric | Equivalent OpenTelemetry Metric
------------------------------------------- | -------------------------------
grpc.io/client/retries_per_call | grpc.client.call.retries
grpc.io/client/retries | Sum of grpc.client.call.retries
grpc.io/client/transparent_retries_per_call | grpc.client.call.transparent_retries
grpc.io/client/transparent_retries | Sum of grpc.client.call.transparent_retries
grpc.io/client/retry_delay_per_call | grpc.client.call.retry_delay
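One reading of the "Sum of" rows above is that the OpenCensus aggregate corresponds to the histogram's sum field, added up across label sets. The sketch below illustrates that reading. `HistogramPoint` and `aggregate_total` are hypothetical names, and the data is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistogramPoint:
    method: str  # grpc.method label
    target: str  # grpc.target label
    count: int  # number of calls that reported a value
    total: int  # sum of the reported per-call values


def aggregate_total(points):
    """Total retries across all label sets, derived from histogram sums."""
    return sum(point.total for point in points)


points = [
    HistogramPoint("pkg.Service/Get", "dns:///example.test", count=3, total=5),
    HistogramPoint("pkg.Service/Put", "dns:///example.test", count=1, total=2),
]
print(aggregate_total(points))
```

Because calls without retries report nothing, `count` here is the number of calls that retried at least once, not the total number of calls.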

The names of the metrics proposed for the OpenTelemetry version follow the
general OpenTelemetry
[semantic conventions](https://opentelemetry.io/docs/specs/semconv/general/metrics/)
and [naming guidelines](https://opentelemetry.io/docs/specs/semconv/general/naming/),
the existing per-call gRPC OpenTelemetry metrics (proposed in [A66]), and the
gRPC OpenTelemetry metric instrument naming conventions (proposed in [A79]).

### Separate metrics for retries and hedges

The OpenCensus version of the retry metrics combined the number of retry and
hedging attempts under a single OpenCensus measure,
`grpc.io/client/retries_per_call`, and had a separate measure for transparent
retries, `grpc.io/client/transparent_retries_per_call`. Since different
channels/clients can be configured differently, for example, some with a retry
policy and others with a hedging policy, it is useful to differentiate between
retry attempts and hedging attempts. This would also help in the future if we
allow both retry and hedging policies to be configured on a client at the same
time.

### Label addition - grpc.target

The OpenCensus version did not have an equivalent `grpc.target` label on the
retry metrics. This label has been added to the OpenTelemetry version keeping in
line with the other per-call metrics defined in [A66].

## Implementation

* C++ - Will be implemented by @yashykt
* Java - Will be implemented by @agravator
* Go - TBD
* Python - TBD