A96: OTel Metrics for Retries #488
## A96: OTel Metrics for Retries

* Author: Yash Tibrewal (@yashykt)
* Approver(s): Mark Roth (@markdroth), Eric Anderson (@ejona86), Doug Fawley
  (@dfawley)
* Status: Ready for Implementation
* Implemented in:
* Last updated: 2025-07-01
* Discussion at: https://groups.google.com/g/grpc-io/c/bFUHkcBA9cw

## Abstract

This document proposes OpenTelemetry metrics for gRPC retries.

## Background

Since OpenCensus has been sunset in favor of OpenTelemetry, gRPC has been
developing its own OpenTelemetry plugin that is meant to replace the OpenCensus
plugin ([A66]). This document proposes the OpenTelemetry version of the retry
metrics originally proposed in [A45].

### Related Proposals:

* [A45]: Exposing OpenCensus Metrics and Tracing for gRPC retry
* [A66]: OpenTelemetry Metrics
* [A79]: Non-per-call Metrics Architecture

[A45]: A45-retry-stats.md
[A66]: A66-otel-stats.md
[A79]: A79-non-per-call-metrics-architecture.md

## Proposal

Metric Name                          | Type      | Unit                | Labels                                         | Description
------------------------------------ | --------- | ------------------- | ---------------------------------------------- | -----------
grpc.client.call.retries             | Histogram | {retry}             | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5].
grpc.client.call.transparent_retries | Histogram | {transparent_retry} | grpc.method (required), grpc.target (required) | Number of transparent retries during the client call. If there were no transparent retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5,10].
grpc.client.call.hedges              | Histogram | {hedge}             | grpc.method (required), grpc.target (required) | Number of hedges during the client call. If there were no hedges, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5].
grpc.client.call.retry_delay         | Histogram | s                   | grpc.method (required), grpc.target (required) | Total time of delay while there is no active attempt during the client call. Recommended to use the latency bucket boundaries defined in [A66].

> **Review thread on `grpc.client.call.retries`**
>
> Comment: We might need more than 5 as the upper bound here. When we implement estubs retries, there are modes where there is no limit on the number of retry attempts.
>
> Reply: I think if we end up changing the limit on retries, we can update the advice here. Otherwise, if we've already made enough progress on the new changes, please let me know what the better boundaries would be.

> **Review thread on `grpc.client.call.transparent_retries`**
>
> Comment: In principle, there is no upper bound on the number of transparent retry attempts if each attempt never actually goes out on the wire.
>
> Reply: I added a 10 taking that into consideration. I think this should be enough. I'm not sure it makes sense to distinguish between cases where the number is larger than 10. (I'm willing to change the boundaries here if there are strong opinions. These are just recommendations anyway.)

> **Review thread on `grpc.client.call.hedges`** (marked as resolved by dfawley)
>
> Comment: In estubs, the upper bound for the number of hedged requests can be quite large, so I think we'll want to support larger numbers here too.
>
> Reply: I think if we end up changing the limit on retries, we can update the advice here. Otherwise, if we've already made enough progress on the new changes, please let me know what the better boundaries would be.

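As an illustration of how the recommended bucket boundaries and the "0 is not reported" rule in the table interact, here is a minimal Python sketch. `ExplicitBucketHistogram` and `record_if_nonzero` are hypothetical names invented for this example; they are not part of gRPC or of the OpenTelemetry API. The sketch assumes the usual explicit-bucket convention in which each boundary is an inclusive upper bound.

```python
from bisect import bisect_left

# Recommended boundaries taken from the table in this proposal.
RETRY_BOUNDARIES = [1, 2, 3, 4, 5]
TRANSPARENT_RETRY_BOUNDARIES = [1, 2, 3, 4, 5, 10]


class ExplicitBucketHistogram:
    """Hypothetical stand-in for an explicit-bucket histogram instrument."""

    def __init__(self, boundaries):
        self.boundaries = list(boundaries)
        # One bucket per boundary plus one overflow bucket.
        self.bucket_counts = [0] * (len(self.boundaries) + 1)
        self.count = 0
        self.sum = 0

    def record(self, value):
        # Boundaries are inclusive upper bounds: bucket i holds values in
        # (boundaries[i - 1], boundaries[i]].
        self.bucket_counts[bisect_left(self.boundaries, value)] += 1
        self.count += 1
        self.sum += value


def record_if_nonzero(histogram, value):
    """Skip the recording entirely when the call had no retries."""
    if value > 0:
        histogram.record(value)


retries = ExplicitBucketHistogram(RETRY_BOUNDARIES)
for per_call_retries in [0, 0, 1, 3, 7]:  # hypothetical calls
    record_if_nonzero(retries, per_call_retries)
```

In this sketch the two calls with no retries never reach the histogram, and the call with 7 retries lands in the overflow bucket above 5, which is the case the review comments on larger upper bounds are concerned with.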
The labels `grpc.method` and `grpc.target` have been defined in [A66].

These metrics are recorded at the end of the call using the `CallTracer`
approach (also defined in [A66]).

Note that even if a client's configured policy doesn't specify a delay between
call attempts (e.g., a non-fatal status code on a hedged request), the
`CallTracer` can still record a non-zero `grpc.client.call.retry_delay` due to
the internal overhead of the retry mechanism.

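The end-of-call recording and the retry delay definition ("time while there is no active attempt") can be sketched as follows. This is not the gRPC `CallTracer` API: the class `RetryCallTracer`, its method names, the `kind` strings, and the `record(name, value)` sink are all hypothetical, and the clock is injected so the example is deterministic. The proposal does not say whether a zero `retry_delay` is skipped, so the sketch records it unconditionally as an assumption.

```python
class RetryCallTracer:
    """Hypothetical per-call tracer; not the gRPC CallTracer interface."""

    def __init__(self, clock):
        self._clock = clock          # callable returning seconds
        self._active_attempts = 0
        self._idle_since = None      # set when the last active attempt ends
        self.retries = 0
        self.transparent_retries = 0
        self.hedges = 0
        self.retry_delay = 0.0

    def start_attempt(self, kind="first"):
        # kind: "first", "retry", "transparent_retry" or "hedge" (hypothetical).
        if kind == "retry":
            self.retries += 1
        elif kind == "transparent_retry":
            self.transparent_retries += 1
        elif kind == "hedge":
            self.hedges += 1
        if self._active_attempts == 0 and self._idle_since is not None:
            # No attempt was active since _idle_since: count it as delay.
            self.retry_delay += self._clock() - self._idle_since
            self._idle_since = None
        self._active_attempts += 1

    def end_attempt(self):
        self._active_attempts -= 1
        if self._active_attempts == 0:
            self._idle_since = self._clock()

    def end_call(self, record):
        # All metrics are recorded once, at the end of the call.
        counts = {
            "grpc.client.call.retries": self.retries,
            "grpc.client.call.transparent_retries": self.transparent_retries,
            "grpc.client.call.hedges": self.hedges,
        }
        for name, value in counts.items():
            if value > 0:            # 0 is not reported
                record(name, value)
        # Assumption: the delay is recorded even when it is zero.
        record("grpc.client.call.retry_delay", self.retry_delay)


now = [0.0]                          # fake clock for the example
tracer = RetryCallTracer(clock=lambda: now[0])
tracer.start_attempt()               # first attempt starts at t=0
now[0] = 1.0
tracer.end_attempt()                 # first attempt fails at t=1
now[0] = 1.5
tracer.start_attempt("retry")        # retry starts at t=1.5
now[0] = 2.0
tracer.end_attempt()
recorded = {}
tracer.end_call(lambda name, value: recorded.__setitem__(name, value))
```

Because the delay is measured as the gap during which no attempt is active, overlapping hedged attempts add nothing to it, while any scheduling gap between one attempt ending and the next starting is counted even if the policy specifies no backoff.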
### Stability

As recommended by [A79], these metrics will start off as experimental, and hence
off-by-default. The decision on whether these metrics will be on-by-default or
off-by-default after de-experimentalization will be made at the time of
de-experimentalization.

## Rationale

OpenCensus Metric                           | Equivalent OpenTelemetry Metric
------------------------------------------- | -------------------------------
grpc.io/client/retries_per_call             | grpc.client.call.retries
grpc.io/client/retries                      | Sum of grpc.client.call.retries
grpc.io/client/transparent_retries_per_call | grpc.client.call.transparent_retries
grpc.io/client/transparent_retries          | Sum of grpc.client.call.transparent_retries
grpc.io/client/retry_delay_per_call         | grpc.client.call.retry_delay

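The "Sum of" rows in the mapping rely on a histogram carrying the sum of its recorded values alongside the bucket counts. A small Python sketch with made-up per-call numbers shows that skipping zero recordings leaves that sum unchanged, while the histogram's count no longer equals the number of calls:

```python
# Hypothetical retry counts for five client calls.
per_call_retries = [0, 2, 0, 1, 3]

# Zero is not reported, so only calls with retries reach the histogram.
recorded = [n for n in per_call_retries if n > 0]

histogram_count = len(recorded)  # number of calls that had retries
histogram_sum = sum(recorded)    # total retries across all calls

# The sum matches the total retry count that the OpenCensus
# grpc.io/client/retries measure tracked.
total_retries = sum(per_call_retries)
```

One consequence of this, under the same assumptions, is that the histogram count cannot be used as a total call count; that number has to come from another per-call metric.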
The names of the metrics proposed for the OpenTelemetry version follow the
general OpenTelemetry
[semantic conventions](https://opentelemetry.io/docs/specs/semconv/general/metrics/),
[naming guidelines](https://opentelemetry.io/docs/specs/semconv/general/naming/),
the existing per-call gRPC OpenTelemetry metrics (proposed in [A66]) and the
gRPC OpenTelemetry metric instrument naming conventions (proposed in [A79]).

### Separate metrics for retries and hedges

The OpenCensus version of the retry metrics combined the number of retry and
hedging attempts under a single OpenCensus measure,
`grpc.io/client/retries_per_call`, and had a separate measure for transparent
retries, `grpc.io/client/transparent_retries_per_call`. Since different
channels/clients can be configured differently, for example, some with a retry
policy and others with a hedging policy, it is useful to differentiate between
retry attempts and hedging attempts as well. This would also help in the future
if we allow both retry and hedging policies to be configured on a client at the
same time.

### Label addition - grpc.target

The OpenCensus version did not have an equivalent `grpc.target` label on the
retry metrics. This label has been added to the OpenTelemetry version to stay in
line with the other per-call metrics defined in [A66].

## Implementation

* C++ - Will be implemented by @yashykt
* Java - Will be implemented by @agravator
* Go - TBD
* Python - TBD