-
Notifications
You must be signed in to change notification settings - Fork 248
A96: OTel Metrics for Retries #488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
grpc.client.call.retry_delay | Histogram | s | grpc.method (required), grpc.target (required) | Total time of delay while there is no active attempt during the client call. | ||
Metric Name | Type | Unit | Labels | Description | ||
------------------------------------ | --------- | ------------------- | ---------------------------------------------- | ----------- | ||
grpc.client.call.retries | Histogram | {retry} | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need more than 5 as the upper bound here. When we implement estubs retries, there are modes where there is no limit on the number of retry attempts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we end up changing the limit on retries, we can update the advice here.
Otherwise, if we've already made enough progress on the new changes, please let me know what the better boundaries would be.
Metric Name | Type | Unit | Labels | Description | ||
------------------------------------ | --------- | ------------------- | ---------------------------------------------- | ----------- | ||
grpc.client.call.retries | Histogram | {retry} | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5]. | ||
grpc.client.call.transparent_retries | Histogram | {transparent_retry} | grpc.method (required), grpc.target (required) | Number of transparent retries during the client call. If there were no transparent retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5,10]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle, there is no upper bound on the number of transparent retry attempts if each attempt never actually goes out on the wire.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a 10 taking that in consideration. I think this should be enough. I'm not sure it makes sense to distinguish between cases where the number if larger than 10. (I'm willing to change the boundaries here if there are strong opinions. These are just recommendations anyway.)
------------------------------------ | --------- | ------------------- | ---------------------------------------------- | ----------- | ||
grpc.client.call.retries | Histogram | {retry} | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5]. | ||
grpc.client.call.transparent_retries | Histogram | {transparent_retry} | grpc.method (required), grpc.target (required) | Number of transparent retries during the client call. If there were no transparent retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5,10]. | ||
grpc.client.call.hedges | Histogram | {hedge} | grpc.method (required), grpc.target (required) | Number of hedges during the client call. If there were no hedges, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In estubs, the upper bound for the number of hedged requests can be quite large, so I think we'll want to support larger numbers here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we end up changing the limit on retries, we can update the advice here.
Otherwise, if we've already made enough progress on the new changes, please let me know what the better boundaries would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we've talked about this before, but I don't understand how expontential histograms can possibly work for aggregation. And in any case, we don't support them in the non-per-call metrics APIs, so I don't see how we can recommend that here.
Exponential Histograms are just a type of histograms. Our APIs do not need anything more to support exponential histograms. The OpenTelemetry SDK user can simply provide the buckets to use when creating the histogram view.
expontential histograms can possibly work for aggregation
Aggregation for histograms, which in most cases is calculating percentiles (50% / 95% / 99 %) is always approximate since we no longer have the exact values.
grpc.client.call.retry_delay | Histogram | s | grpc.method (required), grpc.target (required) | Total time of delay while there is no active attempt during the client call. | ||
Metric Name | Type | Unit | Labels | Description | ||
------------------------------------ | --------- | ------------------- | ---------------------------------------------- | ----------- | ||
grpc.client.call.retries | Histogram | {retry} | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we end up changing the limit on retries, we can update the advice here.
Otherwise, if we've already made enough progress on the new changes, please let me know what the better boundaries would be.
Metric Name | Type | Unit | Labels | Description | ||
------------------------------------ | --------- | ------------------- | ---------------------------------------------- | ----------- | ||
grpc.client.call.retries | Histogram | {retry} | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5]. | ||
grpc.client.call.transparent_retries | Histogram | {transparent_retry} | grpc.method (required), grpc.target (required) | Number of transparent retries during the client call. If there were no transparent retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5,10]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a 10 taking that in consideration. I think this should be enough. I'm not sure it makes sense to distinguish between cases where the number if larger than 10. (I'm willing to change the boundaries here if there are strong opinions. These are just recommendations anyway.)
------------------------------------ | --------- | ------------------- | ---------------------------------------------- | ----------- | ||
grpc.client.call.retries | Histogram | {retry} | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5]. | ||
grpc.client.call.transparent_retries | Histogram | {transparent_retry} | grpc.method (required), grpc.target (required) | Number of transparent retries during the client call. If there were no transparent retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5,10]. | ||
grpc.client.call.hedges | Histogram | {hedge} | grpc.method (required), grpc.target (required) | Number of hedges during the client call. If there were no hedges, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we end up changing the limit on retries, we can update the advice here.
Otherwise, if we've already made enough progress on the new changes, please let me know what the better boundaries would be.
No description provided.