A94: OTel metrics for Subchannels #485
Merged
18 commits
* 7bf5cb7 A94: gRPC OTel metrics for Subchannels (yashykt)
* 3cca5d9 Add discusion thread (yashykt)
* d27143c Reviewer comments (yashykt)
* 9cbb5ff Add updated by tag to A78 (yashykt)
* f561b95 Add note on stability (yashykt)
* 0897db6 Formatting (yashykt)
* bee13c6 Add windows error code for connection aborted (yashykt)
* afd47ec Fix formatting (yashykt)
* 062f673 Add security level label (yashykt)
* ea53aab Reviewer comments (yashykt)
* 2f47f5f Reviewer comments (yashykt)
* f0572a0 Reviewer comment (yashykt)
* 6c8de90 Fix github id (yashykt)
* 7c664c2 Fix link (yashykt)
* d846946 Reviewer comment (yashykt)
* 61e4808 Reviewer comments (yashykt)
* 9bbea0e Reviewer comment (yashykt)
* 35dd385 Move status to ready for implementation (yashykt)

## A94: OTel metrics for Subchannels

* Author(s): Yash Tibrewal (@yashykt)
* Approver: Mark Roth (@markdroth), Eric Anderson (@ejona86), Doug Fawley
  (@dfawley)
* Status: Ready for Implementation
* Implemented in:
* Last updated: 2025-08-12
* Discussion at: https://groups.google.com/g/grpc-io/c/iMdK7r4E5tU

## Abstract

Introduce OpenTelemetry metrics for subchannels. These metrics will replace the
existing pick-first metrics.

## Background

[A78] proposed metrics for the PickFirst load-balancing policy that provide
observability into subchannel disconnections and the connection attempts made
for those subchannels. These metrics do not currently include the reason for a
disconnection, the xDS locality, or the cluster information.

[A89] proposes a new optional label, `grpc.lb.backend_service`, for client-side
per-attempt metrics. This label carries the xDS cluster information.

### Related Proposals:

* [A8]: Client-side Keepalive
* [A18]: TCP User Timeout
* [A61]: IPv4 and IPv6 Dualstack Backend Support
* [A66]: OpenTelemetry Metrics
* [A74]: xDS Config Tears
* [A78]: gRPC OTel Metrics for WRR, Pick First, and XdsClient
* [A79]: Non-per-call Metrics Architecture
* [A89]: Backend Service Metric Label
* [L62]: gRPC security level negotiation between call credentials and channels

[A8]: A8-client-side-keepalive.md
[A18]: A18-tcp-user-timeout.md
[A61]: A61-IPv4-IPv6-dualstack-backends.md
[A66]: A66-otel-stats.md
[A74]: A74-xds-config-tears.md
[A78]: A78-grpc-metrics-wrr-pf-xds.md
[A79]: A79-non-per-call-metrics-architecture.md
[A89]: A89-backend-service-metric-label.md
[L62]: L62-core-call-credential-security-level.md

## Proposal

Move the existing pick-first metrics to subchannel metrics
(`grpc.lb.pick_first.*` to `grpc.subchannel.*`), adding optional labels as
shown below:

Metric Name | Type | Unit | Labels | Description
----------- | ---- | ---- | ------ | -----------
grpc.subchannel.disconnections (Old - grpc.lb.pick_first.disconnections) | Counter | {disconnection} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional), grpc.disconnect_error (optional) | Number of times the selected subchannel becomes disconnected.
grpc.subchannel.connection_attempts_succeeded (Old - grpc.lb.pick_first.connection_attempts_succeeded) | Counter | {attempt} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of successful connection attempts.
grpc.subchannel.connection_attempts_failed (Old - grpc.lb.pick_first.connection_attempts_failed) | Counter | {attempt} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of failed connection attempts.
grpc.subchannel.open_connections | UpDown Counter | {connection} | grpc.target, grpc.security_level (optional), grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of open connections.

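The metrics table can also be read as data. The following is a minimal sketch, not part of the proposal: `MetricDescriptor`, `SUBCHANNEL_METRICS`, and `label_keys` are hypothetical names introduced here, and the idea that optional labels are recorded only when a user enables them is assumed from the optional-label model of [A79].

```python
from dataclasses import dataclass
from typing import Collection, Optional, Tuple


@dataclass(frozen=True)
class MetricDescriptor:
    """One row of the metrics table (hypothetical helper type)."""
    name: str
    kind: str  # "counter" or "up_down_counter"
    unit: str
    required_labels: Tuple[str, ...]
    optional_labels: Tuple[str, ...]
    old_name: Optional[str] = None  # pick-first metric this one replaces

    def label_keys(self, enabled_optional: Collection[str]) -> Tuple[str, ...]:
        """Required labels plus only those optional labels that are enabled."""
        enabled = tuple(k for k in self.optional_labels if k in enabled_optional)
        return self.required_labels + enabled


_XDS_LABELS = ("grpc.lb.backend_service", "grpc.lb.locality")

# Names, units and labels are copied from the table; the structure is illustrative.
SUBCHANNEL_METRICS = {
    m.name: m
    for m in (
        MetricDescriptor(
            "grpc.subchannel.disconnections", "counter", "{disconnection}",
            ("grpc.target",), _XDS_LABELS + ("grpc.disconnect_error",),
            old_name="grpc.lb.pick_first.disconnections",
        ),
        MetricDescriptor(
            "grpc.subchannel.connection_attempts_succeeded", "counter", "{attempt}",
            ("grpc.target",), _XDS_LABELS,
            old_name="grpc.lb.pick_first.connection_attempts_succeeded",
        ),
        MetricDescriptor(
            "grpc.subchannel.connection_attempts_failed", "counter", "{attempt}",
            ("grpc.target",), _XDS_LABELS,
            old_name="grpc.lb.pick_first.connection_attempts_failed",
        ),
        MetricDescriptor(
            "grpc.subchannel.open_connections", "up_down_counter", "{connection}",
            ("grpc.target",), ("grpc.security_level",) + _XDS_LABELS,
        ),
    )
}
```

For example, `SUBCHANNEL_METRICS["grpc.subchannel.disconnections"].label_keys({"grpc.disconnect_error"})` yields only `grpc.target` and `grpc.disconnect_error`, since the two xDS labels were not enabled.
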
If a connection attempt ends up being discarded, as happens with the “happy
eyeballs” algorithm (see [A61]), neither the connection attempt nor the
disconnection should be recorded.

Implementations that have already implemented the pick-first metrics should give
users enough time to transition to the new metrics. For example,
implementations should report both the old pick-first metrics and the new
subchannel metrics for two releases, and then remove the old pick-first metrics.

Label Name | Disposition | Description
---------- | ----------- | -----------
grpc.target | Required | Indicates the target of the gRPC channel (defined in [A66]).
grpc.lb.backend_service | Optional | The backend service to which the RPC was routed (defined in [A89]).
grpc.lb.locality | Optional | The locality to which the traffic is being sent. This will be set to the resolver attribute passed down from the weighted_target policy, or the empty string if the resolver attribute is unset (defined in [A78]).
grpc.disconnect_error | Optional | Reason for disconnection.
grpc.security_level | Optional | Denotes the security level of the connection. Allowed values: "none", "integrity_only" and "privacy_and_integrity".

The subchannel needs to be passed attributes carrying the values for the
`grpc.lb.backend_service` and `grpc.lb.locality` labels (defined in [A89] and
[A78] respectively). This implies that the subchannel will be recreated when
these attributes change. Since only xDS currently uses these labels, the
attributes will be set for each endpoint or address by the cds (post-[A74]) or
xds_cluster_resolver (pre-[A74]) LB policies.

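One way to picture why a change in these attributes recreates the subchannel is to treat them as part of the subchannel's identity. The sketch below assumes exactly that; `SubchannelKey`, `Subchannel`, and `SubchannelPool` are hypothetical names, and real implementations may key their subchannels differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SubchannelKey:
    """Identity of a subchannel (assumption: the label attributes are part of it)."""
    address: str
    backend_service: str = ""  # value for grpc.lb.backend_service
    locality: str = ""         # value for grpc.lb.locality


class Subchannel:
    def __init__(self, key: SubchannelKey) -> None:
        self.key = key

    def xds_labels(self) -> Dict[str, str]:
        """Label values this subchannel would attach to its metrics."""
        return {
            "grpc.lb.backend_service": self.key.backend_service,
            "grpc.lb.locality": self.key.locality,
        }


class SubchannelPool:
    """Returns the existing subchannel for a key, or creates a new one."""

    def __init__(self) -> None:
        self._subchannels: Dict[SubchannelKey, Subchannel] = {}

    def get_or_create(self, key: SubchannelKey) -> Subchannel:
        if key not in self._subchannels:
            self._subchannels[key] = Subchannel(key)
        return self._subchannels[key]
```

With this model, asking the pool for the same address under a different locality produces a distinct subchannel, so each subchannel's label values stay fixed for its lifetime.
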
List of allowed values for `grpc.disconnect_error`:

Error string | Description
------------ | -----------
GOAWAY <ERROR_CODE> | HTTP/2 GOAWAY frame with an error code, for example “GOAWAY NO_ERROR”, “GOAWAY PROTOCOL_ERROR”, “GOAWAY ENHANCE_YOUR_CALM”. The list of error codes is available in [RFC 9113](https://www.rfc-editor.org/rfc/rfc9113.html#name-error-codes).
subchannel shutdown | The subchannel was shut down. This can happen for reasons such as the parent channel shutting down, the channel becoming idle, the load balancing policy changing due to a resolver update, or a change in the list of endpoint addresses.
connection reset | Connection was reset (e.g. ECONNRESET, WSAECONNRESET).
connection timed out | Connection timed out (e.g. ETIMEDOUT, WSAETIMEDOUT). Also includes connections closed due to gRPC keepalives ([A8]).
connection aborted | Connection was aborted (e.g. ECONNABORTED, WSAECONNABORTED).
socket error | Any socket error not covered by “connection reset”, “connection timed out” and “connection aborted”. Implementations that are not able to differentiate between the different socket error codes should also use this.
unknown | Catch-all for all other reasons.

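As an illustration of the table, the mapping from low-level events to label values could look like the sketch below. The function names are hypothetical, only POSIX errno values are handled, and returning "unknown" for a GOAWAY code that is not registered in RFC 9113 is an assumption, since the proposal does not address that case.

```python
import errno

# HTTP/2 error codes as registered in RFC 9113, Section 7.
HTTP2_ERROR_CODES = {
    0x0: "NO_ERROR",
    0x1: "PROTOCOL_ERROR",
    0x2: "INTERNAL_ERROR",
    0x3: "FLOW_CONTROL_ERROR",
    0x4: "SETTINGS_TIMEOUT",
    0x5: "STREAM_CLOSED",
    0x6: "FRAME_SIZE_ERROR",
    0x7: "REFUSED_STREAM",
    0x8: "CANCEL",
    0x9: "COMPRESSION_ERROR",
    0xA: "CONNECT_ERROR",
    0xB: "ENHANCE_YOUR_CALM",
    0xC: "INADEQUATE_SECURITY",
    0xD: "HTTP_1_1_REQUIRED",
}

SUBCHANNEL_SHUTDOWN = "subchannel shutdown"
# Keepalive timeouts share a label with socket-level timeouts.
KEEPALIVE_TIMEOUT = "connection timed out"
UNKNOWN = "unknown"

_SOCKET_ERRNO_LABELS = {
    errno.ECONNRESET: "connection reset",
    errno.ETIMEDOUT: "connection timed out",
    errno.ECONNABORTED: "connection aborted",
}


def label_for_goaway(error_code: int) -> str:
    """Label for a received GOAWAY frame."""
    name = HTTP2_ERROR_CODES.get(error_code)
    # Assumption: unregistered codes fall back to the catch-all value.
    return f"GOAWAY {name}" if name is not None else UNKNOWN


def label_for_socket_errno(err: int) -> str:
    """Label for a socket error; anything unlisted becomes "socket error"."""
    return _SOCKET_ERRNO_LABELS.get(err, "socket error")
```
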
For a given connection, multiple disconnection reasons can be reported to the
subchannel. For example, a connection could see a GOAWAY frame with
`ENHANCE_YOUR_CALM` and then a Broken Pipe socket error. In such cases, the
first reason seen should be chosen: `GOAWAY ENHANCE_YOUR_CALM` in this example.

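The first-reason-wins rule amounts to a value that can be set only once. A minimal sketch, with the hypothetical class name `FirstDisconnectReason`:

```python
from typing import Optional


class FirstDisconnectReason:
    """Remembers only the first disconnection reason reported for a connection."""

    def __init__(self) -> None:
        self._reason: Optional[str] = None

    def report(self, reason: str) -> None:
        # Later reports are ignored once a reason has been recorded.
        if self._reason is None:
            self._reason = reason

    def label(self) -> str:
        """Value for grpc.disconnect_error; "unknown" if nothing was reported."""
        return self._reason if self._reason is not None else "unknown"
```

Reporting `GOAWAY ENHANCE_YOUR_CALM` followed by `socket error` leaves the label at `GOAWAY ENHANCE_YOUR_CALM`.
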
More error cases may be added to this list in the future.

### Stability

As recommended by [A79], these metrics will start off as experimental, and hence
off-by-default. Whether they become on-by-default or remain off-by-default will
be decided when they are de-experimentalized.

## Rationale

### Renaming pick-first metrics

The existing pick-first metrics provide stats on subchannel disconnections and
connection attempts as viewed from the perspective of the pick-first LB policy.
[A61] made pick-first the universal leaf policy. Users unfamiliar with this
will be surprised to see pick-first metrics populated when, for example, the
round_robin LB policy is configured. Additionally, the pick-first metrics are
defined from the perspective of the channel. This means that if subchannels are
shared between multiple channels (as is the case for gRPC Core and its wrapped
languages: C++, Python), disconnections and connection attempts are
double-counted.

Renaming/moving the pick-first metrics to the subchannel makes this more
intuitive and fixes the double-counting problem.

### Metric for open connections

Moving the metrics down to the subchannel potentially allows the number of open
connections to be calculated by subtracting `grpc.subchannel.disconnections`
from `grpc.subchannel.connection_attempts_succeeded`. This method does not work
for exporters that record counters per period in a way that does not allow a
simple subtraction of the two counters
(https://github.com/grpc/grpc/issues/34886).

Adding an explicit metric that records the number of open connections avoids
this problem.

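A small worked example may help show why per-period values complicate the subtraction. It is only an illustration with made-up numbers and hypothetical function names, and it does not reproduce the behavior of any particular exporter from the linked issue: subtracting the two counters within a single period gives the net change for that period, and the open-connection count only emerges after summing every period since the start.

```python
from itertools import accumulate
from typing import List, Sequence


def per_period_difference(succeeded: Sequence[int],
                          disconnections: Sequence[int]) -> List[int]:
    """Subtract the counters period by period: the net change, not the total."""
    return [s - d for s, d in zip(succeeded, disconnections)]


def open_connections(succeeded: Sequence[int],
                     disconnections: Sequence[int]) -> List[int]:
    """Open connections after each period, which needs the full history of deltas."""
    return list(accumulate(per_period_difference(succeeded, disconnections)))


# Made-up per-period deltas for three reporting periods.
succeeded_deltas = [3, 0, 1]
disconnection_deltas = [0, 1, 1]
```

Here the per-period differences are `[3, -1, 0]`, while the number of open connections after each period is `[3, 2, 2]`. An UpDown counter that is incremented on connect and decremented on disconnect reports the second series directly.
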
### Combining connection timeouts and keepalives into a single disconnection error

We expect most implementations of [A8] to also set the `TCP_USER_TIMEOUT`
socket option to the same timeout value, as described in [A18]. As such, when a
connection is broken, the keepalive timeout will race with the socket being
closed due to `TCP_USER_TIMEOUT`. Since the purpose of the two timers is
essentially the same, we choose to combine them into a single error instead of
trying to differentiate between them.

## Implementation

TBD