From 7bf5cb7e346e3f73c9af21719c3795637567d195 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Wed, 19 Mar 2025 02:05:13 +0000 Subject: [PATCH 01/17] A94: gRPC OTel metrics for Subchannels --- A94-subchannel-otel-metrics.md | 128 +++++++++++++++++++++++++++++++++ 1 file changed, 128 insertions(+) create mode 100644 A94-subchannel-otel-metrics.md diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md new file mode 100644 index 000000000..875ffa95d --- /dev/null +++ b/A94-subchannel-otel-metrics.md @@ -0,0 +1,128 @@ +## A94: OTel metrics for Subchannels + +* Author(s): Yash Tibrewal (yashkt@) +* Approver: Mark Roth (@markdroth) +* Status: Draft +* Implemented in: +* Last updated: Mar 14, 2025 +* Discussion at: (filled after thread exists) + +## Abstract + +Introduce metrics for subchannels. These metrics will replace the existing +pick-first metrics. + +## Background + +In [A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient], metrics for +PickFirst load-balancing policy were proposed that provide observability on +disconnections for subchannels and connection attempts made for those +subchannels. These metrics do not currently contain information on the reason +for disconnection, the xds locality or the cluster information. + +[A89: Backend Service Metric Label](https://github.com/grpc/proposal/pull/471) +is a proposal to introduce a new optional label `grpc.lb.backend_service` to +client-side per-attempt metrics. This label has xds cluster information. 
+ +### Related Proposals: + +* [A66: OpenTelemetry Metrics] +* [A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient] +* [A79: Non-per-call Metrics Architecture] +* [A89: Backend Service Metric Label] + +[A66: OpenTelemetry Metrics]: A66-otel-stats.md +[A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient]: A78-grpc-metrics-wrr-pf-xds.md +[A79: Non-per-call Metrics Architecture]: A79-non-per-call-metrics-architecture.md +[A89: Backend Service Metric Label]: https://github.com/grpc/proposal/pull/471 + +## Proposal + +Move the existing pick-first metrics to subchannel metrics +(`grpc.lb.pick_first.*` to `grpc.subchannel.*`) with the addition of optional +labels as shown below - + +Metric Name | Type | Unit | Labels | Description +------------------------------------------------------------------------------------------------------ | -------------- | --------------- | -------------------------------------------------------------------------------------------------------------- | ----------- +grpc.subchannel.disconnections (Old - grpc.lb.pick_first.disconnections) | Counter | {disconnection} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional), grpc.disconnect_error (optional) | Number of times the selected subchannel becomes disconnected. +grpc.subchannel.connection_attempts_succeeded (Old - grpc.lb.pick_first.connection_attempts_succeeded) | Counter | {attempt} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of successful connection attempts. +grpc.subchannel.connection_attempts_failed (Old - grpc.lb.pick_first.connection_attempts_failed) | Counter | {attempt} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of failed connection attempts. +grpc.subchannel.open_connections | UpDown Counter | {connection} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of open connections. 
+ +If we end up discarding connection attempts as we do with the “happy eyeballs” +algorithm (as per +[A61: IPv4 and IPv6 Dualstack Backend Support](A61-IPv4-IPv6-dualstack-backends.md)), +we should not record the connection attempt or the disconnection. + +Implementations that have already implemented the pick-first metrics should give +enough time for users to transition to the new metrics. For example, +implementations should report both the old pick-first metrics and the new +subchannel metrics for 2 releases, and then remove the old pick-first metrics. + +Label Name | Disposition | Description +----------------------- | ----------- | ----------- +grpc.target | Required | Indicates the target of the gRPC channel (defined in [A66](A66-otel-stats.md).) +grpc.lb.backend_service | Optional | The backend service to which the RPC was routed (defined in [A89](https://github.com/grpc/proposal/pull/471)) +grpc.lb.locality | Optional | The locality to which the traffic is being sent. This will be set to the resolver attribute passed down from the weighted_target policy, or the empty string if the resolver attribute is unset. (defined in [A78](A78-grpc-metrics-wrr-pf-xds.md)) +grpc.disconnect_error | Optional | Reason for disconnection + +List of allowed values for `grpc.disconnect_error` - + +Error string | Description +-------------------- | ----------- +GOAWAY | HTTP2 GOAWAY frame with error code for example (“GOAWAY NO_ERROR”, “GOAWAY PROTOCOL_ERROR”, “GOAWAY ENHANCE_YOUR_CALM”). The list of error codes is available in [RFC 9113](https://www.rfc-editor.org/rfc/rfc9113.html#name-error-codes). +subchannel shutdown | The subchannel was shutdown. This can happen due to reasons such as the parent channel shutting down, channel becoming idle, the load balancing policy changing due to a resolver update or a change in list of endpoint addresses. +connection reset | Connection was reset (eg. 
ECONNRESET, WSAECONNERESET) +connection timed out | Connection timed out, also includes connections closed due to gRPC keepalives. +connection aborted | Connection was aborted +socket error | Any socket error not covered by “connection reset”, “connection timed out” and “connection aborted”. Implementations that are not able to differentiate between the different socket error codes should also use this. +unknown | Catch-all for all other reasons. + +For a given connection, there can be multiple reasons reported to the subchannel +for disconnection. For example, a connection could have seen a GOAWAY frame with +`ENHANCE_YOUR_CALM` and then a socket error Broken Pipe. In such cases, the +first seen reason should be chosen, `GOAWAY ENHANCE_YOUR_CALM` in this case. + +We might add more error cases to this in the future. + +## Rationale + +### Renaming pick-first metrics + +The existing pick-first metrics provides stats on subchannel disconnections and +connection attempts as viewed from the perspective of the pick-first lb policy. +[A61: IPv4 and IPv6 Dualstack Backend Support](A61-IPv4-IPv6-dualstack-backends.md) +made pick-first lb policy the universal leaf policy. For users unfamiliar with +this, it will come as a surprise when metrics for pick-first lb policy are +populated when round_robin lb policy is configured (for example). Additionally, +the pick-first metrics are defined from the perspective of the channel. This +means that if subchannels are shared between multiple channels (as is the case +for gRPC Core and its wrapped languages - C++, Python), we will double-count the +disconnections/connection attempts. + +Renaming/moving the pick-first metrics to subchannel makes this more intuitive, +and fixes the double-counting problem. 
+ +### Metric for open connections + +Moving the metrics down to subchannel potentially allows us to calculate the +number of open connections by subtracting `grpc.subchannel.disconnections` from +`grpc.subchannel.connection_attempts_succeeded`. This method does not work for +exporters recording counters per period in a way that does not allow for a +simple subtraction of the two counters +(https://github.com/grpc/grpc/issues/34886). + +Adding an explicit metric that records the number of open connections avoids +this. + +### Use “connection timed out” as disconnect error for gRPC Keepalive timeouts + +We expect most implementations of [gRPC keepalives](A8-client-side-keepalive.md) +to also set the POSIX socket option `TCP_USER_TIMEOUT` as stated in +[A18: TCP User Timeout](A18-tcp-user-timeout.md). As such, in cases where the +connection is broken, sockets that would otherwise be closed due to gRPC +keepalive timing out, would instead be closed due to `TCP_USER_TIMEOUT`. + +## Implementation + +TBD From 3cca5d97fa8436eafd12ca4aabb5acecfbc88235 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Wed, 19 Mar 2025 02:19:18 +0000 Subject: [PATCH 02/17] Add discusion thread --- A94-subchannel-otel-metrics.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md index 875ffa95d..c42c6ad41 100644 --- a/A94-subchannel-otel-metrics.md +++ b/A94-subchannel-otel-metrics.md @@ -2,10 +2,10 @@ * Author(s): Yash Tibrewal (yashkt@) * Approver: Mark Roth (@markdroth) -* Status: Draft +* Status: In Review * Implemented in: -* Last updated: Mar 14, 2025 -* Discussion at: (filled after thread exists) +* Last updated: Mar 18, 2025 +* Discussion at: https://groups.google.com/g/grpc-io/c/iMdK7r4E5tU ## Abstract From d27143c7db752c5fc353c6cf198ae81e449a3a8f Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Wed, 2 Apr 2025 21:39:18 +0000 Subject: [PATCH 03/17] Reviewer comments --- 
A94-subchannel-otel-metrics.md | 87 ++++++++++++++++++++-------------- 1 file changed, 52 insertions(+), 35 deletions(-) diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md index c42c6ad41..056faf845 100644 --- a/A94-subchannel-otel-metrics.md +++ b/A94-subchannel-otel-metrics.md @@ -1,10 +1,10 @@ ## A94: OTel metrics for Subchannels -* Author(s): Yash Tibrewal (yashkt@) +* Author(s): Yash Tibrewal (yashykt@) * Approver: Mark Roth (@markdroth) * Status: In Review * Implemented in: -* Last updated: Mar 18, 2025 +* Last updated: 2025-04-02 * Discussion at: https://groups.google.com/g/grpc-io/c/iMdK7r4E5tU ## Abstract @@ -20,21 +20,26 @@ disconnections for subchannels and connection attempts made for those subchannels. These metrics do not currently contain information on the reason for disconnection, the xds locality or the cluster information. -[A89: Backend Service Metric Label](https://github.com/grpc/proposal/pull/471) -is a proposal to introduce a new optional label `grpc.lb.backend_service` to -client-side per-attempt metrics. This label has xds cluster information. +[A89] is a proposal to introduce a new optional label `grpc.lb.backend_service` +to client-side per-attempt metrics. This label has xds cluster information. 
### Related Proposals: -* [A66: OpenTelemetry Metrics] -* [A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient] -* [A79: Non-per-call Metrics Architecture] -* [A89: Backend Service Metric Label] - -[A66: OpenTelemetry Metrics]: A66-otel-stats.md -[A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient]: A78-grpc-metrics-wrr-pf-xds.md -[A79: Non-per-call Metrics Architecture]: A79-non-per-call-metrics-architecture.md -[A89: Backend Service Metric Label]: https://github.com/grpc/proposal/pull/471 +* [A8]: Client-side Keepalive +* [A18]: TCP User Timeout +* [A61]: IPv4 and IPv6 Dualstack Backend Support +* [A66]: OpenTelemetry Metrics +* [A78]: gRPC OTel Metrics for WRR, Pick First, and XdsClient +* [A79]: Non-per-call Metrics Architecture +* [A89]: Backend Service Metric Label + +[A8]: A8-client-side-keepalive.md +[A18]: A18-tcp-user-timeout.md +[A61]: A61-IPv4-IPv6-dualstack-backends.md +[A66]: A66-otel-stats.md +[A78]: A78-grpc-metrics-wrr-pf-xds.md +[A79]: A79-non-per-call-metrics-architecture.md +[A89]: A89-backend-service-metric-label.md ## Proposal @@ -50,31 +55,43 @@ grpc.subchannel.connection_attempts_failed (Old - grpc.lb.pick_first.connection_ grpc.subchannel.open_connections | UpDown Counter | {connection} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of open connections. If we end up discarding connection attempts as we do with the “happy eyeballs” -algorithm (as per -[A61: IPv4 and IPv6 Dualstack Backend Support](A61-IPv4-IPv6-dualstack-backends.md)), -we should not record the connection attempt or the disconnection. +algorithm (as per [A61]), we should not record the connection attempt or the +disconnection. Implementations that have already implemented the pick-first metrics should give enough time for users to transition to the new metrics. 
For example, implementations should report both the old pick-first metrics and the new subchannel metrics for 2 releases, and then remove the old pick-first metrics. -Label Name | Disposition | Description ------------------------ | ----------- | ----------- -grpc.target | Required | Indicates the target of the gRPC channel (defined in [A66](A66-otel-stats.md).) -grpc.lb.backend_service | Optional | The backend service to which the RPC was routed (defined in [A89](https://github.com/grpc/proposal/pull/471)) -grpc.lb.locality | Optional | The locality to which the traffic is being sent. This will be set to the resolver attribute passed down from the weighted_target policy, or the empty string if the resolver attribute is unset. (defined in [A78](A78-grpc-metrics-wrr-pf-xds.md)) -grpc.disconnect_error | Optional | Reason for disconnection +| Label Name | Disposition | Description | +| ----------------------- | ----------- | -------------------------- | +| grpc.target | Required | Indicates the target of | +: : : the gRPC channel (defined : +: : : in : +: : : [A66](A66-otel-stats.md).) : +| grpc.lb.backend_service | Optional | The backend service to | +: : : which the RPC was routed : +: : : (defined in [A89]) : +| grpc.lb.locality | Optional | The locality to which the | +: : : traffic is being sent. : +: : : This will be set to the : +: : : resolver attribute passed : +: : : down from the : +: : : weighted_target policy, or : +: : : the empty string if the : +: : : resolver attribute is : +: : : unset. (defined in [A78]) : +| grpc.disconnect_error | Optional | Reason for disconnection | List of allowed values for `grpc.disconnect_error` - Error string | Description -------------------- | ----------- GOAWAY | HTTP2 GOAWAY frame with error code for example (“GOAWAY NO_ERROR”, “GOAWAY PROTOCOL_ERROR”, “GOAWAY ENHANCE_YOUR_CALM”). The list of error codes is available in [RFC 9113](https://www.rfc-editor.org/rfc/rfc9113.html#name-error-codes). 
-subchannel shutdown | The subchannel was shutdown. This can happen due to reasons such as the parent channel shutting down, channel becoming idle, the load balancing policy changing due to a resolver update or a change in list of endpoint addresses. +subchannel shutdown | The subchannel was shutdown. This can happen due to reasons such as the parent channel shutting down, channel becoming idle, the load balancing policy changing due to a resolver update, or a change in list of endpoint addresses. connection reset | Connection was reset (eg. ECONNRESET, WSAECONNERESET) -connection timed out | Connection timed out, also includes connections closed due to gRPC keepalives. -connection aborted | Connection was aborted +connection timed out | Connection timed out (eg. ETIMEDOUT, WSAETIMEDOUT), also includes connections closed due to [A8]: gRPC keepalives. +connection aborted | Connection was aborted (eg. ECONNABORTED) socket error | Any socket error not covered by “connection reset”, “connection timed out” and “connection aborted”. Implementations that are not able to differentiate between the different socket error codes should also use this. unknown | Catch-all for all other reasons. @@ -91,9 +108,8 @@ We might add more error cases to this in the future. The existing pick-first metrics provides stats on subchannel disconnections and connection attempts as viewed from the perspective of the pick-first lb policy. -[A61: IPv4 and IPv6 Dualstack Backend Support](A61-IPv4-IPv6-dualstack-backends.md) -made pick-first lb policy the universal leaf policy. For users unfamiliar with -this, it will come as a surprise when metrics for pick-first lb policy are +[A61] made pick-first lb policy the universal leaf policy. For users unfamiliar +with this, it will come as a surprise when metrics for pick-first lb policy are populated when round_robin lb policy is configured (for example). Additionally, the pick-first metrics are defined from the perspective of the channel. 
This means that if subchannels are shared between multiple channels (as is the case @@ -115,13 +131,14 @@ simple subtraction of the two counters Adding an explicit metric that records the number of open connections avoids this. -### Use “connection timed out” as disconnect error for gRPC Keepalive timeouts +### Combining connection timeouts and keepalives into a single disconnection error -We expect most implementations of [gRPC keepalives](A8-client-side-keepalive.md) -to also set the POSIX socket option `TCP_USER_TIMEOUT` as stated in -[A18: TCP User Timeout](A18-tcp-user-timeout.md). As such, in cases where the -connection is broken, sockets that would otherwise be closed due to gRPC -keepalive timing out, would instead be closed due to `TCP_USER_TIMEOUT`. +We expect most implementations of [A8] to also set the POSIX socket option +`TCP_USER_TIMEOUT` with the same timeout value as stated in [A18]. As such, in +cases where the connection is broken, the keepalive timeout will race with +sockets being closed due to `TCP_USER_TIMEOUT`. Since the motive of the two +timers is essentially the same, we choose to combine them into a single error, +instead of trying to differentiate between them. 
## Implementation From 9cbb5ff15c0857e90927901c52c83f225ec6bd6c Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Wed, 2 Apr 2025 21:58:21 +0000 Subject: [PATCH 04/17] Add updated by tag to A78 --- A78-grpc-metrics-wrr-pf-xds.md | 1 + 1 file changed, 1 insertion(+) diff --git a/A78-grpc-metrics-wrr-pf-xds.md b/A78-grpc-metrics-wrr-pf-xds.md index e05047543..2397f2e60 100644 --- a/A78-grpc-metrics-wrr-pf-xds.md +++ b/A78-grpc-metrics-wrr-pf-xds.md @@ -6,6 +6,7 @@ A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient * Implemented in: * Last updated: 2024-09-24 * Discussion at: https://groups.google.com/g/grpc-io/c/A2Mqz8OMDys +* Updated by: [A94: OTel metrics for Subchannels](A94-subchannel-otel-metrics.md) ## Abstract From f561b9572fba7c6e5ce181b63c61ab0a97ed6368 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Thu, 3 Apr 2025 01:11:05 +0000 Subject: [PATCH 05/17] Add note on stability --- A94-subchannel-otel-metrics.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md index 056faf845..2110ef6f3 100644 --- a/A94-subchannel-otel-metrics.md +++ b/A94-subchannel-otel-metrics.md @@ -102,6 +102,13 @@ first seen reason should be chosen, `GOAWAY ENHANCE_YOUR_CALM` in this case. We might add more error cases to this in the future. +### Stability + +As recommended by [A79], these metrics will start off as experimental, and hence +off-by-default. The decision on whether these metrics will be on-by-default or +off-by-default on de-experimentalization will be made at the same time as the +de-experimentalization. 
+ ## Rationale ### Renaming pick-first metrics From 0897db6bd3e88f0c5b9e25ec427693dffba469b8 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Thu, 3 Apr 2025 01:17:18 +0000 Subject: [PATCH 06/17] Formatting --- A94-subchannel-otel-metrics.md | 37 +++++++++++++++------------------- 1 file changed, 16 insertions(+), 21 deletions(-) diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md index 2110ef6f3..19dec8aa3 100644 --- a/A94-subchannel-otel-metrics.md +++ b/A94-subchannel-otel-metrics.md @@ -9,8 +9,8 @@ ## Abstract -Introduce metrics for subchannels. These metrics will replace the existing -pick-first metrics. +Introduce OpenTelemetry metrics for subchannels. These metrics will replace the +existing pick-first metrics. ## Background @@ -63,25 +63,20 @@ enough time for users to transition to the new metrics. For example, implementations should report both the old pick-first metrics and the new subchannel metrics for 2 releases, and then remove the old pick-first metrics. -| Label Name | Disposition | Description | -| ----------------------- | ----------- | -------------------------- | -| grpc.target | Required | Indicates the target of | -: : : the gRPC channel (defined : -: : : in : -: : : [A66](A66-otel-stats.md).) : -| grpc.lb.backend_service | Optional | The backend service to | -: : : which the RPC was routed : -: : : (defined in [A89]) : -| grpc.lb.locality | Optional | The locality to which the | -: : : traffic is being sent. : -: : : This will be set to the : -: : : resolver attribute passed : -: : : down from the : -: : : weighted_target policy, or : -: : : the empty string if the : -: : : resolver attribute is : -: : : unset. (defined in [A78]) : -| grpc.disconnect_error | Optional | Reason for disconnection | +| Label Name | Disposition | Description | +| ----------------------- | ----------- | ------------------------------------ | +| grpc.target | Required | Indicates the target of the gRPC | +: : : channel (defined in [A66].) 
: +| grpc.lb.backend_service | Optional | The backend service to which the RPC | +: : : was routed (defined in [A89].) : +| grpc.lb.locality | Optional | The locality to which the traffic is | +: : : being sent. This will be set to the : +: : : resolver attribute passed down from : +: : : the weighted_target policy, or the : +: : : empty string if the resolver : +: : : attribute is unset (defined in : +: : : [A78].) : +| grpc.disconnect_error | Optional | Reason for disconnection. | List of allowed values for `grpc.disconnect_error` - From bee13c6d4ec75a2a6c3fd51403ebc08f88e1a1d5 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Thu, 3 Apr 2025 01:19:04 +0000 Subject: [PATCH 07/17] Add windows error code for connection aborted --- A94-subchannel-otel-metrics.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md index 19dec8aa3..aee5c81da 100644 --- a/A94-subchannel-otel-metrics.md +++ b/A94-subchannel-otel-metrics.md @@ -84,9 +84,9 @@ Error string | Description -------------------- | ----------- GOAWAY | HTTP2 GOAWAY frame with error code for example (“GOAWAY NO_ERROR”, “GOAWAY PROTOCOL_ERROR”, “GOAWAY ENHANCE_YOUR_CALM”). The list of error codes is available in [RFC 9113](https://www.rfc-editor.org/rfc/rfc9113.html#name-error-codes). subchannel shutdown | The subchannel was shutdown. This can happen due to reasons such as the parent channel shutting down, channel becoming idle, the load balancing policy changing due to a resolver update, or a change in list of endpoint addresses. -connection reset | Connection was reset (eg. ECONNRESET, WSAECONNERESET) +connection reset | Connection was reset (eg. ECONNRESET, WSAECONNERESET.) connection timed out | Connection timed out (eg. ETIMEDOUT, WSAETIMEDOUT), also includes connections closed due to [A8]: gRPC keepalives. -connection aborted | Connection was aborted (eg. ECONNABORTED) +connection aborted | Connection was aborted (eg. 
ECONNABORTED, WSAECONNABORTED.) socket error | Any socket error not covered by “connection reset”, “connection timed out” and “connection aborted”. Implementations that are not able to differentiate between the different socket error codes should also use this. unknown | Catch-all for all other reasons. From 062f6734b1ff91dc110ff9e059e534162cfb8780 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Thu, 1 May 2025 20:33:48 +0000 Subject: [PATCH 08/17] Add security level label --- A94-subchannel-otel-metrics.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md index 322202448..abe5d1798 100644 --- a/A94-subchannel-otel-metrics.md +++ b/A94-subchannel-otel-metrics.md @@ -27,6 +27,7 @@ to client-side per-attempt metrics. This label has xds cluster information. * [A8]: Client-side Keepalive * [A18]: TCP User Timeout * [A61]: IPv4 and IPv6 Dualstack Backend Support +* [A62]: gRPC security level negotiation between call credentials and channels * [A66]: OpenTelemetry Metrics * [A78]: gRPC OTel Metrics for WRR, Pick First, and XdsClient * [A79]: Non-per-call Metrics Architecture @@ -51,7 +52,7 @@ Metric Name grpc.subchannel.disconnections (Old - grpc.lb.pick_first.disconnections) | Counter | {disconnection} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional), grpc.disconnect_error (optional) | Number of times the selected subchannel becomes disconnected. grpc.subchannel.connection_attempts_succeeded (Old - grpc.lb.pick_first.connection_attempts_succeeded) | Counter | {attempt} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of successful connection attempts. grpc.subchannel.connection_attempts_failed (Old - grpc.lb.pick_first.connection_attempts_failed) | Counter | {attempt} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of failed connection attempts. 
-grpc.subchannel.open_connections | UpDown Counter | {connection} | grpc.target, grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of open connections. +grpc.subchannel.open_connections | UpDown Counter | {connection} | grpc.target, grpc.security_level (optional), grpc.lb.backend_service (optional), grpc.lb.locality (optional) | Number of open connections. If we end up discarding connection attempts as we do with the “happy eyeballs” algorithm (as per [A61]), we should not record the connection attempt or the @@ -68,6 +69,7 @@ grpc.target | Required | Indicates the target of the gRPC channel grpc.lb.backend_service | Optional | The backend service to which the RPC was routed (defined in [A89].) grpc.lb.locality | Optional | The locality to which the traffic is being sent. This will be set to the resolver attribute passed down from the weighted_target policy, or the empty string if the resolver attribute is unset (defined in [A78].) grpc.disconnect_error | Optional | Reason for disconnection. +grpc.security_level | Optional | Denotes the security level of the connection. Allowed values - "none", "integrity only" and "privacy and integrity". List of allowed values for `grpc.disconnect_error` - From ea53aab286e6f0f27fbe0172fa308bd0d630db8c Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Tue, 17 Jun 2025 21:08:41 +0000 Subject: [PATCH 09/17] Reviewer comments --- A94-subchannel-otel-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md index abe5d1798..3fc5225c3 100644 --- a/A94-subchannel-otel-metrics.md +++ b/A94-subchannel-otel-metrics.md @@ -69,7 +69,7 @@ grpc.target | Required | Indicates the target of the gRPC channel grpc.lb.backend_service | Optional | The backend service to which the RPC was routed (defined in [A89].) grpc.lb.locality | Optional | The locality to which the traffic is being sent. 
This will be set to the resolver attribute passed down from the weighted_target policy, or the empty string if the resolver attribute is unset (defined in [A78].) grpc.disconnect_error | Optional | Reason for disconnection. -grpc.security_level | Optional | Denotes the security level of the connection. Allowed values - "none", "integrity only" and "privacy and integrity". +grpc.security_level | Optional | Denotes the security level of the connection. Allowed values - "none", "integrity_only" and "privacy_and_integrity". List of allowed values for `grpc.disconnect_error` - From 2f47f5f62698ba7b2ab11f56e415089359e05ed4 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Tue, 17 Jun 2025 21:10:02 +0000 Subject: [PATCH 10/17] Reviewer comments --- A94-subchannel-otel-metrics.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md index 3fc5225c3..f705c3ffc 100644 --- a/A94-subchannel-otel-metrics.md +++ b/A94-subchannel-otel-metrics.md @@ -27,11 +27,11 @@ to client-side per-attempt metrics. This label has xds cluster information. * [A8]: Client-side Keepalive * [A18]: TCP User Timeout * [A61]: IPv4 and IPv6 Dualstack Backend Support -* [A62]: gRPC security level negotiation between call credentials and channels * [A66]: OpenTelemetry Metrics * [A78]: gRPC OTel Metrics for WRR, Pick First, and XdsClient * [A79]: Non-per-call Metrics Architecture * [A89]: Backend Service Metric Label +* [L62]: gRPC security level negotiation between call credentials and channels [A8]: A8-client-side-keepalive.md [A18]: A18-tcp-user-timeout.md @@ -40,6 +40,7 @@ to client-side per-attempt metrics. This label has xds cluster information. 
[A78]: A78-grpc-metrics-wrr-pf-xds.md [A79]: A79-non-per-call-metrics-architecture.md [A89]: A89-backend-service-metric-label.md +[L62]: L62-grpc-security-level-negotiation.md ## Proposal From f0572a0fc527c15b9d56f944b70e0b41934ffdeb Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Mon, 23 Jun 2025 21:15:01 +0000 Subject: [PATCH 11/17] Reviewer comment --- A94-subchannel-otel-metrics.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md index f705c3ffc..f0ca9825b 100644 --- a/A94-subchannel-otel-metrics.md +++ b/A94-subchannel-otel-metrics.md @@ -72,6 +72,11 @@ grpc.lb.locality | Optional | The locality to which the traffic is bei grpc.disconnect_error | Optional | Reason for disconnection. grpc.security_level | Optional | Denotes the security level of the connection. Allowed values - "none", "integrity_only" and "privacy_and_integrity". +The resolver attributes for the `grpc.lb.backend_service` and `grpc.lb.locality` +labels (defined in [A89] and [A78] respectively) will be passed into the +subchannel. This implies that the subchannel will be recreated when these +attributes change. 
+ List of allowed values for `grpc.disconnect_error` - Error string | Description From 6c8de90c86933b70a2774045118833c8ec833afc Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Wed, 25 Jun 2025 17:25:24 -0700 Subject: [PATCH 12/17] Fix github id --- A94-subchannel-otel-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md index f0ca9825b..ea8216b76 100644 --- a/A94-subchannel-otel-metrics.md +++ b/A94-subchannel-otel-metrics.md @@ -1,6 +1,6 @@ ## A94: OTel metrics for Subchannels -* Author(s): Yash Tibrewal (yashykt@) +* Author(s): Yash Tibrewal (@yashykt) * Approver: Mark Roth (@markdroth) * Status: In Review * Implemented in: From 7c664c2290ed6532b66ef4ed3b907f0866a63ffe Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Fri, 27 Jun 2025 16:55:24 -0700 Subject: [PATCH 13/17] Fix link --- A94-subchannel-otel-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md index ea8216b76..bb5f1ec1e 100644 --- a/A94-subchannel-otel-metrics.md +++ b/A94-subchannel-otel-metrics.md @@ -40,7 +40,7 @@ to client-side per-attempt metrics. This label has xds cluster information. 
[A78]: A78-grpc-metrics-wrr-pf-xds.md [A79]: A79-non-per-call-metrics-architecture.md [A89]: A89-backend-service-metric-label.md -[L62]: L62-grpc-security-level-negotiation.md +[L62]: L62-core-call-credential-security-level.md ## Proposal From d846946211a88321efada76b8e87d091d35821fc Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Tue, 1 Jul 2025 13:12:18 -0700 Subject: [PATCH 14/17] Reviewer comment --- A78-grpc-metrics-wrr-pf-xds.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/A78-grpc-metrics-wrr-pf-xds.md b/A78-grpc-metrics-wrr-pf-xds.md index fe5ec1331..7fb5777a5 100644 --- a/A78-grpc-metrics-wrr-pf-xds.md +++ b/A78-grpc-metrics-wrr-pf-xds.md @@ -4,7 +4,7 @@ A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient * Approver: @ejona86, @dfawley * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2024-09-24 +* Last updated: 2025-07-01 * Discussion at: https://groups.google.com/g/grpc-io/c/A2Mqz8OMDys * Updated by: [A88: xDS Data Error Handling](A88-xds-data-error-handling.md), [A94: OTel metrics for Subchannels](A94-subchannel-otel-metrics.md) @@ -103,7 +103,7 @@ The following metrics will be exported: | grpc.lb.wrr.endpoint_weight_stale | Counter | {endpoint} | grpc.target, grpc.lb.locality | Number of endpoints from each scheduler update whose latest weight is older than the expiration period. | | grpc.lb.wrr.endpoint_weights | Histogram | {weight} | grpc.target, grpc.lb.locality | Weight of each endpoint, recorded on every scheduler update. Endpoints without usable weights will be recorded as weight 0. | -### Pick First LB Policy +### [Outdated] Pick First LB Policy (Updated by [A94](A94-subchannel-otel-metrics.md)) The Pick First LB policy predates the gRFC process but was updated in [A62]. We propose to add the following metrics to it. 

From 61e4808338ede12530bd70d09de58763faf2128f Mon Sep 17 00:00:00 2001
From: Yash Tibrewal
Date: Thu, 10 Jul 2025 15:07:11 -0700
Subject: [PATCH 15/17] Reviewer comments

---
 A94-subchannel-otel-metrics.md | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md
index bb5f1ec1e..4a2d3cb5c 100644
--- a/A94-subchannel-otel-metrics.md
+++ b/A94-subchannel-otel-metrics.md
@@ -28,6 +28,7 @@ to client-side per-attempt metrics. This label has xds cluster information.
 * [A18]: TCP User Timeout
 * [A61]: IPv4 and IPv6 Dualstack Backend Support
 * [A66]: OpenTelemetry Metrics
+* [A74]: xDS Config Tears
 * [A78]: gRPC OTel Metrics for WRR, Pick First, and XdsClient
 * [A79]: Non-per-call Metrics Architecture
 * [A89]: Backend Service Metric Label
@@ -37,6 +38,7 @@ to client-side per-attempt metrics. This label has xds cluster information.
 [A18]: A18-tcp-user-timeout.md
 [A61]: A61-IPv4-IPv6-dualstack-backends.md
 [A66]: A66-otel-stats.md
+[A74]: A74-xds-config-tears.md
 [A78]: A78-grpc-metrics-wrr-pf-xds.md
 [A79]: A79-non-per-call-metrics-architecture.md
 [A89]: A89-backend-service-metric-label.md
@@ -72,10 +74,12 @@ grpc.lb.locality | Optional | The locality to which the traffic is bei
 grpc.disconnect_error | Optional | Reason for disconnection.
 grpc.security_level | Optional | Denotes the security level of the connection. Allowed values - "none", "integrity_only" and "privacy_and_integrity".
 
-The resolver attributes for the `grpc.lb.backend_service` and `grpc.lb.locality`
-labels (defined in [A89] and [A78] respectively) will be passed into the
-subchannel. This implies that the subchannel will be recreated when these
-attributes change.
+The subchannel needs to be passed attributes with the values for the
+`grpc.lb.backend_service` and `grpc.lb.locality` labels (defined in [A89] and
+[A78] respectively). This implies that the subchannel will be recreated when
+these attributes change. Since currently, only xDS is using these labels, the
+attributes will be set for each endpoint by cds (post-[A74]) or
+xds_cluster_resolver (pre-[A74]) LB policies.
 
 List of allowed values for `grpc.disconnect_error` -
 

From 9bbea0e40bc7277ab1de819083b77fcde2e16b7a Mon Sep 17 00:00:00 2001
From: Yash Tibrewal
Date: Fri, 11 Jul 2025 09:08:32 -0700
Subject: [PATCH 16/17] Reviewer comment

---
 A94-subchannel-otel-metrics.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md
index 4a2d3cb5c..ba4e81838 100644
--- a/A94-subchannel-otel-metrics.md
+++ b/A94-subchannel-otel-metrics.md
@@ -4,7 +4,7 @@
 * Approver: Mark Roth (@markdroth)
 * Status: In Review
 * Implemented in:
-* Last updated: 2025-04-02
+* Last updated: 2025-07-11
 * Discussion at: https://groups.google.com/g/grpc-io/c/iMdK7r4E5tU
 
 ## Abstract
@@ -78,7 +78,7 @@ The subchannel needs to be passed attributes with the values for the
 `grpc.lb.backend_service` and `grpc.lb.locality` labels (defined in [A89] and
 [A78] respectively). This implies that the subchannel will be recreated when
 these attributes change. Since currently, only xDS is using these labels, the
-attributes will be set for each endpoint by cds (post-[A74]) or
+attributes will be set for each endpoint or address by cds (post-[A74]) or
 xds_cluster_resolver (pre-[A74]) LB policies.
 
 List of allowed values for `grpc.disconnect_error` -

From 35dd385ca868372a45f232ebd9004992fe5e934b Mon Sep 17 00:00:00 2001
From: Yash Tibrewal
Date: Tue, 12 Aug 2025 16:00:00 -0700
Subject: [PATCH 17/17] Move status to ready for implementation

---
 A94-subchannel-otel-metrics.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/A94-subchannel-otel-metrics.md b/A94-subchannel-otel-metrics.md
index ba4e81838..f41d23b6d 100644
--- a/A94-subchannel-otel-metrics.md
+++ b/A94-subchannel-otel-metrics.md
@@ -1,10 +1,11 @@
 ## A94: OTel metrics for Subchannels
 
 * Author(s): Yash Tibrewal (@yashykt)
-* Approver: Mark Roth (@markdroth)
-* Status: In Review
+* Approver: Mark Roth (@markdroth), Eric Anderson (@ejona86), Doug Fawley
+  (@dfawley)
+* Status: Ready for Implementation
 * Implemented in:
-* Last updated: 2025-07-11
+* Last updated: 2025-08-12
 * Discussion at: https://groups.google.com/g/grpc-io/c/iMdK7r4E5tU
 
 ## Abstract