From 2241fe5edb4f35b8895cea4fc045f24095590a80 Mon Sep 17 00:00:00 2001 From: ericdbishop Date: Wed, 4 Dec 2024 14:41:49 -0500 Subject: [PATCH 01/13] gep: add GEP-3388 HTTPRoute Retry Budget --- geps/gep-3388/index.md | 134 ++++++++++++++++++++++++++++++++++++ geps/gep-3388/metadata.yaml | 45 ++++++++++++ 2 files changed, 179 insertions(+) create mode 100644 geps/gep-3388/index.md create mode 100644 geps/gep-3388/metadata.yaml diff --git a/geps/gep-3388/index.md b/geps/gep-3388/index.md new file mode 100644 index 0000000000..a5186091b4 --- /dev/null +++ b/geps/gep-3388/index.md @@ -0,0 +1,134 @@ +# GEP-3388: HTTPRoute Retry Budget + +* Issue: [#3388](https://github.com/kubernetes-sigs/gateway-api/issues/3388) +* Status: Provisional + +(See status definitions [here](/geps/overview/#gep-states).) + +## TLDR + +To allow budgeted retry configuration of a Gateway, in order to retry unsuccessful requests based on a percentage of it's +active request load, as opposed to a static max retry value. + +## Goals + +* To allow specification of a retry + ["budget"](https://finagle.github.io/blog/2016/02/08/retry-budgets/) to + determine whether a request should be retried, and any shared configuration + or interaction with max count retry configuration. +* To allow specification of the percentage of active requests that should be able to be retried at the same time. +* To allow specification of the minimum number of retries that should be + allowed per second or concurrently, such that the budget for retries never + goes below this minimum value. +* To define a standard for retry budgets that reconciles the known + differences in retry budget functionality between Gateway API implementations. + +## Future Goals + +## Non-Goals + +## Introduction + +Multiple data plane proxies offer optional configuration for budgeted retries, +either as a circuit breaker threshold for concurrent retries or as an +alternative for configuring a +static retry limit for client retries. 
In the case of Linkerd, retry budgets +are the default retry policy configuration for HTTP retries, with static max +retries being a fairly recent addition. + +Configuring a limit for client retries is an important factor in building a +resilient system in order to +allow for requests to be successfully retried during periods of intermittent +failure. But too many client-side retries can also exacerbate consistent +failures and slow down recovery, quickly overwhelming a failing +system and leading to retry +storms. Configuring a sane +limit for max client-side retries is often challenging in complex +systems. Allowing an application developer (Ana) to instead configure a dynamic +"retry budget" prevents them from needing to decide on a static max retry value +that will perform as expected in both times of high & low request load, as well +as periods of intermittent or consistent failures. + +While HTTPRoute retry budget configuration has been a frequently discussed +feature within the community, differences in semantics between different data +plane proxies +creates a challenge for a consensus on the correct location for the +configuration. + +Envoy, for example, offers retry budgets as a configurable circuit breaker threshold +for concurrent retries to an upstream cluster. In Istio, Envoy circuit breaker +thresholds are typically configured [within the DestinationRule +CRD](https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-HTTPSettings), +which +applies rules to clients of a service after routing has already occurred. +The linkerd implementation of +retry budgets is configured on specific routes, and instead limits the number +of total retry attempts as a percentage of original requests. This creates a +challenge for +defining where retry budget's should be configured within the Gateway API, +and how data plane proxies may need to be altered to accommodate the correct +path forward. 
If Istio were to implement Envoy's retry budget threshold also +at the per-route level in their API, retry budget +configuration would need to be introduced within [the VirtualService +CRD](https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry). +Envoy's retry budget threshold does not address overall retry attempts on the +client-side, though. A potential solution would be for Envoy to additionally +allow a budget for retry *attempts* as well as a concurrent retry threshold. + +When configuring a retry budget on the route, you +subsequently need to define this value for each one. Defining a single +retry budget threshold for a destination is a simpler approach. + +### Background on implementations + + +#### Envoy + +Supports configuring a [RetryBudget](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-msg-config-cluster-v3-circuitbreakers-thresholds-retrybudget) with a following parameters in cluster CircuitBreaker thresholds. + +* `budget_percent` Specifies the limit on concurrent retries as a percentage of the sum of active requests and active pending requests. For example, if there are 100 active requests and the budget_percent is set to 25, there may be 25 active retries. This parameter is optional. Defaults to 20%. + +* `min_retry_concurrency` Specifies the minimum retry concurrency allowed for the retry budget. The limit on the number of active retries may never go below this number. This parameter is optional. Defaults to 3. + +#### NGINX + + +#### HAProxy + + +#### Traefik + +Supports configuration of a [Circuit Breaker](https://doc.traefik.io/traefik/middlewares/http/circuitbreaker/) which could possibly be used to implement budgeted retries. Each router gets its own instance of a given circuit breaker. One circuit breaker instance can be open while the other remains closed: their state is not shared. 
This is the expected behavior, we want you to be able to define what makes a service healthy without having to declare a circuit breaker for each route. + +#### linkerd2-proxy + +Linkerd supports [budgeted retries](https://linkerd.io/2.15/features/retries-and-timeouts/) and - as of [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5) - counted retries. In all cases, retries are implemented by the `linkerd2-proxy` making the request on behalf on an application workload. + +Linkerd's budgeted retries allow retrying an indefinite number of times, as long as the fraction of retries remains within the budget. Budgeted retries are supported only using Linkerd's native ServiceProfile CRD, which allows enabling retries, setting the retry budget (by default, 20% plus 10 "extra" retries per second), and configuring the window over which the fraction of retries to non-retries is calculated. + +## API + +### Go + + +### YAML + +## Conformance Details + + +## Alternatives + +### Policy Attachment + +## Other considerations + +### What accommodations are needed for retry budget support? + +Changing the retry stanza to a Kubernetes "tagged union" pattern with something like `mode: "budget"` to support mutually-exclusive distinct sibling fields is possible as a non-breaking change if omitting the `mode` field defaults to the currently proposed behavior (which could retroactively become something like `mode: count`). + +## References + +* +* +* +* diff --git a/geps/gep-3388/metadata.yaml b/geps/gep-3388/metadata.yaml new file mode 100644 index 0000000000..beb073820f --- /dev/null +++ b/geps/gep-3388/metadata.yaml @@ -0,0 +1,45 @@ +apiVersion: internal.gateway.networking.k8s.io/v1alpha1 +kind: GEPDetails +number: 1731 +name: HTTPRoute Retries +status: Experimental +# Any authors who contribute to the GEP in any way should be listed here using +# their Github handle. 
+authors: + - mikemorris +relationships: + # obsoletes indicates that a GEP makes the linked GEP obsolete, and completely + # replaces that GEP. The obsoleted GEP MUST have its obsoletedBy field + # set back to this GEP, and MUST be moved to Declined. + obsoletes: {} + obsoletedBy: {} + # extends indicates that a GEP extends the linkned GEP, adding more detail + # or additional implementation. The extended GEP MUST have its extendedBy + # field set back to this GEP. + extends: {} + extendedBy: {} + # seeAlso indicates other GEPs that are relevant in some way without being + # covered by an existing relationship. + seeAlso: + - number: 2257 + name: Gateway API Duration Format + description: Uses duration format introduced in this GEP. + - number: 1742 + name: HTTPRoute Timeouts + description: Covers some overlapping considerations around when requests should be retried. +# references is a list of hyperlinks to relevant external references. +# It's intended to be used for storing Github discussions, Google docs, etc. +references: + - https://www.rfc-editor.org/rfc/rfc9110 + - https://www.iana.org/assignments/http-status-codes/http-status-codes.xhtml +# featureNames is a list of the feature names introduced by the GEP, if there +# are any. This will allow us to track which feature was introduced by which GEP. +featureNames: + - SupportHTTPRouteRetry + - SupportHTTPRouteRetryBackendTimeout + - SupportHTTPRouteRetryBackoff + - SupportHTTPRouteRetryCodes + - SupportHTTPRouteRetryConnectionError +# changelog is a list of hyperlinks to PRs that make changes to the GEP, in +# ascending date order. 
+changelog: {} From e27468c8e4c02d9356fa64ad7ee6afd42d496184 Mon Sep 17 00:00:00 2001 From: ericdbishop Date: Wed, 4 Dec 2024 15:05:47 -0500 Subject: [PATCH 02/13] Update metadata for gep-1731 & gep-3388 --- geps/gep-1731/metadata.yaml | 4 +++- geps/gep-3388/metadata.yaml | 30 ++++++++++-------------------- 2 files changed, 13 insertions(+), 21 deletions(-) diff --git a/geps/gep-1731/metadata.yaml b/geps/gep-1731/metadata.yaml index beb073820f..e8a7f661e2 100644 --- a/geps/gep-1731/metadata.yaml +++ b/geps/gep-1731/metadata.yaml @@ -17,7 +17,9 @@ relationships: # or additional implementation. The extended GEP MUST have its extendedBy # field set back to this GEP. extends: {} - extendedBy: {} + extendedBy: + - number: 3388 + name: HTTPRoute Retry Budget # seeAlso indicates other GEPs that are relevant in some way without being # covered by an existing relationship. seeAlso: diff --git a/geps/gep-3388/metadata.yaml b/geps/gep-3388/metadata.yaml index beb073820f..c044495512 100644 --- a/geps/gep-3388/metadata.yaml +++ b/geps/gep-3388/metadata.yaml @@ -1,11 +1,12 @@ apiVersion: internal.gateway.networking.k8s.io/v1alpha1 kind: GEPDetails -number: 1731 -name: HTTPRoute Retries -status: Experimental +number: 3388 +name: HTTPRoute Retry Budget +status: Provisional # Any authors who contribute to the GEP in any way should be listed here using # their Github handle. authors: + - ericdbishop - mikemorris relationships: # obsoletes indicates that a GEP makes the linked GEP obsolete, and completely @@ -16,30 +17,19 @@ relationships: # extends indicates that a GEP extends the linkned GEP, adding more detail # or additional implementation. The extended GEP MUST have its extendedBy # field set back to this GEP. - extends: {} + extends: + - number: 1731 + name: HTTPRoute Retries extendedBy: {} # seeAlso indicates other GEPs that are relevant in some way without being # covered by an existing relationship. 
- seeAlso: - - number: 2257 - name: Gateway API Duration Format - description: Uses duration format introduced in this GEP. - - number: 1742 - name: HTTPRoute Timeouts - description: Covers some overlapping considerations around when requests should be retried. + seeAlso: {} # references is a list of hyperlinks to relevant external references. # It's intended to be used for storing Github discussions, Google docs, etc. -references: - - https://www.rfc-editor.org/rfc/rfc9110 - - https://www.iana.org/assignments/http-status-codes/http-status-codes.xhtml +references: {} # featureNames is a list of the feature names introduced by the GEP, if there # are any. This will allow us to track which feature was introduced by which GEP. -featureNames: - - SupportHTTPRouteRetry - - SupportHTTPRouteRetryBackendTimeout - - SupportHTTPRouteRetryBackoff - - SupportHTTPRouteRetryCodes - - SupportHTTPRouteRetryConnectionError +featureNames: {} # changelog is a list of hyperlinks to PRs that make changes to the GEP, in # ascending date order. changelog: {} From b9df3f41b25e2224861739db7d07d78855ef8938 Mon Sep 17 00:00:00 2001 From: ericdbishop Date: Wed, 4 Dec 2024 15:36:39 -0500 Subject: [PATCH 03/13] Correcting information and readability --- geps/gep-3388/index.md | 48 +++++++++++++++++++----------------------- 1 file changed, 22 insertions(+), 26 deletions(-) diff --git a/geps/gep-3388/index.md b/geps/gep-3388/index.md index a5186091b4..4bff3fd40d 100644 --- a/geps/gep-3388/index.md +++ b/geps/gep-3388/index.md @@ -7,8 +7,9 @@ ## TLDR -To allow budgeted retry configuration of a Gateway, in order to retry unsuccessful requests based on a percentage of it's -active request load, as opposed to a static max retry value. +To allow configuration of a "retry budget" in HTTPRoute, to make total +client-side retries a percentage of a destination service's active request +load, in place of configuring a static max count retry value. 
## Goals @@ -16,12 +17,14 @@ active request load, as opposed to a static max retry value. ["budget"](https://finagle.github.io/blog/2016/02/08/retry-budgets/) to determine whether a request should be retried, and any shared configuration or interaction with max count retry configuration. -* To allow specification of the percentage of active requests that should be able to be retried at the same time. -* To allow specification of the minimum number of retries that should be +* To allow specification of a percentage of active requests that should be able + to be retried concurrently. +* To allow specification of a *minimum* number of retries that should be allowed per second or concurrently, such that the budget for retries never goes below this minimum value. * To define a standard for retry budgets that reconciles the known - differences in retry budget functionality between Gateway API implementations. + differences in retry budget functionality between Gateway API data plane + implementations. ## Future Goals @@ -30,24 +33,23 @@ active request load, as opposed to a static max retry value. ## Introduction Multiple data plane proxies offer optional configuration for budgeted retries, -either as a circuit breaker threshold for concurrent retries or as an +either as a circuit breaker threshold for concurrent retries, or as an alternative for configuring a static retry limit for client retries. In the case of Linkerd, retry budgets are the default retry policy configuration for HTTP retries, with static max -retries being a fairly recent addition. +retries being a [fairly recent addition](https://linkerd.io/2024/08/13/announcing-linkerd-2.16/). Configuring a limit for client retries is an important factor in building a -resilient system in order to -allow for requests to be successfully retried during periods of intermittent -failure. 
But too many client-side retries can also exacerbate consistent +resilient system, allowing requests to be successfully retried during periods +of intermittent failure. But too many client-side retries can also exacerbate consistent failures and slow down recovery, quickly overwhelming a failing system and leading to retry storms. Configuring a sane limit for max client-side retries is often challenging in complex systems. Allowing an application developer (Ana) to instead configure a dynamic -"retry budget" prevents them from needing to decide on a static max retry value +"retry budget", prevents them from needing to decide on a static max retry value that will perform as expected in both times of high & low request load, as well -as periods of intermittent or consistent failures. +as both during periods of intermittent & consistent failures. While HTTPRoute retry budget configuration has been a frequently discussed feature within the community, differences in semantics between different data @@ -62,22 +64,16 @@ CRD](https://istio.io/latest/docs/reference/config/networking/destination-rule/# which applies rules to clients of a service after routing has already occurred. The linkerd implementation of -retry budgets is configured on specific routes, and instead limits the number -of total retry attempts as a percentage of original requests. This creates a -challenge for -defining where retry budget's should be configured within the Gateway API, -and how data plane proxies may need to be altered to accommodate the correct -path forward. If Istio were to implement Envoy's retry budget threshold also -at the per-route level in their API, retry budget +retry budgets is configured within the ServiceProfile CRD, limiting the number +of total retry attempts across routes as a percentage of original requests. 
+This creates a question of where retry budget's should be defined within the +Gateway API, +and whether data plane proxies may need to be altered to accommodate the correct +path forward. If Istio were to implement Envoy's retry budget +threshold where +routing occurs in their API, retry budget configuration would need to be introduced within [the VirtualService CRD](https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry). -Envoy's retry budget threshold does not address overall retry attempts on the -client-side, though. A potential solution would be for Envoy to additionally -allow a budget for retry *attempts* as well as a concurrent retry threshold. - -When configuring a retry budget on the route, you -subsequently need to define this value for each one. Defining a single -retry budget threshold for a destination is a simpler approach. ### Background on implementations From 538bb6124268fac1c6158ba650f3b2838d83c4b5 Mon Sep 17 00:00:00 2001 From: ericdbishop Date: Sun, 8 Dec 2024 13:51:09 -0500 Subject: [PATCH 04/13] Improve background information and context; remove unused sub-sections --- geps/gep-3388/index.md | 80 ++++++++++++++++++++---------------------- 1 file changed, 39 insertions(+), 41 deletions(-) diff --git a/geps/gep-3388/index.md b/geps/gep-3388/index.md index 4bff3fd40d..ed4adca945 100644 --- a/geps/gep-3388/index.md +++ b/geps/gep-3388/index.md @@ -7,48 +7,53 @@ ## TLDR -To allow configuration of a "retry budget" in HTTPRoute, to make total -client-side retries a percentage of a destination service's active request -load, in place of configuring a static max count retry value. +To allow configuration of a "retry budget" in HTTPRoute, to limit the rate of +client-side retries based on a percentage of the active request load across all +endpoints of a destination service. 
## Goals * To allow specification of a retry ["budget"](https://finagle.github.io/blog/2016/02/08/retry-budgets/) to determine whether a request should be retried, and any shared configuration - or interaction with max count retry configuration. -* To allow specification of a percentage of active requests that should be able + or interaction with configuration of a static retry limit within HTTPRoute. +* To allow specification of a percentage of active requests, or recently active + requests, that should be able to be retried concurrently. * To allow specification of a *minimum* number of retries that should be allowed per second or concurrently, such that the budget for retries never goes below this minimum value. * To define a standard for retry budgets that reconciles the known - differences in retry budget functionality between Gateway API data plane - implementations. - -## Future Goals + differences in current retry budget functionality between Gateway API data + plane implementations. ## Non-Goals +* To allow specifying a default retry budget policy across a namespace or attached to a specific gateway. +* To allow configuration of a backoff strategy or timeout window within the retry budget spec. +* To allow specifying inclusion of specific HTTP status codes and responses within the retry budget spec. +* To allow specification of more than one retry budget for a + givne service, for specific subsets of its traffic. + + ## Introduction Multiple data plane proxies offer optional configuration for budgeted retries, -either as a circuit breaker threshold for concurrent retries, or as an -alternative for configuring a -static retry limit for client retries. In the case of Linkerd, retry budgets -are the default retry policy configuration for HTTP retries, with static max +in order to create a dynamic limit on the amount of a service's active request that is being retried across its clients. 
In the case of Linkerd, retry budgets +are the default retry policy configuration for HTTP retries within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), with static max retries being a [fairly recent addition](https://linkerd.io/2024/08/13/announcing-linkerd-2.16/). Configuring a limit for client retries is an important factor in building a resilient system, allowing requests to be successfully retried during periods of intermittent failure. But too many client-side retries can also exacerbate consistent failures and slow down recovery, quickly overwhelming a failing -system and leading to retry +system and leading to cascading failures such as retry storms. Configuring a sane limit for max client-side retries is often challenging in complex -systems. Allowing an application developer (Ana) to instead configure a dynamic -"retry budget", prevents them from needing to decide on a static max retry value -that will perform as expected in both times of high & low request load, as well +systems. Allowing an application developer (Ana) to configure a dynamic +"retry budget", reducing the risk of a high number of retries across clients, +allows a service to perform as expected in both times of high & low request +load, as well as both during periods of intermittent & consistent failures. While HTTPRoute retry budget configuration has been a frequently discussed @@ -58,49 +63,42 @@ creates a challenge for a consensus on the correct location for the configuration. Envoy, for example, offers retry budgets as a configurable circuit breaker threshold -for concurrent retries to an upstream cluster. In Istio, Envoy circuit breaker +for concurrent retries to an upstream cluster, in favor of configuring a static +active retry threshold. 
In Istio, Envoy circuit breaker thresholds are typically configured [within the DestinationRule CRD](https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-HTTPSettings), which applies rules to clients of a service after routing has already occurred. The linkerd implementation of -retry budgets is configured within the ServiceProfile CRD, limiting the number -of total retry attempts across routes as a percentage of original requests. -This creates a question of where retry budget's should be defined within the +retry budgets is configured alongside service route configuration, within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), limiting the number +of total retries for a service as a percentage of the number of recent requests. +This proposal aims to determine where retry budget's should be defined within the Gateway API, -and whether data plane proxies may need to be altered to accommodate the correct -path forward. If Istio were to implement Envoy's retry budget -threshold where -routing occurs in their API, retry budget -configuration would need to be introduced within [the VirtualService -CRD](https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry). +and whether data plane proxies may need to be altered to accommodate the +specification. ### Background on implementations - #### Envoy -Supports configuring a [RetryBudget](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-msg-config-cluster-v3-circuitbreakers-thresholds-retrybudget) with a following parameters in cluster CircuitBreaker thresholds. +Supports configuring a +[RetryBudget](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-msg-config-cluster-v3-circuitbreakers-thresholds-retrybudget) +CircuitBreaker threshold across a group of upstream endpoints, with the following parameters. 
 * `budget_percent` Specifies the limit on concurrent retries as a percentage of the sum of active requests and active pending requests. For example, if there are 100 active requests and the budget_percent is set to 25, there may be 25 active retries. This parameter is optional. Defaults to 20%. * `min_retry_concurrency` Specifies the minimum retry concurrency allowed for the retry budget. The limit on the number of active retries may never go below this number. This parameter is optional. Defaults to 3. -#### NGINX - - -#### HAProxy - - -#### Traefik - -Supports configuration of a [Circuit Breaker](https://doc.traefik.io/traefik/middlewares/http/circuitbreaker/) which could possibly be used to implement budgeted retries. Each router gets its own instance of a given circuit breaker. One circuit breaker instance can be open while the other remains closed: their state is not shared. This is the expected behavior, we want you to be able to define what makes a service healthy without having to declare a circuit breaker for each route. - #### linkerd2-proxy -Linkerd supports [budgeted retries](https://linkerd.io/2.15/features/retries-and-timeouts/) and - as of [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5) - counted retries. In all cases, retries are implemented by the `linkerd2-proxy` making the request on behalf on an application workload. +Linkerd supports [budgeted retries](https://linkerd.io/2.15/features/retries-and-timeouts/), the default way to specify retries to a service, and - as of [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5) - counted retries. In all cases, retries are implemented by the `linkerd2-proxy` making the request on behalf of an application workload. -Linkerd's budgeted retries allow retrying an indefinite number of times, as long as the fraction of retries remains within the budget.
Budgeted retries are supported only using Linkerd's native ServiceProfile CRD, which allows enabling retries, setting the retry budget (by default, 20% plus 10 "extra" retries per second), and configuring the window over which the fraction of retries to non-retries is calculated. +Linkerd's budgeted retries allow retrying an indefinite number of times, as +long as the fraction of retries remains within the budget. Budgeted retries are +supported only using Linkerd's native ServiceProfile CRD, which allows enabling +retries, setting the retry budget (by default, 20% plus 10 "extra" retries per +second), and configuring the window over which the fraction of retries to +non-retries is calculated. ## API From a5eaa93854100f12a94f4629e0b79d7047ee740d Mon Sep 17 00:00:00 2001 From: ericdbishop Date: Sun, 8 Dec 2024 14:03:13 -0500 Subject: [PATCH 05/13] Mark some sections as TODO --- geps/gep-3388/index.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/geps/gep-3388/index.md b/geps/gep-3388/index.md index ed4adca945..553a552f9c 100644 --- a/geps/gep-3388/index.md +++ b/geps/gep-3388/index.md @@ -104,18 +104,26 @@ non-retries is calculated. ### Go +TODO ### YAML +TODO + ## Conformance Details +TODO ## Alternatives ### Policy Attachment +TODO + ## Other considerations +TODO + ### What accommodations are needed for retry budget support? Changing the retry stanza to a Kubernetes "tagged union" pattern with something like `mode: "budget"` to support mutually-exclusive distinct sibling fields is possible as a non-breaking change if omitting the `mode` field defaults to the currently proposed behavior (which could retroactively become something like `mode: count`). 
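The "tagged union" accommodation described above could be sketched in Go roughly as follows. This is only an illustration under stated assumptions: the type names (`Retry`, `RetryMode`, `BudgetConfig`), the field names, and the validation rules are hypothetical and are not part of the Gateway API or of this GEP. The sketch shows how an omitted `mode` could default to count-based behavior, so that existing configuration keeps working unchanged.

```go
package main

import (
	"errors"
	"fmt"
)

// RetryMode is a hypothetical discriminator for the retry stanza.
type RetryMode string

const (
	RetryModeCount  RetryMode = "count"
	RetryModeBudget RetryMode = "budget"
)

// BudgetConfig is a hypothetical placeholder for budget settings.
type BudgetConfig struct {
	Percent int
}

// Retry is a hypothetical shape for the retry stanza, not the real API type.
type Retry struct {
	Mode     RetryMode     // discriminator; empty means "count"
	Attempts *int          // union member used when mode is "count"
	Budget   *BudgetConfig // union member used when mode is "budget"
}

// EffectiveMode treats an omitted mode as "count", which is what would make
// introducing the discriminator a non-breaking change.
func (r Retry) EffectiveMode() RetryMode {
	if r.Mode == "" {
		return RetryModeCount
	}
	return r.Mode
}

// Validate rejects combinations that mix members of different modes.
func (r Retry) Validate() error {
	switch r.EffectiveMode() {
	case RetryModeCount:
		if r.Budget != nil {
			return errors.New("budget must not be set when mode is count")
		}
	case RetryModeBudget:
		if r.Budget == nil {
			return errors.New("budget must be set when mode is budget")
		}
		if r.Attempts != nil {
			return errors.New("attempts must not be set when mode is budget")
		}
	default:
		return fmt.Errorf("unknown mode %q", r.Mode)
	}
	return nil
}

func main() {
	attempts := 3
	legacy := Retry{Attempts: &attempts} // no mode: behaves as count
	mixed := Retry{Mode: RetryModeBudget, Attempts: &attempts, Budget: &BudgetConfig{Percent: 20}}
	fmt.Println(legacy.EffectiveMode(), legacy.Validate())
	fmt.Println(mixed.Validate())
}
```

In a real CRD this kind of check would more likely be expressed as schema validation than as Go code; the function form is used here only to make the mutual exclusion explicit.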
From 979c8c984c56a9e70e48bdcd9f479e9233a9d0ed Mon Sep 17 00:00:00 2001 From: ericdbishop Date: Sun, 8 Dec 2024 14:09:02 -0500 Subject: [PATCH 06/13] add mkdocs --- mkdocs.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/mkdocs.yml b/mkdocs.yml index b249db6cdf..db9f3c2d3c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -127,6 +127,7 @@ nav: - geps/gep-1867/index.md - geps/gep-2648/index.md - geps/gep-2649/index.md + - geps/gep-3388/index.md - Implementable: - geps/gep-3155/index.md - Experimental: From d7353c8fed9c8c9b895cf8beae6a5b26d8c325f2 Mon Sep 17 00:00:00 2001 From: ericdbishop Date: Sun, 8 Dec 2024 14:24:52 -0500 Subject: [PATCH 07/13] formatting --- geps/gep-3388/index.md | 82 +++++++++--------------------------------- 1 file changed, 16 insertions(+), 66 deletions(-) diff --git a/geps/gep-3388/index.md b/geps/gep-3388/index.md index 553a552f9c..308c351fea 100644 --- a/geps/gep-3388/index.md +++ b/geps/gep-3388/index.md @@ -7,83 +7,38 @@ ## TLDR -To allow configuration of a "retry budget" in HTTPRoute, to limit the rate of -client-side retries based on a percentage of the active request load across all -endpoints of a destination service. +To allow configuration of a "retry budget" in HTTPRoute, to limit the rate of client-side retries based on a percentage of the active request load across all endpoints of a destination service. ## Goals -* To allow specification of a retry - ["budget"](https://finagle.github.io/blog/2016/02/08/retry-budgets/) to - determine whether a request should be retried, and any shared configuration - or interaction with configuration of a static retry limit within HTTPRoute. -* To allow specification of a percentage of active requests, or recently active - requests, that should be able - to be retried concurrently. -* To allow specification of a *minimum* number of retries that should be - allowed per second or concurrently, such that the budget for retries never - goes below this minimum value. 
-* To define a standard for retry budgets that reconciles the known - differences in current retry budget functionality between Gateway API data - plane implementations. +* To allow specification of a retry ["budget"](https://finagle.github.io/blog/2016/02/08/retry-budgets/) to determine whether a request should be retried, and any shared configuration or interaction with configuration of a static retry limit within HTTPRoute. +* To allow specification of a percentage of active requests, or recently active requests, that should be able to be retried concurrently. +* To allow specification of a *minimum* number of retries that should be allowed per second or concurrently, such that the budget for retries never goes below this minimum value. +* To define a standard for retry budgets that reconciles the known differences in current retry budget functionality between Gateway API data plane implementations. ## Non-Goals * To allow specifying a default retry budget policy across a namespace or attached to a specific gateway. -* To allow configuration of a backoff strategy or timeout window within the retry budget spec. +* To allow configuration of a back-off strategy or timeout window within the retry budget spec. * To allow specifying inclusion of specific HTTP status codes and responses within the retry budget spec. -* To allow specification of more than one retry budget for a - givne service, for specific subsets of its traffic. +* To allow specification of more than one retry budget for a given service, for specific subsets of its traffic. ## Introduction -Multiple data plane proxies offer optional configuration for budgeted retries, -in order to create a dynamic limit on the amount of a service's active request that is being retried across its clients. 
In the case of Linkerd, retry budgets -are the default retry policy configuration for HTTP retries within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), with static max -retries being a [fairly recent addition](https://linkerd.io/2024/08/13/announcing-linkerd-2.16/). - -Configuring a limit for client retries is an important factor in building a -resilient system, allowing requests to be successfully retried during periods -of intermittent failure. But too many client-side retries can also exacerbate consistent -failures and slow down recovery, quickly overwhelming a failing -system and leading to cascading failures such as retry -storms. Configuring a sane -limit for max client-side retries is often challenging in complex -systems. Allowing an application developer (Ana) to configure a dynamic -"retry budget", reducing the risk of a high number of retries across clients, -allows a service to perform as expected in both times of high & low request -load, as well -as both during periods of intermittent & consistent failures. - -While HTTPRoute retry budget configuration has been a frequently discussed -feature within the community, differences in semantics between different data -plane proxies -creates a challenge for a consensus on the correct location for the -configuration. - -Envoy, for example, offers retry budgets as a configurable circuit breaker threshold -for concurrent retries to an upstream cluster, in favor of configuring a static -active retry threshold. In Istio, Envoy circuit breaker -thresholds are typically configured [within the DestinationRule -CRD](https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-HTTPSettings), -which -applies rules to clients of a service after routing has already occurred. 
-The linkerd implementation of -retry budgets is configured alongside service route configuration, within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), limiting the number -of total retries for a service as a percentage of the number of recent requests. -This proposal aims to determine where retry budget's should be defined within the -Gateway API, -and whether data plane proxies may need to be altered to accommodate the -specification. +Multiple data plane proxies offer optional configuration for budgeted retries, in order to create a dynamic limit on the amount of a service's active request that is being retried across its clients. In the case of Linkerd, retry budgets are the default retry policy configuration for HTTP retries within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), with static max retries being a [fairly recent addition](https://linkerd.io/2024/08/13/announcing-linkerd-2.16/). + +Configuring a limit for client retries is an important factor in building a resilient system, allowing requests to be successfully retried during periods of intermittent failure. But too many client-side retries can also exacerbate consistent failures and slow down recovery, quickly overwhelming a failing system and leading to cascading failures such as retry storms. Configuring a sane limit for max client-side retries is often challenging in complex systems. Allowing an application developer (Ana) to configure a dynamic "retry budget", reducing the risk of a high number of retries across clients, allows a service to perform as expected in both times of high & low request load, as well as both during periods of intermittent & consistent failures. + +While HTTPRoute retry budget configuration has been a frequently discussed feature within the community, differences in semantics between different data plane proxies creates a challenge for a consensus on the correct location for the configuration. 
+ +Envoy, for example, offers retry budgets as a configurable circuit breaker threshold for concurrent retries to an upstream cluster, in favor of configuring a static active retry threshold. In Istio, Envoy circuit breaker thresholds are typically configured [within the DestinationRule CRD](https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-HTTPSettings), which applies rules to clients of a service after routing has already occurred. The linkerd implementation of retry budgets is configured alongside service route configuration, within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), limiting the number of total retries for a service as a percentage of the number of recent requests. This proposal aims to determine where retry budget's should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification. ### Background on implementations #### Envoy -Supports configuring a -[RetryBudget](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-msg-config-cluster-v3-circuitbreakers-thresholds-retrybudget) -CircuitBreaker threshold across a group of upstream endpoints, with the following parameters. +Supports configuring a [RetryBudget](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-msg-config-cluster-v3-circuitbreakers-thresholds-retrybudget) CircuitBreaker threshold across a group of upstream endpoints, with the following parameters. * `budget_percent` Specifies the limit on concurrent retries as a percentage of the sum of active requests and active pending requests. For example, if there are 100 active requests and the budget_percent is set to 25, there may be 25 active retries. This parameter is optional. Defaults to 20%. 
@@ -93,12 +48,7 @@ CircuitBreaker threshold across a group of upstream endpoints, with the followin Linkerd supports [budgeted retries](https://linkerd.io/2.15/features/retries-and-timeouts/), the default way to specify retries to a service, and - as of [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5) - counted retries. In all cases, retries are implemented by the `linkerd2-proxy` making the request on behalf on an application workload. -Linkerd's budgeted retries allow retrying an indefinite number of times, as -long as the fraction of retries remains within the budget. Budgeted retries are -supported only using Linkerd's native ServiceProfile CRD, which allows enabling -retries, setting the retry budget (by default, 20% plus 10 "extra" retries per -second), and configuring the window over which the fraction of retries to -non-retries is calculated. +Linkerd's budgeted retries allow retrying an indefinite number of times, as long as the fraction of retries remains within the budget. Budgeted retries are supported only using Linkerd's native ServiceProfile CRD, which allows enabling retries, setting the retry budget (by default, 20% plus 10 "extra" retries per second), and configuring the window over which the fraction of retries to non-retries is calculated. ## API From c18895c4cb65355c32c7e3f83ad11a355da7fefc Mon Sep 17 00:00:00 2001 From: Eric Bishop Date: Tue, 14 Jan 2025 10:07:05 -0500 Subject: [PATCH 08/13] Minor improvements --- geps/gep-3388/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/geps/gep-3388/index.md b/geps/gep-3388/index.md index 308c351fea..c3613c3779 100644 --- a/geps/gep-3388/index.md +++ b/geps/gep-3388/index.md @@ -7,7 +7,7 @@ ## TLDR -To allow configuration of a "retry budget" in HTTPRoute, to limit the rate of client-side retries based on a percentage of the active request load across all endpoints of a destination service. 
+To allow configuration of a "retry budget" in HTTPRoute, to determine when to prevent additional client-side retries, by limiting the percentage of the active request load that may consist of retries, across all endpoints of a destination service. ## Goals @@ -32,7 +32,7 @@ Configuring a limit for client retries is an important factor in building a resi While HTTPRoute retry budget configuration has been a frequently discussed feature within the community, differences in semantics between different data plane proxies creates a challenge for a consensus on the correct location for the configuration. -Envoy, for example, offers retry budgets as a configurable circuit breaker threshold for concurrent retries to an upstream cluster, in favor of configuring a static active retry threshold. In Istio, Envoy circuit breaker thresholds are typically configured [within the DestinationRule CRD](https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-HTTPSettings), which applies rules to clients of a service after routing has already occurred. The linkerd implementation of retry budgets is configured alongside service route configuration, within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), limiting the number of total retries for a service as a percentage of the number of recent requests. This proposal aims to determine where retry budget's should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification. +Envoy, for example, offers retry budgets as a configurable circuit breaker threshold for concurrent retries to an upstream cluster, in favor of configuring a static active retry threshold. 
In Istio, Envoy circuit breaker thresholds are typically configured [within the DestinationRule CRD](https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-HTTPSettings), which applies rules to clients of a service after routing has already occurred. The Linkerd implementation of retry budgets is configured alongside service route configuration, within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), limiting the number of total retries for a service as a percentage of the number of recent requests. This proposal aims to determine where retry budget's should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification. ### Background on implementations From 80474d948cdd7f94b2f90ba2fbbb01baa652383f Mon Sep 17 00:00:00 2001 From: Eric Bishop Date: Thu, 16 Jan 2025 10:07:04 -0500 Subject: [PATCH 09/13] Minor improvements --- geps/gep-3388/index.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/geps/gep-3388/index.md b/geps/gep-3388/index.md index c3613c3779..1b1f015223 100644 --- a/geps/gep-3388/index.md +++ b/geps/gep-3388/index.md @@ -21,14 +21,14 @@ To allow configuration of a "retry budget" in HTTPRoute, to determine when to pr * To allow specifying a default retry budget policy across a namespace or attached to a specific gateway. * To allow configuration of a back-off strategy or timeout window within the retry budget spec. * To allow specifying inclusion of specific HTTP status codes and responses within the retry budget spec. -* To allow specification of more than one retry budget for a given service, for specific subsets of its traffic. +* To allow specification of more than one retry budget for a given service, or for specific subsets of its traffic. 
## Introduction Multiple data plane proxies offer optional configuration for budgeted retries, in order to create a dynamic limit on the amount of a service's active request that is being retried across its clients. In the case of Linkerd, retry budgets are the default retry policy configuration for HTTP retries within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), with static max retries being a [fairly recent addition](https://linkerd.io/2024/08/13/announcing-linkerd-2.16/). -Configuring a limit for client retries is an important factor in building a resilient system, allowing requests to be successfully retried during periods of intermittent failure. But too many client-side retries can also exacerbate consistent failures and slow down recovery, quickly overwhelming a failing system and leading to cascading failures such as retry storms. Configuring a sane limit for max client-side retries is often challenging in complex systems. Allowing an application developer (Ana) to configure a dynamic "retry budget", reducing the risk of a high number of retries across clients, allows a service to perform as expected in both times of high & low request load, as well as both during periods of intermittent & consistent failures. +Configuring a limit for client retries is an important factor in building a resilient system, allowing requests to be successfully retried during periods of intermittent failure. But too many client-side retries can also exacerbate consistent failures and slow down recovery, quickly overwhelming a failing system and leading to cascading failures such as retry storms. Configuring a sane limit for max client-side retries is often challenging in complex systems. Allowing an application developer (Ana) to configure a dynamic "retry budget" reduces the risk of a high number of retries across clients. 
It allows a service to perform as expected in both times of high & low request load, as well as both during periods of intermittent & consistent failures. While HTTPRoute retry budget configuration has been a frequently discussed feature within the community, differences in semantics between different data plane proxies creates a challenge for a consensus on the correct location for the configuration. @@ -38,12 +38,14 @@ Envoy, for example, offers retry budgets as a configurable circuit breaker thres #### Envoy -Supports configuring a [RetryBudget](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-msg-config-cluster-v3-circuitbreakers-thresholds-retrybudget) CircuitBreaker threshold across a group of upstream endpoints, with the following parameters. +Supports configuring an optional [RetryBudget](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-msg-config-cluster-v3-circuitbreakers-thresholds-retrybudget) CircuitBreaker threshold across a group of upstream endpoints, with the following parameters: * `budget_percent` Specifies the limit on concurrent retries as a percentage of the sum of active requests and active pending requests. For example, if there are 100 active requests and the budget_percent is set to 25, there may be 25 active retries. This parameter is optional. Defaults to 20%. * `min_retry_concurrency` Specifies the minimum retry concurrency allowed for the retry budget. The limit on the number of active retries may never go below this number. This parameter is optional. Defaults to 3. +By default, Envoy uses a static threshold for retries. But when configured, Envoy's retry budget threshold overrides any other retry circuit breaker that has been configured. 
+ #### linkerd2-proxy Linkerd supports [budgeted retries](https://linkerd.io/2.15/features/retries-and-timeouts/), the default way to specify retries to a service, and - as of [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5) - counted retries. In all cases, retries are implemented by the `linkerd2-proxy` making the request on behalf on an application workload. From 04fefd8b1e763e2f14abe6bf5bf555aaa67b2c02 Mon Sep 17 00:00:00 2001 From: Eric Bishop Date: Fri, 17 Jan 2025 10:27:17 -0500 Subject: [PATCH 10/13] Detail comparison between Policy Attachment & HTTPRoute configuration; minor improvements --- geps/gep-3388/index.md | 40 ++++++++++++++++++++++++++----------- geps/gep-3388/metadata.yaml | 2 +- 2 files changed, 29 insertions(+), 13 deletions(-) diff --git a/geps/gep-3388/index.md b/geps/gep-3388/index.md index 1b1f015223..a460cd250b 100644 --- a/geps/gep-3388/index.md +++ b/geps/gep-3388/index.md @@ -1,4 +1,4 @@ -# GEP-3388: HTTPRoute Retry Budget +# GEP-3388: HTTP Retry Budget * Issue: [#3388](https://github.com/kubernetes-sigs/gateway-api/issues/3388) * Status: Provisional @@ -7,7 +7,7 @@ ## TLDR -To allow configuration of a "retry budget" in HTTPRoute, to determine when to prevent additional client-side retries, by limiting the percentage of the active request load that may consist of retries, across all endpoints of a destination service. +To allow configuration of a "retry budget" across all endpoints of a destination service, preventing additional client-side retries when the percentage of the active request load consisting of retries reaches a certain threshold. ## Goals @@ -30,15 +30,15 @@ Multiple data plane proxies offer optional configuration for budgeted retries, i Configuring a limit for client retries is an important factor in building a resilient system, allowing requests to be successfully retried during periods of intermittent failure. 
But too many client-side retries can also exacerbate consistent failures and slow down recovery, quickly overwhelming a failing system and leading to cascading failures such as retry storms. Configuring a sane limit for max client-side retries is often challenging in complex systems. Allowing an application developer (Ana) to configure a dynamic "retry budget" reduces the risk of a high number of retries across clients. It allows a service to perform as expected in both times of high & low request load, as well as both during periods of intermittent & consistent failures. -While HTTPRoute retry budget configuration has been a frequently discussed feature within the community, differences in semantics between different data plane proxies creates a challenge for a consensus on the correct location for the configuration. - -Envoy, for example, offers retry budgets as a configurable circuit breaker threshold for concurrent retries to an upstream cluster, in favor of configuring a static active retry threshold. In Istio, Envoy circuit breaker thresholds are typically configured [within the DestinationRule CRD](https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-HTTPSettings), which applies rules to clients of a service after routing has already occurred. The Linkerd implementation of retry budgets is configured alongside service route configuration, within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), limiting the number of total retries for a service as a percentage of the number of recent requests. This proposal aims to determine where retry budget's should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification. 
+While retry budget configuration has been a frequently discussed feature within the community, differences in the semantics between data plane implementations create a challenge for reaching consensus on the correct location for the configuration. This proposal aims to determine where retry budgets should be defined within the Gateway API, and whether data plane proxies may need to be altered to accommodate the specification. ### Background on implementations #### Envoy -Supports configuring an optional [RetryBudget](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-msg-config-cluster-v3-circuitbreakers-thresholds-retrybudget) CircuitBreaker threshold across a group of upstream endpoints, with the following parameters: +Envoy offers retry budgets as a configurable circuit breaker threshold for concurrent retries to an upstream cluster, as an alternative to configuring a static max retry threshold. In Istio, Envoy circuit breaker thresholds are typically configured [within the DestinationRule CRD](https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-HTTPSettings), which applies rules to clients of a service after routing has already occurred. + +The optional [RetryBudget](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-msg-config-cluster-v3-circuitbreakers-thresholds-retrybudget) CircuitBreaker threshold can be configured with the following parameters: * `budget_percent` Specifies the limit on concurrent retries as a percentage of the sum of active requests and active pending requests. For example, if there are 100 active requests and the budget_percent is set to 25, there may be 25 active retries. This parameter is optional. Defaults to 20%. @@ -48,10 +48,28 @@ By default, Envoy uses a static threshold for retries.
But when configured, Envo #### linkerd2-proxy +The Linkerd implementation of retry budgets is configured alongside service route configuration, within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), limiting the number of total retries for a service as a percentage of the number of recent requests. In practice, this functions similarly to Envoy's retry budget implementation, as it is configured in a single location and measures the ratio of retry requests to original requests across all traffic destined for the service. + Linkerd supports [budgeted retries](https://linkerd.io/2.15/features/retries-and-timeouts/), the default way to specify retries to a service, and - as of [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5) - counted retries. In all cases, retries are implemented by the `linkerd2-proxy` making the request on behalf on an application workload. Linkerd's budgeted retries allow retrying an indefinite number of times, as long as the fraction of retries remains within the budget. Budgeted retries are supported only using Linkerd's native ServiceProfile CRD, which allows enabling retries, setting the retry budget (by default, 20% plus 10 "extra" retries per second), and configuring the window over which the fraction of retries to non-retries is calculated. +### Proposed Design + +#### Retry Budget Policy Attachment + +While current retry behavior is defined at the routing rule level within HTTPRoute, exposing retry budget configuration as a policy attachment offers some advantages: + +* Users could define a single policy, targeting a service, that would dynamically configure a retry threshold based on the percentage of active requests across *all routes* destined for that service's backends. + +* In both Envoy and Linkerd data plane implementations, a retry budget is configured once to match all endpoints of a service, regardless of the routing rule that the request matches on. 
A policy attachment will allow for a single configuration for a service's retry budget, as opposed to configuring the retry budget across multiple HTTPRoute objects (see [Alternatives](#httproute-retry-budget)). + +* Being able to configure a dynamic threshold of retries at the service level, alongside a static max number of retries on the route level. In practice, application developers would then be allowed more granular control of which requests should be retried. For example, an application developer may not want to perform retries on a specific route where requests are not idempotent, and can disable retries for that route. By having a retry budget policy configured, retries from other routes will still benefit from the budgeted retries. + +Configuring a retry budget through a Policy Attachment may produce some confusion from a UX perspective, as users will be able to configure retries in two different places (HTTPRoute for static retries, versus a policy attachment for a dynamic retry threshold). Though this is likely a fair trade-off. + +Discrepancies in the semantics of retry budget behavior and configuration options between Envoy and Linkerd may require a change in either implementation to accommodate the Gateway API specification. + ## API ### Go @@ -68,18 +86,16 @@ TODO ## Alternatives -### Policy Attachment +### HTTPRoute Retry Budget -TODO +* The desired UX for retry budgets is to apply the policy at the service level, rather than individually across each route targeting the service. Placing the retry budget configuration within HTTPRoute would violate this requirement, as separate HTTPRoute objects could each have routing rules targeting the same destination service, and a single HTTPRoute object can target multiple destinations. To apply a retry budget to all routes targeting a service, a user would need to duplicate the configuration across multiple routing rules. 
+ +* If we wanted retry budgets to be configured on a per-route basis (as opposed to at the service level), it would require a change to be made in Envoy Route. And more than likely, similar changes would need to be made for Linkerd. ## Other considerations TODO -### What accommodations are needed for retry budget support? - -Changing the retry stanza to a Kubernetes "tagged union" pattern with something like `mode: "budget"` to support mutually-exclusive distinct sibling fields is possible as a non-breaking change if omitting the `mode` field defaults to the currently proposed behavior (which could retroactively become something like `mode: count`). - ## References * diff --git a/geps/gep-3388/metadata.yaml b/geps/gep-3388/metadata.yaml index c044495512..4cbc7a6b82 100644 --- a/geps/gep-3388/metadata.yaml +++ b/geps/gep-3388/metadata.yaml @@ -1,7 +1,7 @@ apiVersion: internal.gateway.networking.k8s.io/v1alpha1 kind: GEPDetails number: 3388 -name: HTTPRoute Retry Budget +name: HTTP Retry Budget status: Provisional # Any authors who contribute to the GEP in any way should be listed here using # their Github handle. From 39b4b0f0d123f71219bac1e7456a5651dbb2009c Mon Sep 17 00:00:00 2001 From: Eric Bishop Date: Fri, 17 Jan 2025 10:52:24 -0500 Subject: [PATCH 11/13] Better compare retry budget options between Linkerd and Envoy implementations --- geps/gep-3388/index.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/geps/gep-3388/index.md b/geps/gep-3388/index.md index a460cd250b..4e9ac4f756 100644 --- a/geps/gep-3388/index.md +++ b/geps/gep-3388/index.md @@ -50,9 +50,15 @@ By default, Envoy uses a static threshold for retries. But when configured, Envo The Linkerd implementation of retry budgets is configured alongside service route configuration, within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), limiting the number of total retries for a service as a percentage of the number of recent requests. 
In practice, this functions similarly to Envoy's retry budget implementation, as it is configured in a single location and measures the ratio of retry requests to original requests across all traffic destined for the service. -Linkerd supports [budgeted retries](https://linkerd.io/2.15/features/retries-and-timeouts/), the default way to specify retries to a service, and - as of [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5) - counted retries. In all cases, retries are implemented by the `linkerd2-proxy` making the request on behalf on an application workload. +Linkerd uses [budgeted retries](https://linkerd.io/2.15/features/retries-and-timeouts/) as the default way to configure retries to a service but, as of [edge-24.7.5](https://github.com/linkerd/linkerd2/releases/tag/edge-24.7.5), also supports counted retries. In all cases, retries are implemented by the `linkerd2-proxy` making the request on behalf of an application workload. -Linkerd's budgeted retries allow retrying an indefinite number of times, as long as the fraction of retries remains within the budget. Budgeted retries are supported only using Linkerd's native ServiceProfile CRD, which allows enabling retries, setting the retry budget (by default, 20% plus 10 "extra" retries per second), and configuring the window over which the fraction of retries to non-retries is calculated. +Linkerd's budgeted retries allow retrying an indefinite number of times, as long as the fraction of retries remains within the budget. Budgeted retries are supported only using Linkerd's native ServiceProfile CRD, which allows enabling retries, setting the retry budget (by default, 20% plus 10 "extra" retries per second), and configuring the window over which the fraction of retries to non-retries is calculated.
The `retryBudget` field of the ServiceProfile spec can be configured with the following optional parameters: + +* `retryRatio` Specifies a ratio of retry requests to original requests that is allowed. The default is 0.2, meaning that retries may add up to 20% to the request load. + +* `minRetriesPerSecond` Specifies the minimum rate of retries per second that is allowed, so that retries are not prevented when the request load is very low. The default is 10. + +* `ttl` A duration specifying how long requests are considered for when calculating the retry threshold. The default is 10s. ### Proposed Design @@ -68,7 +74,9 @@ While current retry behavior is defined at the routing rule level within HTTPRou Configuring a retry budget through a Policy Attachment may produce some confusion from a UX perspective, as users will be able to configure retries in two different places (HTTPRoute for static retries, versus a policy attachment for a dynamic retry threshold). Though this is likely a fair trade-off. -Discrepancies in the semantics of retry budget behavior and configuration options between Envoy and Linkerd may require a change in either implementation to accommodate the Gateway API specification. +Discrepancies in the semantics of retry budget behavior and configuration options between Envoy and Linkerd may require a change in either implementation to accommodate the Gateway API specification. While Envoy's `min_retry_concurrency` setting may behave similarly in practice to Linkerd's `minRetriesPerSecond`, they are not directly equivalent. + +A version of Linkerd's `ttl` parameter may also need to be implemented within Envoy. 
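The difference between the two budget models described above can be illustrated with a small sketch. This is a simplified reading of the documented parameters and defaults, not either proxy's actual algorithm: the function names are hypothetical, and the window-based formula assumes the per-second minimum is simply added to the ratio-based allowance over the `ttl` window.

```python
import math


def envoy_max_active_retries(active_requests, pending_requests,
                             budget_percent=20.0, min_retry_concurrency=3):
    """Concurrency-based limit (hypothetical helper).

    Allows a percentage of in-flight requests to be retries, but never
    fewer than min_retry_concurrency concurrent retries.
    """
    in_flight = active_requests + pending_requests
    budgeted = math.floor(in_flight * budget_percent / 100.0)
    return max(min_retry_concurrency, budgeted)


def linkerd_style_retry_allowance(requests_in_window, retry_ratio=0.2,
                                  min_retries_per_second=10, ttl_seconds=10):
    """Window-based limit (hypothetical helper, simplified).

    Allows a ratio of the requests seen during the ttl window to be
    retried, plus a fixed per-second reserve so that retries are not
    starved when the request load is very low.
    """
    reserve = min_retries_per_second * ttl_seconds
    return math.floor(requests_in_window * retry_ratio) + reserve
```

With 100 active requests and `budget_percent` set to 25, `envoy_max_active_retries(100, 0, budget_percent=25)` yields 25 concurrent retries, matching the example in the Envoy parameter description. The first function counts retries in flight at a single instant, while the second counts retries over a period of time, which is why `min_retry_concurrency` and `minRetriesPerSecond` are not directly interchangeable.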
## API From 1877e2fff45c50dd5a469b2f5d743a5e98c566b3 Mon Sep 17 00:00:00 2001 From: Eric Bishop <60610299+ericdbishop@users.noreply.github.com> Date: Sat, 18 Jan 2025 06:05:57 -0500 Subject: [PATCH 12/13] Apply suggestions from code review Co-authored-by: Mike Morris --- geps/gep-1731/metadata.yaml | 2 +- geps/gep-3388/index.md | 12 +++++++----- geps/gep-3388/metadata.yaml | 2 +- 3 files changed, 9 insertions(+), 7 deletions(-) diff --git a/geps/gep-1731/metadata.yaml b/geps/gep-1731/metadata.yaml index e8a7f661e2..2bb398026b 100644 --- a/geps/gep-1731/metadata.yaml +++ b/geps/gep-1731/metadata.yaml @@ -19,7 +19,7 @@ relationships: extends: {} extendedBy: - number: 3388 - name: HTTPRoute Retry Budget + name: Retry Budgets # seeAlso indicates other GEPs that are relevant in some way without being # covered by an existing relationship. seeAlso: diff --git a/geps/gep-3388/index.md b/geps/gep-3388/index.md index 4e9ac4f756..78449d9148 100644 --- a/geps/gep-3388/index.md +++ b/geps/gep-3388/index.md @@ -1,4 +1,4 @@ -# GEP-3388: HTTP Retry Budget +# GEP-3388: Retry Budgets * Issue: [#3388](https://github.com/kubernetes-sigs/gateway-api/issues/3388) * Status: Provisional @@ -23,10 +23,9 @@ To allow configuration of a "retry budget" across all endpoints of a destination * To allow specifying inclusion of specific HTTP status codes and responses within the retry budget spec. * To allow specification of more than one retry budget for a given service, or for specific subsets of its traffic. - ## Introduction -Multiple data plane proxies offer optional configuration for budgeted retries, in order to create a dynamic limit on the amount of a service's active request that is being retried across its clients. 
In the case of Linkerd, retry budgets are the default retry policy configuration for HTTP retries within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), with static max retries being a [fairly recent addition](https://linkerd.io/2024/08/13/announcing-linkerd-2.16/). +Multiple data plane proxies offer optional configuration for budgeted retries, in order to create a dynamic limit on the amount of a service's active request load that is comprised of retries from across its clients. In the case of Linkerd, retry budgets are the default retry policy configuration for HTTP retries within the [ServiceProfile CRD](https://linkerd.io/2.12/reference/service-profiles/), with static max retries being a [fairly recent addition](https://linkerd.io/2024/08/13/announcing-linkerd-2.16/). Configuring a limit for client retries is an important factor in building a resilient system, allowing requests to be successfully retried during periods of intermittent failure. But too many client-side retries can also exacerbate consistent failures and slow down recovery, quickly overwhelming a failing system and leading to cascading failures such as retry storms. Configuring a sane limit for max client-side retries is often challenging in complex systems. Allowing an application developer (Ana) to configure a dynamic "retry budget" reduces the risk of a high number of retries across clients. It allows a service to perform as expected in both times of high & low request load, as well as both during periods of intermittent & consistent failures. @@ -76,7 +75,7 @@ Configuring a retry budget through a Policy Attachment may produce some confusio Discrepancies in the semantics of retry budget behavior and configuration options between Envoy and Linkerd may require a change in either implementation to accommodate the Gateway API specification. 
While Envoy's `min_retry_concurrency` setting may behave similarly in practice to Linkerd's `minRetriesPerSecond`, they are not directly equivalent. -A version of Linkerd's `ttl` parameter may also need to be implemented within Envoy. +The implementation of a version of Linkerd's `ttl` parameter within Envoy might be a path towards reconciling the behavior of these implementations, as it could allow Envoy to express a `budget_percent` and minimum number of permissible retries over a period of time rather than by tracking active and pending connections. It is not currently clear which of these models is preferable, but being able to specify a budget as requests over a window of time seems like it might offer more predictable behavior. ## API @@ -102,11 +101,14 @@ TODO ## Other considerations -TODO +* Is it worth allowing the budget to be expressed as a `Fraction` similar to `HTTPRequestMirrorFilter` as described in [GEP-3171](https://gateway-api.sigs.k8s.io/geps/gep-3171/), or is a percentage sufficient for this use case? (Expressing a sub-1% budget for retries seems less necessary than for mirroring or redirecting traffic at significant scale.) +* As there isn't anything inherently specific to HTTP requests in either known implementation, a retry budget policy on a target Service could likely be applicable to GRPCRoute as well as HTTPRoute requests. +* While retry budgets are commonly associated with service mesh use cases to handle many distributed clients, a retry budget policy may also be desirable for north/south implementations of Gateway API to prioritize new inbound requests and minimize tail latency during periods of service instability.
## References * +* * * * diff --git a/geps/gep-3388/metadata.yaml b/geps/gep-3388/metadata.yaml index 4cbc7a6b82..02bbf069b8 100644 --- a/geps/gep-3388/metadata.yaml +++ b/geps/gep-3388/metadata.yaml @@ -1,7 +1,7 @@ apiVersion: internal.gateway.networking.k8s.io/v1alpha1 kind: GEPDetails number: 3388 -name: HTTP Retry Budget +name: Retry Budgets status: Provisional # Any authors who contribute to the GEP in any way should be listed here using # their Github handle. From 9e01592522f82f82dc2afc57d1fa824b813ad40c Mon Sep 17 00:00:00 2001 From: Eric Bishop Date: Tue, 28 Jan 2025 11:54:59 -0500 Subject: [PATCH 13/13] Remove consideration to express budget as a Fraction --- geps/gep-3388/index.md | 1 - 1 file changed, 1 deletion(-) diff --git a/geps/gep-3388/index.md b/geps/gep-3388/index.md index 78449d9148..ecf10dcd67 100644 --- a/geps/gep-3388/index.md +++ b/geps/gep-3388/index.md @@ -101,7 +101,6 @@ TODO ## Other considerations -* Is it worth allowing the budget to be expressed as a `Fraction` similar to `HTTPRequestMirrorFilter` as described in [GEP-3171](https://gateway-api.sigs.k8s.io/geps/gep-3171/), or is a percentage sufficient for this use case? (Expressing a sub-1% budget for retries seems less necessary than for mirroring or redirecting traffic at significant scale.) * As there isn't anything inherently specific to HTTP requests in either known implementation, a retry budget policy on a target Service could likely be applicable to GRPCRoute as well as HTTPRoute requests. * While retry budgets are commonly associated with service mesh use cases to handle many distributed clients, a retry budget policy may also be desirable for north/south implementations of Gateway API to prioritize new inbound requests and minimize tail latency during periods of service instability.