Configurable Retries in HTTPRoute #1731

robscott · 2023-02-15T19:08:13Z

What would you like to be added:
I would like to be able to configure the following in HTTPRoute:

The max number of times to retry a request
The reason(s) and/or status codes for which a request should be retried
The timeout for each retry attempt

I believe all 3 of these would be implementable for Envoy-based implementations, and the first 2 would be implementable for HAProxy-based implementations. It's unclear what would be implementable for NGINX or others (cc @pleshakov @shaneutt).
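To make the three knobs concrete, here is a minimal client-side sketch of how they might interact. It is not taken from any proxy or from the Gateway API; `RetryConfig`, `send_with_retries`, and the default status codes are hypothetical names and values chosen for illustration, and a real data plane would apply this logic inside the proxy rather than in application code.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class RetryConfig:
    """Hypothetical holder for the three settings requested above."""
    max_retries: int = 2  # retries allowed after the first attempt
    retry_on_codes: FrozenSet[int] = frozenset({502, 503, 504})
    per_try_timeout_s: float = 1.0  # timeout handed to each attempt


def send_with_retries(
    send: Callable[[float], int], cfg: RetryConfig
) -> Tuple[Optional[int], int]:
    """Call send(timeout) until it returns a non-retryable status or retries run out.

    Returns (last_status_or_None, attempts_made). A TimeoutError raised by
    send is treated as a retryable failure.
    """
    attempts = 0
    status: Optional[int] = None
    while attempts <= cfg.max_retries:
        attempts += 1
        try:
            status = send(cfg.per_try_timeout_s)
        except TimeoutError:
            status = None  # the per-attempt timeout counts as retryable
            continue
        if status not in cfg.retry_on_codes:
            break  # success, or an error that is not configured as retryable
    return status, attempts
```

In this sketch `max_retries` counts retries only, so a value of 2 permits up to 3 attempts in total. Whether a real API counts attempts or retries is a design choice this example does not settle.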

Why this is needed:
This is a common feature request and represents a concept that would likely get tied to a variety of custom policies if we did not include it in the main API.

Xunzhuo · 2023-03-08T04:28:31Z

/assign
/remove-help

kflynn · 2023-03-09T20:23:59Z

@robscott, on the mesh front, Linkerd actually doesn't support the fixed-number-of-retries concept. Instead there's a retry budget: if your retry budget is e.g. 20%, then as long as not more than 20% of your request volume is retries, Linkerd can continue retrying.

It would be lovely to be able to configure that form of retry, too, without having to resort to crazy custom stuff. Just to complicate your life. 😉
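For readers unfamiliar with the idea, a retry budget can be sketched roughly as follows. This is not Linkerd's implementation: the `RetryBudget` class and its methods are hypothetical, "request volume" is assumed to mean original requests plus retries, and the sketch uses plain counters where a real proxy would likely track a recent time window.

```python
class RetryBudget:
    """Hypothetical counter-based retry budget (no time window)."""

    def __init__(self, percent: int) -> None:
        self.percent = percent  # e.g. 20 means retries may be at most 20% of volume
        self.requests = 0       # original (non-retry) requests seen
        self.retries = 0        # retries already granted

    def record_request(self) -> None:
        """Count one original request toward the total volume."""
        self.requests += 1

    def try_retry(self) -> bool:
        """Grant a retry only if retries would stay within the budget."""
        total_after = self.requests + self.retries + 1
        # Integer arithmetic avoids floating-point edge cases at the boundary.
        if (self.retries + 1) * 100 <= self.percent * total_after:
            self.retries += 1
            return True
        return False
```

With a 20% budget and 8 original requests, this sketch grants two retries (2 of 10 total) and refuses a third. A budget of 0 never grants a retry here, though that is a property of this sketch and not a statement about any particular proxy.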

bowei · 2023-03-10T19:57:06Z

That's a very interesting approach. Is the 20% calculated "globally" or on a per-proxy basis? A global percentage may be difficult for more distributed proxy systems to compute.

dprotaso · 2023-03-11T02:15:39Z

@kflynn does setting 0% disable retries?

ramaraochavali · 2023-05-08T12:10:35Z

Envoy also supports retry budgets: https://www.envoyproxy.io/docs/envoy/v1.26.1/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-field-config-cluster-v3-circuitbreakers-thresholds-retry-budget. If you do not specify any value for this, it disables retries.

robscott · 2023-05-08T16:58:43Z

@ramaraochavali thanks for the reference! I'd assumed this would be implemented by RetryPolicy in xDS which does not seem to require retry budgets, but I'm also far from an Envoy expert so may be missing some nuance here.

ramaraochavali · 2023-05-10T09:13:24Z

It is implemented via thresholds at the cluster level: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#config-cluster-v3-circuitbreakers-thresholds

frankbu · 2023-06-23T20:06:28Z

Although configuring retry budgets and other circuit breaking mechanisms would be useful to support, I don't think they are typically configurable on the individual route (vs service) level. Since the title of this issue is about configuring retries on HTTPRoute specifically, I wonder if it makes sense to start with a GEP that only addresses retry configuration on an HTTPRoute and then handle retry budgets, etc. (probably using policy attachments instead of explicit fields in the API) in a separate GEP.

@robscott If you want to assign the issue to me, I can put together a first pass GEP to discuss this further.

robscott · 2023-06-23T23:26:53Z

Thanks @frankbu!

> Although configuring retry budgets and other circuit breaking mechanisms would be useful to support, I don't think they are typically configurable on the individual route (vs service) level.

I'm certainly biased because GCP load balancers configure retries at the routing layer. My understanding is that both HAProxy and Envoy are also capable of this, but definitely could be wrong on either of those.

> I wonder if it makes sense to start with a GEP that only addresses retry configuration on an HTTPRoute and then handle retry budgets, etc.

That approach makes sense to me. In general, we want to include concepts in the API that are portable and have a path for >50% of implementations to support. I think what you've recommended starting with would meet those criteria, but I'm not sure retry budgets have the same portability right now (again, I could be wrong on that).

I think a GEP is a great idea here, and similar to timeouts and session affinity, it would likely be helpful to provide an overview of the current state of the world in that GEP before going too far with details. Thanks for volunteering to help out with this!

/assign @frankbu

k8s-triage-robot · 2024-03-19T09:47:59Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

mikemorris · 2024-03-26T01:23:22Z

We discussed this during a meeting of Gateway API maintainers at KubeCon EU 2024 as a potential priority for Gateway API v1.2. The first step would be a memorandum GEP documenting existing configuration and behavior across a range of implementations.

/remove-lifecycle rotten
/assign @mikemorris

achetronic · 2024-06-05T08:24:27Z

Confirmed as a needed feature. This is affecting us too (using Istio as implementation). How is the status of this on May? :)

shaneutt · 2024-06-05T11:54:00Z

Thanks for confirming your desire for this feature.

> How is the status of this on May? :)

Can you help me to better understand this? I'm uncertain what is meant 🤔

mikemorris · 2024-06-05T14:13:49Z

We're proposing this for inclusion in the Gateway API v1.2 scope as an experimental feature. I'm not quite sure of the expected release date for v1.2, but I'd expect roughly fall/late Q3 2024.

achetronic · 2024-06-10T07:10:05Z

Sorry for the delay, guys

> Thanks for confirming your desire for this feature.
>
> > How is the status of this on May? :)
>
> Can you help me to better understand this? I'm uncertain what is meant 🤔

@shaneutt I meant "what is the current status of this?" Sorry for the poor explanation.

> We're proposing this for inclusion in the Gateway API v1.2 scope as an experimental feature. I'm not quite sure of the expected release date for v1.2, but I'd expect roughly fall/late Q3 2024.

Thank you for the info! I will check it

robscott · 2024-06-10T16:48:21Z

If anyone's interested in seeing this in scope for v1.2, please upvote and/or comment on Mike's v1.2 scoping proposal.

achetronic · 2024-06-11T08:58:46Z

> If anyone's interested in seeing this in scope for v1.2, please upvote and/or comment on Mike's v1.2 scoping proposal.

Done! Thank you for clarifying this.

robscott · 2024-08-02T17:21:43Z

/reopen to track lifecycle of GEP

robscott added the kind/feature label Feb 15, 2023

shaneutt added this to the v0.7.0 milestone Feb 15, 2023

shaneutt added this to Gateway API: The Road to GA Feb 21, 2023

shaneutt moved this to Todo in Gateway API: The Road to GA Feb 21, 2023

k8s-ci-robot assigned Xunzhuo Mar 8, 2023

k8s-ci-robot removed the help wanted label Mar 8, 2023

shaneutt moved this from Todo to In Progress in Gateway API: The Road to GA Mar 8, 2023

robscott modified the milestones: v0.7.0, v0.8.0 Mar 24, 2023

frankbu mentioned this issue Apr 27, 2023

gateway-api: add timeouts and retries support istio/istio#44571

Closed

robscott mentioned this issue May 1, 2023

gateway-api: add experimental API for Istio extensions istio/api#2770

Closed

frankbu mentioned this issue May 4, 2023

GEP-1742: HTTPRoute Timeouts API #1997

Merged

shaneutt unassigned Xunzhuo May 18, 2023

shaneutt added the priority/important-longterm label and removed the priority/important-soon label May 18, 2023

shaneutt removed this from the v0.8.0 milestone May 18, 2023

shaneutt removed this from Gateway API: The Road to GA May 18, 2023

robscott moved this from Proposed to Experimental in Gateway API Enhancement Proposals (GEPs) Mar 12, 2024

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 19, 2024

k8s-ci-robot assigned mikemorris Mar 26, 2024

k8s-ci-robot removed the lifecycle/rotten label Mar 26, 2024

zirain mentioned this issue Apr 22, 2024

Expose retry budget configuration istio/istio#27419

Open

robscott moved this from Experimental to Provisional in Gateway API Enhancement Proposals (GEPs) May 20, 2024

robscott moved this from Provisional to Proposed in Gateway API Enhancement Proposals (GEPs) May 20, 2024

This was referenced May 28, 2024

gateway-api: support VirtualService-kind routes on k8s gateways istio/istio#47101

Closed

Support Retry Budget istio/api#2734

Closed

robscott added the kind/gep label Jun 11, 2024

robscott added this to the v1.2.0 milestone Jun 11, 2024

mikemorris mentioned this issue Jul 16, 2024

gep: add GEP-1731 Configurable Retries #3199

Merged

k8s-ci-robot closed this as completed in #3199 Aug 2, 2024

github-project-automation bot moved this from Proposed to Implementable in Gateway API Enhancement Proposals (GEPs) Aug 2, 2024

robscott reopened this Aug 2, 2024

mikemorris mentioned this issue Aug 27, 2024

apis: add implementation for GEP-1731 HTTPRoute Retries #3301

Merged

k8s-ci-robot closed this as completed in #3301 Aug 31, 2024

rajatsharma94 mentioned this issue Oct 10, 2024

Retry Budgets in HTTPRouteRetry #3388

Open

mikemorris moved this from Implementable to Experimental in Gateway API Enhancement Proposals (GEPs) Oct 22, 2024

ericdbishop mentioned this issue Dec 8, 2024

gep: add GEP-3388 HTTPRoute Retry Budget #3488

Draft
