Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Retry Budget #2734

Closed
wants to merge 2 commits into from
Closed

Support Retry Budget #2734

wants to merge 2 commits into from

Conversation

quraynai
Copy link

Open new MR to solve conflict. Previous MR #2585

Issue: istio/istio#27419

Istio: istio/istio#42232

@quraynai quraynai requested a review from a team as a code owner March 20, 2023 10:55
@istio-policy-bot
Copy link

😊 Welcome @quraynai! This is either your first contribution to the Istio api repo, or it's been
awhile since you've been here.

You can learn more about the Istio working groups, code of conduct, and contributing guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Mar 20, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@istio-testing istio-testing added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test labels Mar 20, 2023
@istio-testing
Copy link
Collaborator

Hi @quraynai. Thanks for your PR.

I'm waiting for a istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@linsun
Copy link
Member

linsun commented Mar 22, 2023

/ok-to-test

@istio-testing istio-testing added ok-to-test Set this label allow normal testing to take place for a PR not submitted by an Istio org member. and removed needs-ok-to-test labels Mar 22, 2023
@linsun
Copy link
Member

linsun commented Mar 22, 2023

@quraynai re your message on the initial PR, i think having istioctl analyze to report the mismatch (between VS and DR) is reasonable. I can't think of a better way to represent this either - I don't want to hold up your PR, unless someone else has a better way.

@rafatbiin
Copy link

rafatbiin commented Apr 17, 2023

hi @linsun & @quraynai, this is a bit unrelated, but how can you configure it for your service where the retry budget depends on something like active requests and active pending requests which is very low level and not easy to measure compared to the number of requests we're sending to upstream from a client. The same goes for min retry concurrency which we've less control over than retry per second(used by linkerd: https://linkerd.io/2.12/tasks/configuring-retries/#retry-budgets they probably implemented the idea of finagle version's of retry budget: https://finagle.github.io/blog/2016/02/08/retry-budgets/ which makes more sense to me). I've opened a discussion on the envoy repository but got no response from the maintainer unfortunately to discuss how we can measure and configure it for a production service.

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Aug 3, 2023
@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Sep 25, 2023
@quraynai
Copy link
Author

@rafatbiin I think number of request is harder to configure and measure compare to number of active connections.
Retries base on number of request we might need to know RPS and interval that we want to measure and it's hard to make retry budget work if we configure incorrectly.

@quraynai
Copy link
Author

@howardjohn @linsun sorry for no activities with this MR for a while.
We made our forked version and POC for a while. Could you please help review this MR again?

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Oct 5, 2023
@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Oct 24, 2023
@quraynai
Copy link
Author

@howardjohn @linsun hi could you please help review on this PR?

@rafatbiin
Copy link

@rafatbiin I think number of request is harder to configure and measure compare to number of active connections. Retries base on number of request we might need to know RPS and interval that we want to measure and it's hard to make retry budget work if we configure incorrectly.

@quraynai sorry for the late reply. when you said number of connections, you meant active requests, right? how can we measure this in a reliable way to configure the retry budget percentage? because, from my experience, it's very low level and hard to measure. it can literally changes from 100 to 0 in a second and depending on when envoy is taking the measurement, you'll get a totally different retry budget percentage. it's specially true with service which doesn't have a lot of outgoing requests. because if you consider the average case, the retry budget percentage will be very high and when we'll have real retry storm, it'll actually bite us. do you have any suggestion on how to go about measuring this metrics?

@hzxuzhonghu
Copy link
Member

This is a very advanced setting provided by envoy, we haven't had very critical requirements for it

@quraynai
Copy link
Author

quraynai commented Nov 7, 2023

@hzxuzhonghu I understand about it's not critical to have this functionality at the moment from istio.
but it's kinda problem to configure it from EnvoyFilter. You can check more detail from this link
istio/istio#27419

It's a simple config that we tried and we don't issue about retry storm so far

@quraynai
Copy link
Author

quraynai commented Nov 7, 2023

@rafatbiin maybe we can try with static number and readjust as global.
Understand tuning this value dynamically per service is really hard and might not worth it.

@quraynai
Copy link
Author

@hzxuzhonghu @linsun @howardjohn hi could you please help on this?

@howardjohn
Copy link
Member

I am concerned about this as retries are in VirtualService, so putting retry budget in DR is odd. It is to align with Envoy semantics only, not really designed from users perspective. Gateway-API is also looking into adding this in HTTPRoute; if we put it in DR it will be incompatible cc @mikemorris.

That being said, I don't know a way to put it in route without envoy changes.

@quraynai
Copy link
Author

@howardjohn I understand your concern. we will look into another option

@quraynai quraynai closed this Nov 29, 2023
@ericdbishop
Copy link
Member

I am concerned about this as retries are in VirtualService, so putting retry budget in DR is odd. It is to align with Envoy semantics only, not really designed from users perspective. Gateway-API is also looking into adding this in HTTPRoute; if we put it in DR it will be incompatible cc @mikemorris.

That being said, I don't know a way to put it in route without envoy changes.

@howardjohn We are interested in having access to the retry_budget circuit breaker threshold via the DR api, as max_retries is a static value that requires tuning for every service within the Mesh, making it difficult for us to update when services begin receiving more traffic. In comparison, retry_budget configurations require no major tuning since the retry value is relative, and it also cuts new requests earlier than using max_retries, performing better in our tests. But using retry_budget via EnvoyFilter's is extremely difficult to maintain across our services. Do you think there is any path forward to exposing this within DestinationRule we could work on? Thanks.

@mikemorris
Copy link
Member

mikemorris commented May 31, 2024

@ericdbishop We're starting to look at implementing retries (including retry budgets) upstream in Gateway API in kubernetes-sigs/gateway-api#1731

One thing I'd like to get more clarity on is where/how this should be implemented - my initial thought was as a field on HTTPRoute (similar to Timeouts), but I've seen a few mentions of DestinationRule or client-side config which I'd like to better understand the implications for (or any Envoy implementation constraints) as it would relate to both Ambient mesh without client sidecars and reconciling this functionality with other Gateway API implementations such as Linkerd's retry budgets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test Set this label allow normal testing to take place for a PR not submitted by an Istio org member. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants