Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overload manager doesn't exclude probes from stop_accepting_* actions #23843

Open
caoyukun0430 opened this issue Nov 4, 2022 · 12 comments
Open
Assignees
Labels
area/health_checking area/overload_manager enhancement Feature requests. Not bugs or questions. help wanted Needs help!

Comments

@caoyukun0430
Copy link

Title: Overload manager should exclude probes from stop_accept_ actions*

Description:
We are recently testing overload manager with istio. Our scenario is to overload the ingressgateway heap_size configured via sending requests from large number of concurrent connections, then we see stop_accepting_connections/requests actions triggered and new requests will fail.
But we found stop_accepting_connections(also stop_accepting_requests) stops the internal probes like liveness probe and readiness probe, which leads to the restart of the ingressgateway pod, which will causes issue for us.

We think it's reasonable to exclude the probes like liveness/readiness probe from overload manager stop_accepting_* actions, because new traffic will still fail and return users error 503 to indicate the overload, there seems to be no need to also stop internal probes.

Or we would like to know if it's intentionally to include probes into stop_accepting_* actions, if so, why?

related topic #21923

@caoyukun0430 caoyukun0430 added enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Nov 4, 2022
@phlax
Copy link
Member

phlax commented Nov 7, 2022

cc @KBaichoo

@phlax phlax added area/overload_manager and removed triage Issue requires triage labels Nov 7, 2022
@hobbytp
Copy link

hobbytp commented Nov 8, 2022

@caoyukun0430 I think this is an istio ingress gateway issue instead of envoy problem?

From my view, when overload manager stop_accepting_* actions are triggered, the istio ingress gateway readinessProbe would also be disabled, but livenessProbe should still work to avoid the pod is deleted by k8s.

@caoyukun0430
Copy link
Author

@hobbytp it is a bit vague and could be also related to istio, issue raised. It would be nice to have opinions from both sides.
@KBaichoo would there be any possible improvement that envoy could do to improve the probes so that when overload manager is enabled, probes are excluded from being rejected? Thanks

@KBaichoo
Copy link
Contributor

KBaichoo commented Nov 8, 2022

This is reasonable, we can augment the action more broadly to either allow debug IP / headers to decide whether the traffic should be rejected if there's some mechanism the probe traffic has to identify it; This would make sense in fully controlled environments or if there's a scrubbing proxy beforehand.

@KBaichoo
Copy link
Contributor

KBaichoo commented Nov 8, 2022

This is similar to #20002

cc @nezdolik who might also be interested

@caoyukun0430
Copy link
Author

Hi @KBaichoo thanks for the reply! As what we understand from Istio istio/istio#41859 (comment), the probe health checking logic on port 15021 first reach envoy then request to pilot-agent on port 15020 to get stats from envoy to see if started or not. But in overload condition, the first step, which is request to 15021 is rejected by overload action.
We want to propose that at least it would be reasonable to exclude port 15021 from stop_* overload actions so that health checking of ingressgateway will still work.

Furthermore, we would think it would be better that envoy can provide configurable port value to be excluded from overload stopping actions. The use case would be that when overload manager is configured on application pod sidecar, application could have its own health checking port/mechanism other than port 15021, then it would ask be rejected when action is triggered, but with the configurable excludePort for overload action, it could avoid the issue.

@nezdolik
Copy link
Member

So far looks like the use case has been about letting in liveness probes to Envoy during overload.
To what @KBaichoo has already said, am thinking if we could be more flexible here in terms of configuration to accommodate for broader use cases.
The problem could be generalised to accept certain categories of traffic during overload state based on certain criteria.
Examples of criteria could be listener/port, cluster, cluster+route, combination of headers etc.

@caoyukun0430
Copy link
Author

caoyukun0430 commented Nov 21, 2022

Hi @nezdolik @KBaichoo is there any update here or is there any plan for some PR/fix for this one? thanks a lot! :)

@nezdolik
Copy link
Member

@caoyukun0430 at the moment no one is working on this issue. I may pick it up in couple of weeks unless someone else does. You are also welcome to submit a patch :)

@caoyukun0430
Copy link
Author

Hi @nezdolik is there any update, did you find some time maybe to work on this issue? Thanks

@briansonnenberg
Copy link
Contributor

I'd like to pick this issue up. cc @kyessenov

@kyessenov
Copy link
Contributor

Go for it! I agree we want to keep an overloaded envoy to run as long as it can so it must pass the probes. If we crash/delete it too early, k8s cannot scale up fast enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/health_checking area/overload_manager enhancement Feature requests. Not bugs or questions. help wanted Needs help!
Projects
None yet
Development

No branches or pull requests

7 participants