Overload manager doesn't exclude probes from stop_accepting_* actions #23843

caoyukun0430 · 2022-11-04T16:00:25Z

Title: Overload manager should exclude probes from stop_accepting_* actions

Description:
We have recently been testing the overload manager with Istio. Our scenario is to exceed the configured heap size of the ingressgateway by sending requests over a large number of concurrent connections. We then see the stop_accepting_connections and stop_accepting_requests actions trigger, and new requests fail.
But we found that stop_accepting_connections (and also stop_accepting_requests) blocks internal probes such as the liveness and readiness probes as well, which leads to a restart of the ingressgateway pod and causes problems for us.

We think it is reasonable to exclude probes such as the liveness and readiness probes from the overload manager stop_accepting_* actions: new traffic will still fail and return a 503 error to users to indicate the overload, so there seems to be no need to also stop internal probes.

Alternatively, we would like to know whether probes are intentionally included in the stop_accepting_* actions and, if so, why.

related topic #21923

phlax · 2022-11-07T09:16:08Z

cc @KBaichoo

hobbytp · 2022-11-08T01:53:04Z

@caoyukun0430 I think this is an Istio ingress gateway issue rather than an Envoy problem?

From my view, when the overload manager stop_accepting_* actions are triggered, the Istio ingress gateway readinessProbe would also be disabled, but the livenessProbe should still work so that the pod is not deleted by k8s.

caoyukun0430 · 2022-11-08T12:15:13Z

@hobbytp it is a bit vague and could also be related to Istio, so I have raised an issue there as well. It would be nice to have opinions from both sides.
@KBaichoo is there any improvement Envoy could make so that, when the overload manager is enabled, probes are excluded from being rejected? Thanks

KBaichoo · 2022-11-08T13:23:54Z

This is reasonable. We can augment the action more broadly to allow a debug IP or headers to decide whether traffic should be rejected, provided the probe traffic has some mechanism to identify itself. This would make sense in fully controlled environments or if there is a scrubbing proxy in front.

KBaichoo · 2022-11-08T13:24:48Z

This is similar to #20002

cc @nezdolik who might also be interested

caoyukun0430 · 2022-11-10T13:03:02Z

Hi @KBaichoo thanks for the reply! As we understand from Istio istio/istio#41859 (comment), the probe health check on port 15021 first reaches Envoy, which then sends a request to pilot-agent on port 15020 to get stats from Envoy and determine whether it has started. But under overload, the first step, the request to port 15021, is rejected by the overload action.
We want to propose that, at a minimum, port 15021 be excluded from the stop_* overload actions so that health checking of the ingressgateway still works.

Furthermore, we think it would be better if Envoy provided a configurable port value to be excluded from the overload stop actions. The use case is an overload manager configured on an application pod sidecar: the application could have its own health checking port or mechanism other than port 15021, which would also be rejected when the action is triggered. A configurable excludePort for the overload action would avoid the issue.
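
To make the proposal concrete, here is a rough Python sketch of the decision we have in mind. It is only a toy model and not Envoy code: `OverloadGate`, `admits`, and `excluded_ports` are hypothetical names that stand in for the proposed excludePort setting.

```python
from dataclasses import dataclass, field


@dataclass
class OverloadGate:
    """Toy model of a stop_accepting_* decision with a port exclusion list."""

    # Hypothetical stand-in for the proposed "excludePort" setting.
    excluded_ports: frozenset = field(default_factory=frozenset)
    # True while a stop_accepting_* action is active.
    overloaded: bool = False

    def admits(self, port: int) -> bool:
        # Outside of overload, everything is accepted as usual.
        if not self.overloaded:
            return True
        # During overload, only traffic on excluded ports gets through.
        return port in self.excluded_ports


gate = OverloadGate(excluded_ports=frozenset({15021}), overloaded=True)
print(gate.admits(15021))  # health check port is still served
print(gate.admits(8080))   # regular traffic is rejected
```

With this shape, a sidecar whose application uses a different health checking port would only need to list that port in the exclusion set.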

nezdolik · 2022-11-10T20:36:00Z

So far it looks like the use case is about letting liveness probes through to Envoy during overload.
Adding to what @KBaichoo has already said, I am wondering whether we could be more flexible in the configuration to accommodate broader use cases.
The problem could be generalised to accepting certain categories of traffic during an overload state based on certain criteria.
Examples of criteria could be listener/port, cluster, cluster+route, a combination of headers, etc.
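
As a rough illustration of what such criteria could look like, here is a small Python sketch that treats each criterion as a predicate over a request. Everything in it is hypothetical (the `Request` shape, the helper names, the `admin` cluster and the `x-debug` header); it is meant to show the idea of composable bypass rules, not a proposed Envoy API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable


@dataclass
class Request:
    """Minimal, hypothetical view of the attributes a criterion may inspect."""

    listener_port: int
    cluster: str = ""
    headers: Dict[str, str] = field(default_factory=dict)


Criterion = Callable[[Request], bool]


def on_listener_port(port: int) -> Criterion:
    return lambda req: req.listener_port == port


def to_cluster(name: str) -> Criterion:
    return lambda req: req.cluster == name


def has_header(name: str, value: str) -> Criterion:
    return lambda req: req.headers.get(name) == value


def all_of(*criteria: Criterion) -> Criterion:
    # Combine criteria, e.g. cluster plus header.
    return lambda req: all(c(req) for c in criteria)


def admitted_during_overload(req: Request, bypass_rules: Iterable[Criterion]) -> bool:
    # A request bypasses the overload action if any rule matches it.
    return any(rule(req) for rule in bypass_rules)


rules = [
    on_listener_port(15021),
    all_of(to_cluster("admin"), has_header("x-debug", "1")),
]
```

Here a probe on port 15021 would be admitted by the first rule, while traffic to the `admin` cluster would only be admitted if it also carried the debug header.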

caoyukun0430 · 2022-11-21T08:34:08Z

Hi @nezdolik @KBaichoo is there any update here, or any plan for a PR/fix for this one? Thanks a lot! :)

nezdolik · 2022-11-25T13:27:47Z

@caoyukun0430 at the moment no one is working on this issue. I may pick it up in a couple of weeks unless someone else does. You are also welcome to submit a patch :)

caoyukun0430 · 2023-01-17T14:16:46Z

Hi @nezdolik is there any update? Did you find some time to work on this issue? Thanks

briansonnenberg · 2023-09-08T16:57:25Z

I'd like to pick this issue up. cc @kyessenov

kyessenov · 2023-09-08T17:14:12Z

Go for it! I agree we want to keep an overloaded Envoy running as long as it can, so it must pass the probes. If we crash/delete it too early, k8s cannot scale up fast enough.

caoyukun0430 added the enhancement and triage labels Nov 4, 2022

phlax added the area/overload_manager label and removed the triage label Nov 7, 2022

caoyukun0430 mentioned this issue Nov 8, 2022

Probes are not excluded from overload manager actions istio/istio#41859

Closed

KBaichoo added the help wanted and area/health_checking labels Nov 8, 2022

KBaichoo assigned briansonnenberg Sep 8, 2023

briansonnenberg added a commit to briansonnenberg/envoy that referenced this issue Sep 21, 2023

First pass at envoyproxy#23843

96a43c3

briansonnenberg added a commit to briansonnenberg/envoy that referenced this issue Sep 21, 2023

Switch to using null_overload_manager_ pattern for envoyproxy#23843

23a4c53

briansonnenberg mentioned this issue Sep 28, 2023

Overload manager bypass flag for listeners #29781

Closed

sekar-saravanan mentioned this issue Mar 6, 2024

Emissary Ingress Readiness/Liveness Probe emissary-ingress/emissary#5588

Closed

singamL887 mentioned this issue Sep 5, 2024

Backport of bypass overload manager flag to envoy 1.29 #35985

Closed

guydc mentioned this issue Feb 11, 2025

Support exclusion of listeners from overload manager envoyproxy/gateway#5260

Open
