Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support exclusion of listeners from overload manager #5260

Open
guydc opened this issue Feb 11, 2025 · 5 comments
Open

Support exclusion of listeners from overload manager #5260

guydc opened this issue Feb 11, 2025 · 5 comments
Assignees
Milestone

Comments

@guydc
Copy link
Contributor

guydc commented Feb 11, 2025

Description:
Currently, EG generates overload manager settings that stop accepting requests when heap utilization is high:

- name: "envoy.overload_actions.stop_accepting_requests"

This may have an unexpected side effect on the Envoy proxy pod's readiness probes, leading to probe failure, e.g. as seen here: envoyproxy/envoy#23843. Since liveness probes are not used by EG, the pod would not be restarted.

However, even a readiness probe failure may in some conditions lead to wide service disruption. For example, if using ExternalTrafficPolicy: Local and a single envoy pod in every node, an unready pod would trigger a failure of the NLB Health Check (https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/). Some NLBs, e.g. AWS NLB, are documented to reset all connections to that node in such cases:

If a target becomes unhealthy, the load balancer sends a TCP RST for packets received on the client connections associated with the target

EG Should:

  1. make it possible to exclude readiness, stats and other system listeners from overload manager
  2. make it possible to exclude user-defined listeners from overload manager, in case custom HCs are implemented

Envoy supports excluding listeners from overload manager: https://www.envoyproxy.io/docs/envoy/v1.31.5/api-v3/config/listener/v3/listener.proto.html#envoy-v3-api-field-config-listener-v3-listener-bypass-overload-manager

[optional Relevant Links:]

Any extra documentation required to understand the issue.

@guydc guydc added the triage label Feb 11, 2025
@arkodg arkodg added help wanted Extra attention is needed and removed triage labels Feb 11, 2025
@arkodg arkodg added this to the v1.4.0-rc.1 milestone Feb 11, 2025
@arkodg
Copy link
Contributor

arkodg commented Feb 11, 2025

+1 to defaulting exclusion of the stats and readiness listener

@rudrakhp
Copy link
Member

rudrakhp commented Feb 20, 2025

I want to give this a shot, quick question though:

make it possible to exclude user-defined listeners from overload manager, in case custom HCs are implemented

Maybe I don't have the full picture, but was wondering how this can be achieved, given we are confirmed by the Gateway API spec. Would an annotation in the Gateway object do the job? For example, an annotation gateway.envoyproxy.io/bypass-overload-manager-listeners would expect a comma separated list of listener names that are to be bypassed by the overload manager.

@guydc
Copy link
Contributor Author

guydc commented Feb 20, 2025

Hi @rudrakhp.

ClientTrafficPolicy seems to support attachment to a specific GW-API listener within a GW-API Gateway. So, we can make that a feature of CTP.

The challenge here might be handling contradicting definitions when the same listener is used in multiple GWs which are merged. It's a common problem though, for all settings that map to the Envoy Listener resource and not the Envoy Filter Chain (e.g. like downstream buffersize).

Another option, similar to what you proposed, is to use an explicit list of listener references (Incl. GW name, namespace and section (==listener) name) that should be excluded. That's something that could reside in EnvoyProxy CR and would be scoped to all GWs in the relevant GWC.

@arkodg
Copy link
Contributor

arkodg commented Feb 20, 2025

can we scope this down to disabling overload manager for the 2 static listeners (prometheus and health check) ?

@guydc
Copy link
Contributor Author

guydc commented Feb 21, 2025

can we scope this down to disabling overload manager for the 2 static listeners (prometheus and health check) ?

We can break it up, the first bit would be great for 1.4. The rest can come at a later phase.

I think that we do support the healthcheck filter and customizing the k8s probes, so the second bit is still relevant for some users, but I imagine that patch policy can be used as mitigation until this is explicitly supported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants