Overload manager bypass flag for listeners #29781

briansonnenberg · 2023-09-22T19:54:30Z

WIP + missing tests and release notes. Opening for discussion.

Pulled the NullOverloadManager instantiation out of admin and moved it into server, so that it can be used across any listeners that want to bypass overload manager. A health probe listener for example.

On a related note, while working on this I tried a "stop_accepting_requests" action in conjunction with "shrink_heap" and it seems once memory crosses the threshold, it never comes back down regardless of the shrink heap action, so it just keeps rejecting requests indefinitely. I'm not really sure what the point of the overload manager with memory thresholds is if the memory never comes back down. This seems true for both gperftools and tcmalloc.

repokitteh-read-only · 2023-09-22T19:54:41Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy[\w/]*/(v1alpha\d?|v1|v2alpha\d?|v2))|(api/envoy/type/(matcher/)?\w+.proto).
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #29781 was opened by briansonnenberg.

see: more, trace.

jmarantz · 2023-09-28T13:19:08Z

I flipped this over to draft as it claims to be a WiP. Moreover I don't fully understand why we'd need a code change to make a null overload manager. If OM is not working for you can't you just remove the OM config?

Anyway I think the issue as described above is not that the OM isn't working; it's that the stock memory tracking mechanism isn't working, though this may be dependent on how Envoy is built.

briansonnenberg · 2023-09-28T16:14:38Z

I flipped this over to draft as it claims to be a WiP. Moreover I don't fully understand why we'd need a code change to make a null overload manager. If OM is not working for you can't you just remove the OM config?

This code change isn't to address the issue I described—that is a related thing that I found while testing this. This is for issue #23843.

Anyway I think the issue as described above is not that the OM isn't working; it's that the stock memory tracking mechanism isn't working, though this may be dependent on how Envoy is built.

I agree.

github-actions · 2023-10-28T20:01:22Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions · 2023-11-05T00:03:09Z

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

KBaichoo · 2023-11-17T21:26:18Z

Sorry @briansonnenberg , I think I lost track of this as it was marked as WIP. Moving it to ready for review so I remember to review it.

kyessenov · 2023-11-20T20:14:12Z

/wait-for any

KBaichoo

Thank you for working on this @briansonnenberg !

Here's a first pass.
/wait

KBaichoo · 2023-11-20T22:26:46Z

api/envoy/api/v2/listener.proto

+  // Specifies if traffic accepted by this listener should be allowed in
+  // overload scenarios (e.g. listener handles health probes or otherwise
+  // critical traffic).  Default: false
+  bool bypass_overload_manager = 23;


do we still make changes to v2? I think it's frozen and shouldn't be changed.

KBaichoo · 2023-11-20T22:31:15Z

envoy/network/listener.h

+  /**
+   * Check whether the listener should bypass overload manager actions
+   */
+  virtual bool getBypassOverloadManager() PURE;


this is kind of confusingly named since we're not returning a overload manager or so. e.g. getX doesn't return X.

Consider: shouldBypassOverloadManager or similar if we need to return a bool

KBaichoo · 2023-11-20T22:34:55Z

envoy/server/factory_context.h

+   * @return NullOverloadManager& the dummy overload manager for the server for
+   * listeners that are bypassing a configured OverloadManager
+   */
+  virtual NullOverloadManager& nullOverloadManager() PURE;


maybe interface should return an OverloadManager& instead of this subclass so that it's "opaque" whether we have one or the other. This would also allow us push the implementation outside of the interface file as currently done.

KBaichoo · 2023-11-20T22:35:38Z

envoy/server/overload/overload_manager.h

+ * Implementation of OverloadManager that is never overloaded. Using this instead of the real
+ * OverloadManager keeps the interface accessible even when the proxy is overloaded.
+ */
+struct NullOverloadManager : public OverloadManager {


Can you move this to the overload_manager_impl file to follow the standard convention within the codebase?

github-actions · 2023-12-31T16:01:10Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

KBaichoo · 2024-01-02T14:27:47Z

any additional progress on this @briansonnenberg ?

briansonnenberg · 2024-01-05T20:15:57Z

@KBaichoo Yeah, looking at this it looks like there was a pretty large refactor that is causing a lot of conflicts with these changes. Looking into it now.

caoyukun0430 · 2024-01-23T11:20:22Z

Hi @briansonnenberg , thanks a lot for picking my issue up, great work!
But I have some questions about code changes:
If I understand correctly connection_handler_impl.cc, as long as bypassoverloadmanager flag is enabled., listener works as "normal", meaning like there is no overloadMgr configured.
My concern is that this might be too "broad" as all listeners are allowed without restriction. My initial idea also as people proposed in the issue #23843 (comment) is that bypassing can be allowed under certain criteria: e.g. combination of IP+port(for my issue 15021 health check port)

Allowing listeners other than probes to bypass might also hinder an overloaded envoy running as long as it can as @kyessenov mentioned. Considering envoy listening on 15021 for liveness probe and 8000 for HTTP traffic. I would suggest we only bypass the 15021 port in the overload case as the 8000 is the source of overloading and it should not be bypassed.

Your opinions are appreciated! thanks!

davinci26 · 2024-01-31T14:24:53Z

Hey @briansonnenberg thanks for working on this.

I am also coming from the angle of https://github.com/projectcontour/contour which would also would love to use this feature because it follows a similar configuration pattern of exposing a subset of admin interface via a dedicated listener. And we would like to exempt this listener from overload manager to avoid k8s liveness/readiness/stats getting shedded when the system is overloaded.

@caoyukun0430

My initial idea also as people proposed in the issue #23843 (comment) is that bypassing can be allowed under certain criteria: e.g. combination of IP+port(for my issue 15021 health check port)

That makes overall sense to me.

With respect to port do you mean source port? Because if it is the port that Envoy listens to isn't that basically what we implement here? I assume 15021 is a dedicated port for health checks and you have a dedicated envoy listener listening on that port? Or am I missing something?
With regards to source IP that can be allowlisted instead of allowing all traffic that also makes sense but the question is that in order to have IP information we need to work on all incoming requests i.e. accept the request. So there is a cost associated to it that might push Envoy and that might defeat the purpose of overload manager. This depends on the implementation details.

From my perspective this PR seems like it can cover 95% of the usecase but for the remaining we need some thought and this is useful as it stands. So I wonder if we can just change the API surface:

message  OverloadManagerListenerConfig {
     bool bypass_overload_manager = 1;
}

OverloadManagerListenerConfig overload_manager_listener_config

And that way we can extend the config to capture all of those use cases in the future by having a dedicated api config.

davinci26 · 2024-01-31T14:26:30Z

In any case @briansonnenberg if you end up becoming too busy, life got in the way and want to hand this over.

I am happy to make a PR on top of your PR to help merge this.

briansonnenberg · 2024-01-31T17:05:25Z

From my perspective this PR seems like it can cover 95% of the usecase but for the remaining we need some thought and this is useful as it stands. So I wonder if we can just change the API surface:
message  OverloadManagerListenerConfig {
     bool bypass_overload_manager = 1;
}
OverloadManagerListenerConfig overload_manager_listener_config
And that way we can extend the config to capture all of those use cases in the future by having a dedicated api config.

I like this, and agree that it's good enough for most use cases as it currently is. Feel free to add to this. I will get around to it eventually, but have different work prioritized currently.

caoyukun0430 · 2024-02-01T21:17:58Z

Hey @briansonnenberg thanks for working on this.

I am also coming from the angle of https://github.com/projectcontour/contour which would also would love to use this feature because it follows a similar configuration pattern of exposing a subset of admin interface via a dedicated listener. And we would like to exempt this listener from overload manager to avoid k8s liveness/readiness/stats getting shedded when the system is overloaded.

@caoyukun0430

My initial idea also as people proposed in the issue #23843 (comment) is that bypassing can be allowed under certain criteria: e.g. combination of IP+port(for my issue 15021 health check port)

That makes overall sense to me.

With respect to port do you mean source port? Because if it is the port that Envoy listens to isn't that basically what we implement here? I assume 15021 is a dedicated port for health checks and you have a dedicated envoy listener listening on that port? Or am I missing something?

With regards to source IP that can be allowlisted instead of allowing all traffic that also makes sense but the question is that in order to have IP information we need to work on all incoming requests i.e. accept the request. So there is a cost associated to it that might push Envoy and that might defeat the purpose of overload manager. This depends on the implementation details.

From my perspective this PR seems like it can cover 95% of the usecase but for the remaining we need some thought and this is useful as it stands. So I wonder if we can just change the API surface:
message  OverloadManagerListenerConfig {
     bool bypass_overload_manager = 1;
}
OverloadManagerListenerConfig overload_manager_listener_config
And that way we can extend the config to capture all of those use cases in the future by having a dedicated api config.

Hi @davinci26 , thanks for the comments! It's nice to have you and @briansonnenberg willing to discuss and pick up. Overall, I agree with your points, about IPs, yes, filtering over IP can add overhead and it might be not what we want under overload manager scenarios.
But about the options of port. I think I might have misunderstood before, so the bypass_overload_manager bool flag can be configured per listener, and it should fulfill the need to be enabled on the health check 15021 listener :)

davinci26 · 2024-02-01T22:06:07Z

if that covers your requirements @caoyukun0430 I am tempted to say YAGNI and then move with the simple API change as proposed here and then we can always deprecate and extend if the use case appears.

@jmarantz to your question

If OM is not working for you can't you just remove the OM config?

The problem that many folks have is when running in k8s you dont want to expose the admin listener to the world so contour (and I assume other providers) expose a different standard envoy listener lets call it admin-listener that exposes a subset of envoy admin routes like /ready /stats to other systems running in k8s.

With this change we allow envoy to be able to answer to health checks and metrics when the OM triggers because the admin-listener is exempt from OM. Or you don't want envoy when the OM trigger to not be able to answer health checks (those are in admin-listener) because that you mark your container as unhealthy and make the orchestrator (k8s) to restart it which defats the purpose of OM in many cases

So people like me want the OM just on the listener that serves user traffic and not on listeners serving auxiliary traffic

Hope that explains a bit better

davinci26 · 2024-02-05T13:54:55Z

I started working on the PR this weekend, it is more of a side project even tho I would like to use it at work so updates are going to be a bit slower. I will try to make a PR tomorrow or friday. One small thing is that the code has been through some refactor to introduce overload_state which doesn't change much but it makes a bit of a difference

cancecen · 2024-02-13T22:39:44Z

We are very much looking forward to this change, as we are rolling out a new CPU utilization monitor that sheds load using overload manager when our nodes are overloaded and we want to restrict this behavior to certain request. Adding this flag on the listener helps, but there may be use cases like "allow all traffic from the localhost even when overloaded" and this traffic may land on multiple listeners. So, it may be better to do this at filter chain level? That might be a big change now, so happy to start here, but something to keep in mind if this work is still WIP.

github-actions · 2024-03-15T00:02:32Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions · 2024-03-22T00:02:50Z

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Commit Message: Add the ability to bypass overload manager for listeners Additional Description: This flag can be used to disable overload manager on specific listeners where, for instance, we don't want to stop accepting requests. In my company, we implemented a CPU Utilization resource monitor that helps us drop requests when we hit a certain utilization percentage, but there are certain listeners that receive administrative traffic that we don't want overload manager to touch. Another use case is, we want to only throttle ingress traffic but not egress traffic going via Envoy. Another contributor authored #29781, but it has been marked as stale. Risk Level: Low Testing: Unit tests & Integration tests added Docs Changes: No Release Notes: Add bypass_overload_manager flag to Listener in order to prevent overload manager from taking actions on the traffic going through the said listener. Platform Specific Features: Signed-off-by: Fernando Cainelli <fernando.cainelli-external@getyourguide.com> Signed-off-by: Can Cecen <ccecen@netflix.com>

Add the ability to bypass overload manager for listeners (envoyproxy#34322) Commit Message: Add the ability to bypass overload manager for listeners Additional Description: This flag can be used to disable overload manager on specific listeners where, for instance, we don't want to stop accepting requests. In my company, we implemented a CPU Utilization resource monitor that helps us drop requests when we hit a certain utilization percentage, but there are certain listeners that receive administrative traffic that we don't want overload manager to touch. Another use case is, we want to only throttle ingress traffic but not egress traffic going via Envoy. Another contributor authored envoyproxy#29781, but it has been marked as stale. Risk Level: Low Testing: Unit tests & Integration tests added Docs Changes: No Release Notes: Add bypass_overload_manager flag to Listener in order to prevent overload manager from taking actions on the traffic going through the said listener. Platform Specific Features: Signed-off-by: Fernando Cainelli <fernando.cainelli-external@getyourguide.com> Signed-off-by: Can Cecen <ccecen@netflix.com> Signed-off-by: Can Cecen <cecen.ycan@gmail.com>

briansonnenberg added 2 commits September 21, 2023 22:00

First pass at envoyproxy#23843

96a43c3

Switch to using null_overload_manager_ pattern for envoyproxy#23843

23a4c53

briansonnenberg requested review from kyessenov, adisuissa, alyssawilk and mattklein123 as code owners September 22, 2023 19:54

repokitteh-read-only bot added v2-freeze api labels Sep 22, 2023

repokitteh-read-only bot assigned markdroth Sep 22, 2023

ravenblackx assigned kyessenov Sep 25, 2023

jmarantz marked this pull request as draft September 28, 2023 13:07

jmarantz assigned KBaichoo Sep 28, 2023

github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Oct 28, 2023

github-actions bot closed this Nov 5, 2023

KBaichoo reopened this Nov 17, 2023

KBaichoo marked this pull request as ready for review November 17, 2023 21:26

github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Nov 18, 2023

KBaichoo reviewed Nov 20, 2023

View reviewed changes

repokitteh-read-only bot added the waiting label Nov 20, 2023

kyessenov removed their assignment Nov 20, 2023

github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Dec 31, 2023

github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Jan 2, 2024

briansonnenberg mentioned this pull request Jan 10, 2024

agent: host Envoy readiness port on the agent istio/istio#48709

Closed

github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 15, 2024

github-actions bot closed this Mar 22, 2024

cancecen mentioned this pull request May 23, 2024

Add the ability to bypass overload manager for listeners #34322

Merged

cancecen mentioned this pull request Sep 10, 2024

Backport #34322 to release/v1.29. #36064

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overload manager bypass flag for listeners #29781

Overload manager bypass flag for listeners #29781

briansonnenberg commented Sep 22, 2023

repokitteh-read-only bot commented Sep 22, 2023

jmarantz commented Sep 28, 2023

briansonnenberg commented Sep 28, 2023

github-actions bot commented Oct 28, 2023

github-actions bot commented Nov 5, 2023

KBaichoo commented Nov 17, 2023

kyessenov commented Nov 20, 2023

KBaichoo left a comment

KBaichoo Nov 20, 2023

KBaichoo Nov 20, 2023

KBaichoo Nov 20, 2023

KBaichoo Nov 20, 2023

github-actions bot commented Dec 31, 2023

KBaichoo commented Jan 2, 2024

briansonnenberg commented Jan 5, 2024

caoyukun0430 commented Jan 23, 2024

davinci26 commented Jan 31, 2024

davinci26 commented Jan 31, 2024

briansonnenberg commented Jan 31, 2024

caoyukun0430 commented Feb 1, 2024

davinci26 commented Feb 1, 2024 •

edited

Loading

davinci26 commented Feb 5, 2024

cancecen commented Feb 13, 2024

github-actions bot commented Mar 15, 2024

github-actions bot commented Mar 22, 2024

Overload manager bypass flag for listeners #29781

Overload manager bypass flag for listeners #29781

Conversation

briansonnenberg commented Sep 22, 2023

repokitteh-read-only bot commented Sep 22, 2023

jmarantz commented Sep 28, 2023

briansonnenberg commented Sep 28, 2023

github-actions bot commented Oct 28, 2023

github-actions bot commented Nov 5, 2023

KBaichoo commented Nov 17, 2023

kyessenov commented Nov 20, 2023

KBaichoo left a comment

Choose a reason for hiding this comment

KBaichoo Nov 20, 2023

Choose a reason for hiding this comment

KBaichoo Nov 20, 2023

Choose a reason for hiding this comment

KBaichoo Nov 20, 2023

Choose a reason for hiding this comment

KBaichoo Nov 20, 2023

Choose a reason for hiding this comment

github-actions bot commented Dec 31, 2023

KBaichoo commented Jan 2, 2024

briansonnenberg commented Jan 5, 2024

caoyukun0430 commented Jan 23, 2024

davinci26 commented Jan 31, 2024

davinci26 commented Jan 31, 2024

briansonnenberg commented Jan 31, 2024

caoyukun0430 commented Feb 1, 2024

davinci26 commented Feb 1, 2024 • edited Loading

davinci26 commented Feb 5, 2024

cancecen commented Feb 13, 2024

github-actions bot commented Mar 15, 2024

github-actions bot commented Mar 22, 2024

davinci26 commented Feb 1, 2024 •

edited

Loading