Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overload manager bypass flag for listeners #29781

Conversation

briansonnenberg
Copy link
Contributor

WIP + missing tests and release notes. Opening for discussion.

Pulled the NullOverloadManager instantiation out of admin and moved it into server, so that it can be used across any listeners that want to bypass overload manager. A health probe listener for example.

On a related note, while working on this I tried a "stop_accepting_requests" action in conjunction with "shrink_heap" and it seems once memory crosses the threshold, it never comes back down regardless of the shrink heap action, so it just keeps rejecting requests indefinitely. I'm not really sure what the point of the overload manager with memory thresholds is if the memory never comes back down. This seems true for both gperftools and tcmalloc.

@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy[\w/]*/(v1alpha\d?|v1|v2alpha\d?|v2))|(api/envoy/type/(matcher/)?\w+.proto).
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #29781 was opened by briansonnenberg.

see: more, trace.

@jmarantz
Copy link
Contributor

I flipped this over to draft as it claims to be a WiP. Moreover I don't fully understand why we'd need a code change to make a null overload manager. If OM is not working for you can't you just remove the OM config?

Anyway I think the issue as described above is not that the OM isn't working; it's that the stock memory tracking mechanism isn't working, though this may be dependent on how Envoy is built.

@briansonnenberg
Copy link
Contributor Author

I flipped this over to draft as it claims to be a WiP. Moreover I don't fully understand why we'd need a code change to make a null overload manager. If OM is not working for you can't you just remove the OM config?

This code change isn't to address the issue I described—that is a related thing that I found while testing this. This is for issue #23843.

Anyway I think the issue as described above is not that the OM isn't working; it's that the stock memory tracking mechanism isn't working, though this may be dependent on how Envoy is built.

I agree.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Oct 28, 2023
Copy link

github-actions bot commented Nov 5, 2023

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Nov 5, 2023
@KBaichoo KBaichoo reopened this Nov 17, 2023
@KBaichoo
Copy link
Contributor

Sorry @briansonnenberg , I think I lost track of this as it was marked as WIP. Moving it to ready for review so I remember to review it.

@KBaichoo KBaichoo marked this pull request as ready for review November 17, 2023 21:26
@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Nov 18, 2023
@kyessenov
Copy link
Contributor

/wait-for any

Copy link
Contributor

@KBaichoo KBaichoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this @briansonnenberg !

Here's a first pass.
/wait

// Specifies if traffic accepted by this listener should be allowed in
// overload scenarios (e.g. listener handles health probes or otherwise
// critical traffic). Default: false
bool bypass_overload_manager = 23;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still make changes to v2? I think it's frozen and shouldn't be changed.

/**
* Check whether the listener should bypass overload manager actions
*/
virtual bool getBypassOverloadManager() PURE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is kind of confusingly named since we're not returning a overload manager or so. e.g. getX doesn't return X.

Consider: shouldBypassOverloadManager or similar if we need to return a bool

* @return NullOverloadManager& the dummy overload manager for the server for
* listeners that are bypassing a configured OverloadManager
*/
virtual NullOverloadManager& nullOverloadManager() PURE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe interface should return an OverloadManager& instead of this subclass so that it's "opaque" whether we have one or the other. This would also allow us push the implementation outside of the interface file as currently done.

* Implementation of OverloadManager that is never overloaded. Using this instead of the real
* OverloadManager keeps the interface accessible even when the proxy is overloaded.
*/
struct NullOverloadManager : public OverloadManager {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to the overload_manager_impl file to follow the standard convention within the codebase?

@kyessenov kyessenov removed their assignment Nov 20, 2023
Copy link

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Dec 31, 2023
@KBaichoo
Copy link
Contributor

KBaichoo commented Jan 2, 2024

any additional progress on this @briansonnenberg ?

@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Jan 2, 2024
@briansonnenberg
Copy link
Contributor Author

@KBaichoo Yeah, looking at this it looks like there was a pretty large refactor that is causing a lot of conflicts with these changes. Looking into it now.

@caoyukun0430
Copy link

Hi @briansonnenberg , thanks a lot for picking my issue up, great work!
But I have some questions about code changes:
If I understand correctly connection_handler_impl.cc, as long as bypassoverloadmanager flag is enabled., listener works as "normal", meaning like there is no overloadMgr configured.
My concern is that this might be too "broad" as all listeners are allowed without restriction. My initial idea also as people proposed in the issue #23843 (comment) is that bypassing can be allowed under certain criteria: e.g. combination of IP+port(for my issue 15021 health check port)

Allowing listeners other than probes to bypass might also hinder an overloaded envoy running as long as it can as @kyessenov mentioned. Considering envoy listening on 15021 for liveness probe and 8000 for HTTP traffic. I would suggest we only bypass the 15021 port in the overload case as the 8000 is the source of overloading and it should not be bypassed.

Your opinions are appreciated! thanks!

@davinci26
Copy link
Member

Hey @briansonnenberg thanks for working on this.

I am also coming from the angle of https://github.com/projectcontour/contour which would also would love to use this feature because it follows a similar configuration pattern of exposing a subset of admin interface via a dedicated listener. And we would like to exempt this listener from overload manager to avoid k8s liveness/readiness/stats getting shedded when the system is overloaded.

@caoyukun0430

My initial idea also as people proposed in the issue #23843 (comment) is that bypassing can be allowed under certain criteria: e.g. combination of IP+port(for my issue 15021 health check port)

That makes overall sense to me.

  • With respect to port do you mean source port? Because if it is the port that Envoy listens to isn't that basically what we implement here? I assume 15021 is a dedicated port for health checks and you have a dedicated envoy listener listening on that port? Or am I missing something?

  • With regards to source IP that can be allowlisted instead of allowing all traffic that also makes sense but the question is that in order to have IP information we need to work on all incoming requests i.e. accept the request. So there is a cost associated to it that might push Envoy and that might defeat the purpose of overload manager. This depends on the implementation details.

From my perspective this PR seems like it can cover 95% of the usecase but for the remaining we need some thought and this is useful as it stands. So I wonder if we can just change the API surface:

message  OverloadManagerListenerConfig {
     bool bypass_overload_manager = 1;
}
OverloadManagerListenerConfig overload_manager_listener_config

And that way we can extend the config to capture all of those use cases in the future by having a dedicated api config.

@davinci26
Copy link
Member

In any case @briansonnenberg if you end up becoming too busy, life got in the way and want to hand this over.

I am happy to make a PR on top of your PR to help merge this.

@briansonnenberg
Copy link
Contributor Author

From my perspective this PR seems like it can cover 95% of the usecase but for the remaining we need some thought and this is useful as it stands. So I wonder if we can just change the API surface:

message  OverloadManagerListenerConfig {
     bool bypass_overload_manager = 1;
}
OverloadManagerListenerConfig overload_manager_listener_config

And that way we can extend the config to capture all of those use cases in the future by having a dedicated api config.

I like this, and agree that it's good enough for most use cases as it currently is. Feel free to add to this. I will get around to it eventually, but have different work prioritized currently.

@caoyukun0430
Copy link

Hey @briansonnenberg thanks for working on this.

I am also coming from the angle of https://github.com/projectcontour/contour which would also would love to use this feature because it follows a similar configuration pattern of exposing a subset of admin interface via a dedicated listener. And we would like to exempt this listener from overload manager to avoid k8s liveness/readiness/stats getting shedded when the system is overloaded.

@caoyukun0430

My initial idea also as people proposed in the issue #23843 (comment) is that bypassing can be allowed under certain criteria: e.g. combination of IP+port(for my issue 15021 health check port)

That makes overall sense to me.

  • With respect to port do you mean source port? Because if it is the port that Envoy listens to isn't that basically what we implement here? I assume 15021 is a dedicated port for health checks and you have a dedicated envoy listener listening on that port? Or am I missing something?
  • With regards to source IP that can be allowlisted instead of allowing all traffic that also makes sense but the question is that in order to have IP information we need to work on all incoming requests i.e. accept the request. So there is a cost associated to it that might push Envoy and that might defeat the purpose of overload manager. This depends on the implementation details.

From my perspective this PR seems like it can cover 95% of the usecase but for the remaining we need some thought and this is useful as it stands. So I wonder if we can just change the API surface:

message  OverloadManagerListenerConfig {
     bool bypass_overload_manager = 1;
}
OverloadManagerListenerConfig overload_manager_listener_config

And that way we can extend the config to capture all of those use cases in the future by having a dedicated api config.

Hi @davinci26 , thanks for the comments! It's nice to have you and @briansonnenberg willing to discuss and pick up. Overall, I agree with your points, about IPs, yes, filtering over IP can add overhead and it might be not what we want under overload manager scenarios.
But about the options of port. I think I might have misunderstood before, so the bypass_overload_manager bool flag can be configured per listener, and it should fulfill the need to be enabled on the health check 15021 listener :)

@davinci26
Copy link
Member

davinci26 commented Feb 1, 2024

if that covers your requirements @caoyukun0430 I am tempted to say YAGNI and then move with the simple API change as proposed here and then we can always deprecate and extend if the use case appears.

@jmarantz to your question

If OM is not working for you can't you just remove the OM config?

The problem that many folks have is when running in k8s you dont want to expose the admin listener to the world so contour (and I assume other providers) expose a different standard envoy listener lets call it admin-listener that exposes a subset of envoy admin routes like /ready /stats to other systems running in k8s.

With this change we allow envoy to be able to answer to health checks and metrics when the OM triggers because the admin-listener is exempt from OM. Or you don't want envoy when the OM trigger to not be able to answer health checks (those are in admin-listener) because that you mark your container as unhealthy and make the orchestrator (k8s) to restart it which defats the purpose of OM in many cases

So people like me want the OM just on the listener that serves user traffic and not on listeners serving auxiliary traffic

Hope that explains a bit better

@davinci26
Copy link
Member

I started working on the PR this weekend, it is more of a side project even tho I would like to use it at work so updates are going to be a bit slower. I will try to make a PR tomorrow or friday. One small thing is that the code has been through some refactor to introduce overload_state which doesn't change much but it makes a bit of a difference

@cancecen
Copy link
Contributor

We are very much looking forward to this change, as we are rolling out a new CPU utilization monitor that sheds load using overload manager when our nodes are overloaded and we want to restrict this behavior to certain request. Adding this flag on the listener helps, but there may be use cases like "allow all traffic from the localhost even when overloaded" and this traffic may land on multiple listeners. So, it may be better to do this at filter chain level? That might be a big change now, so happy to start here, but something to keep in mind if this work is still WIP.

Copy link

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 15, 2024
Copy link

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Mar 22, 2024
RyanTheOptimist pushed a commit that referenced this pull request Jun 3, 2024
Commit Message: Add the ability to bypass overload manager for listeners
Additional Description: This flag can be used to disable overload manager on specific listeners where, for instance, we don't want to stop accepting requests. In my company, we implemented a CPU Utilization resource monitor that helps us drop requests when we hit a certain utilization percentage, but there are certain listeners that receive administrative traffic that we don't want overload manager to touch. Another use case is, we want to only throttle ingress traffic but not egress traffic going via Envoy. Another contributor authored #29781, but it has been marked as stale.

Risk Level: Low
Testing: Unit tests & Integration tests added
Docs Changes: No
Release Notes: Add bypass_overload_manager flag to Listener in order to prevent overload manager from taking actions on the traffic going through the said listener.
Platform Specific Features:

Signed-off-by: Fernando Cainelli <fernando.cainelli-external@getyourguide.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
cancecen added a commit to cancecen/envoy that referenced this pull request Sep 10, 2024
Add the ability to bypass overload manager for listeners (envoyproxy#34322)

Commit Message: Add the ability to bypass overload manager for listeners
Additional Description: This flag can be used to disable overload manager on specific listeners where, for instance, we don't want to stop accepting requests. In my company, we implemented a CPU Utilization resource monitor that helps us drop requests when we hit a certain utilization percentage, but there are certain listeners that receive administrative traffic that we don't want overload manager to touch. Another use case is, we want to only throttle ingress traffic but not egress traffic going via Envoy. Another contributor authored envoyproxy#29781, but it has been marked as stale.

Risk Level: Low
Testing: Unit tests & Integration tests added
Docs Changes: No
Release Notes: Add bypass_overload_manager flag to Listener in order to prevent overload manager from taking actions on the traffic going through the said listener.
Platform Specific Features:

Signed-off-by: Fernando Cainelli <fernando.cainelli-external@getyourguide.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>

Signed-off-by: Can Cecen <cecen.ycan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api stale stalebot believes this issue/PR has not been touched recently v2-freeze waiting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants