Overload manager bypass flag for listeners #29781
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
I flipped this over to draft as it claims to be a WiP. Moreover, I don't fully understand why we'd need a code change to make a null overload manager. If OM is not working for you, can't you just remove the OM config? Anyway, I think the issue as described above is not that the OM isn't working; it's that the stock memory tracking mechanism isn't working, though this may depend on how Envoy is built.
This code change isn't meant to address the issue I described; that is a related thing I found while testing this. This is for issue #23843.
I agree.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Sorry @briansonnenberg, I think I lost track of this as it was marked as WIP. Moving it to ready for review so I remember to review it.
/wait-for any |
Thank you for working on this @briansonnenberg !
Here's a first pass.
/wait
// Specifies if traffic accepted by this listener should be allowed in
// overload scenarios (e.g. listener handles health probes or otherwise
// critical traffic). Default: false
bool bypass_overload_manager = 23;
Do we still make changes to v2? I think it's frozen and shouldn't be changed.
/**
 * Check whether the listener should bypass overload manager actions
 */
virtual bool getBypassOverloadManager() PURE;
This is confusingly named, since it doesn't return an overload manager: getX here doesn't return X.
Consider shouldBypassOverloadManager, or similar, if we need to return a bool.
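To illustrate the naming suggestion above, here is a minimal sketch. It assumes a reduced, hypothetical `ListenerConfig` interface (not the real Envoy class), uses `= 0` where the Envoy codebase uses its `PURE` macro, and adds `const`, which the original declaration does not have.

```cpp
namespace sketch {

// Hypothetical, simplified stand-in for a listener config interface.
class ListenerConfig {
public:
  virtual ~ListenerConfig() = default;

  // Phrased as a yes/no question, so a bool return type is unsurprising.
  virtual bool shouldBypassOverloadManager() const = 0;
};

// Hypothetical implementation that stores the value of the
// bypass_overload_manager field from the listener config.
class ListenerConfigImpl : public ListenerConfig {
public:
  explicit ListenerConfigImpl(bool bypass_overload_manager)
      : bypass_overload_manager_(bypass_overload_manager) {}

  bool shouldBypassOverloadManager() const override { return bypass_overload_manager_; }

private:
  const bool bypass_overload_manager_;
};

} // namespace sketch
```

A call site then reads naturally, e.g. `if (config.shouldBypassOverloadManager()) { ... }`.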
 * @return NullOverloadManager& the dummy overload manager for the server for
 * listeners that are bypassing a configured OverloadManager
 */
virtual NullOverloadManager& nullOverloadManager() PURE;
Maybe the interface should return an OverloadManager& instead of this subclass, so that it's "opaque" whether we have one or the other. This would also let us move the implementation out of the interface file, where it currently lives.
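A rough sketch of what this suggestion could look like, under stated assumptions: the `OverloadManager` interface here is reduced to a single hypothetical method, `shouldStopAcceptingRequests()`, and `Instance`, `InstanceImpl`, and `setOverloaded()` are simplified stand-ins rather than the real Envoy classes. The point is only that both accessors return the base type, so `NullOverloadManager` can live next to the implementation.

```cpp
namespace sketch {

// Interface side: hypothetical, heavily reduced OverloadManager.
class OverloadManager {
public:
  virtual ~OverloadManager() = default;
  virtual bool shouldStopAcceptingRequests() const = 0;
};

// Interface side: callers only ever see OverloadManager&.
class Instance {
public:
  virtual ~Instance() = default;
  virtual OverloadManager& overloadManager() = 0;
  virtual OverloadManager& nullOverloadManager() = 0;
};

// Implementation side: the "real" manager, reduced to a settable flag.
class OverloadManagerImpl : public OverloadManager {
public:
  void setOverloaded(bool overloaded) { overloaded_ = overloaded; }
  bool shouldStopAcceptingRequests() const override { return overloaded_; }

private:
  bool overloaded_{false};
};

// Implementation side: a manager that never reports overload.
class NullOverloadManager : public OverloadManager {
public:
  bool shouldStopAcceptingRequests() const override { return false; }
};

class InstanceImpl : public Instance {
public:
  OverloadManager& overloadManager() override { return overload_manager_; }
  OverloadManager& nullOverloadManager() override { return null_overload_manager_; }
  void setOverloaded(bool overloaded) { overload_manager_.setOverloaded(overloaded); }

private:
  OverloadManagerImpl overload_manager_;
  NullOverloadManager null_overload_manager_;
};

} // namespace sketch
```

With this shape, code that holds an `OverloadManager&` does not need to know which of the two it was given.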
 * Implementation of OverloadManager that is never overloaded. Using this instead of the real
 * OverloadManager keeps the interface accessible even when the proxy is overloaded.
 */
struct NullOverloadManager : public OverloadManager {
Can you move this to the overload_manager_impl file to follow the standard convention within the codebase?
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Any additional progress on this, @briansonnenberg?
@KBaichoo Yeah, looking at this it looks like there was a pretty large refactor that is causing a lot of conflicts with these changes. Looking into it now. |
Hi @briansonnenberg, thanks a lot for picking up my issue, great work! Allowing listeners other than probes to bypass might also keep an overloaded Envoy from running as long as it can, as @kyessenov mentioned. Consider Envoy listening on 15021 for liveness probes and on 8000 for HTTP traffic. I would suggest that we bypass only port 15021 in the overload case: port 8000 is the source of the overload, so it should not be bypassed. Your opinions are appreciated, thanks!
Hey @briansonnenberg, thanks for working on this. I am also coming from the angle of https://github.com/projectcontour/contour, which would also love to use this feature because it follows a similar configuration pattern of exposing a subset of
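As a minimal sketch of the behavior proposed above (not how Envoy implements it), the per-listener decision can be modeled as below. `Listener` and `acceptsTraffic` are hypothetical names introduced for illustration; only the `bypass_overload_manager` field name and the two ports come from this thread.

```cpp
#include <cstdint>

namespace sketch {

// Hypothetical, reduced view of a listener: its port and its bypass flag.
struct Listener {
  std::uint16_t port;
  bool bypass_overload_manager;
};

// Returns true if traffic on this listener should still be accepted.
// A listener is shed only when the proxy is overloaded and the listener
// has not opted out of overload manager actions.
inline bool acceptsTraffic(const Listener& listener, bool proxy_overloaded) {
  return !proxy_overloaded || listener.bypass_overload_manager;
}

} // namespace sketch
```

For example, with `Listener probe{15021, true}` and `Listener http{8000, false}`, an overloaded proxy keeps answering on the probe listener while shedding the HTTP listener.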
That makes overall sense to me.
From my perspective, this PR can cover 95% of the use cases and is useful as it stands; the remaining cases need more thought. So I wonder if we can just change the API surface:
That way we can extend the config to capture all of those use cases in the future by having a dedicated API config.
In any case, @briansonnenberg, if you end up too busy or life gets in the way and you want to hand this over, I am happy to make a PR on top of yours to help merge this.
I like this, and agree that it's good enough for most use cases as it currently is. Feel free to add to this. I will get around to it eventually, but have different work prioritized currently. |
Hi @davinci26, thanks for the comments! It's nice to have you and @briansonnenberg willing to discuss this and pick it up. Overall, I agree with your points. About IPs: yes, filtering on IP can add overhead, which might not be what we want in overload scenarios.
if that covers your requirements @caoyukun0430 I am tempted to say @jmarantz to your question
The problem that many folks have is when running in k8s you don't want to expose the With this change we allow Envoy to answer health checks and metrics when the So people like me want the OM only on the listener that serves user traffic and not on listeners serving auxiliary traffic. Hope that explains it a bit better.
I started working on the PR this weekend. It is more of a side project, even though I would like to use it at work, so updates are going to be a bit slower. I will try to make a PR tomorrow or Friday. One small thing is that the code has been through some refactor to introduce
We are very much looking forward to this change, as we are rolling out a new CPU utilization monitor that sheds load using overload manager when our nodes are overloaded, and we want to restrict this behavior to certain requests. Adding this flag on the listener helps, but there may be use cases like "allow all traffic from the localhost even when overloaded", and this traffic may land on multiple listeners. So it may be better to do this at the filter chain level? That might be a big change now, so happy to start here, but it is something to keep in mind if this work is still WIP.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Add the ability to bypass overload manager for listeners (envoyproxy#34322)
Commit Message: Add the ability to bypass overload manager for listeners
Additional Description: This flag can be used to disable overload manager on specific listeners where, for instance, we don't want to stop accepting requests. In my company, we implemented a CPU Utilization resource monitor that helps us drop requests when we hit a certain utilization percentage, but there are certain listeners that receive administrative traffic that we don't want overload manager to touch. Another use case is, we want to only throttle ingress traffic but not egress traffic going via Envoy. Another contributor authored #29781, but it has been marked as stale.
Risk Level: Low
Testing: Unit tests & Integration tests added
Docs Changes: No
Release Notes: Add bypass_overload_manager flag to Listener in order to prevent overload manager from taking actions on the traffic going through the said listener.
Platform Specific Features:
Signed-off-by: Fernando Cainelli <fernando.cainelli-external@getyourguide.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
Signed-off-by: Can Cecen <cecen.ycan@gmail.com>
WIP + missing tests and release notes. Opening for discussion.
Pulled the NullOverloadManager instantiation out of admin and moved it into server, so that it can be used by any listener that wants to bypass overload manager, for example a health probe listener.
On a related note, while working on this I tried a "stop_accepting_requests" action in conjunction with "shrink_heap" and it seems once memory crosses the threshold, it never comes back down regardless of the shrink heap action, so it just keeps rejecting requests indefinitely. I'm not really sure what the point of the overload manager with memory thresholds is if the memory never comes back down. This seems true for both gperftools and tcmalloc.