Detect and warn/ratelimit when xDS updates are spinning #2169
I spoke briefly with @htuch about this issue. It looks like we have three options here:
I'm not sure which of these three would be sufficient. Thoughts? |
I think we should do (1) for sure. It's probably legit to do (2) as well, but we should make sure the rate limit only kicks in when we're certain it's not going to be just a well behaved system doing some rapid updates. Maybe start with a PR for (1) and we can iterate? |
Sounds good to me. I agree that it's better to keep it simple for now and add the rest later if necessary. |
@gsagula I don't fully follow the options presented. Do you mind rephrasing/clarifying to make sure we are all on the same page? Thank you! |
Sorry, explaining things is not the best of my abilities. I'm working on that though :) Rephrased:
e.g.,
@mattklein123 Let me know if that makes sense, please. Quick question based on your comment in envoyproxy/data-plane-api#328 Thanks! |
Q: What about modifying approach 2 to rate limit (or outright fail) only consecutive identical updates? Then well-behaved systems won't be punished. Side Q: Hypothetically, what happens if an otherwise well behaved xDS system sends valid (non-duplicate) updates at too high a rate? Would a well-known rate limit help there as well, if not as aggressive as one against spinning updates? |
@jsedgwick I thought about that too. It looks like it is possible to track consecutive identical responses. In that case, wouldn't it be better to go with the 3rd option, since the updates are going to fail anyway? |
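To make the "consecutive identical responses" idea concrete, here is a minimal sketch of how such tracking could look. It is not Envoy code: the class name `DuplicateUpdateTracker`, the `max_consecutive_duplicates` threshold, and the choice of hashing the raw payload with SHA-256 are all assumptions made for illustration.

```python
import hashlib
from typing import Optional


class DuplicateUpdateTracker:
    """Counts consecutive responses whose payload is byte-identical.

    Hypothetical helper for illustration; not part of Envoy.
    """

    def __init__(self, max_consecutive_duplicates: int = 3) -> None:
        # Assumed threshold: how many repeats in a row count as spinning.
        self.max_consecutive_duplicates = max_consecutive_duplicates
        self._last_digest: Optional[bytes] = None
        self._duplicates = 0

    def observe(self, payload: bytes) -> bool:
        """Record one response; return True once the repeat threshold is hit."""
        digest = hashlib.sha256(payload).digest()
        if digest == self._last_digest:
            self._duplicates += 1
        else:
            # A different payload resets the streak, so a well-behaved
            # server sending distinct updates is never flagged.
            self._last_digest = digest
            self._duplicates = 0
        return self._duplicates >= self.max_consecutive_duplicates
```

Because only repeats of the same payload advance the counter, rapid but distinct updates pass through untouched, which is the property asked for above.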
@gsagula thanks for the detailed explanation. Much clearer. I agree with @htuch that we should start with (1). I'm doubtful it's going to help that much, since I think people will likely not look at the logs, but we can see how much it helps. When you implement (1), assuming we log at warn level, we probably need to rate limit the log output to 1 log per unit of time even if we don't rate limit anything else, so I would think about that also. It's probably worthwhile to introduce logging macros that can also rate limit if needed. If we go towards (2) and (3), I think we should definitely do (3) and send an error message to the xDS server. For your other question, I think we should effectively rate limit the xDS requests that we make. This will avoid spinning if the server just returns another response immediately; we could do something like fail a quick response and then stall before sending the next request. Per @jsedgwick, I think we can also consider more intelligent things like checking for duplicates, adding configurable rate limits, etc., but it's probably worth just starting with (1) and seeing how far we get. Maybe it will be enough. |
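The "stall before sending the next request" idea can be sketched as a small pacer that enforces a minimum gap between outgoing requests. This is only an illustration under stated assumptions: `RequestPacer`, `min_interval_s`, and the injectable `clock` parameter are hypothetical names, and nothing here reflects how Envoy actually schedules xDS requests.

```python
import time
from typing import Callable, Optional


class RequestPacer:
    """Enforces a minimum interval between outgoing requests.

    Hypothetical helper for illustration; not part of Envoy.
    """

    def __init__(
        self,
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.min_interval_s = min_interval_s  # assumed, configurable gap
        self._clock = clock  # injectable so tests can control time
        self._last_sent: Optional[float] = None

    def delay_before_next(self) -> float:
        """Seconds the caller should stall before sending the next request."""
        if self._last_sent is None:
            return 0.0  # nothing sent yet, no need to wait
        elapsed = self._clock() - self._last_sent
        return max(0.0, self.min_interval_s - elapsed)

    def mark_sent(self) -> None:
        """Record that a request was just sent."""
        self._last_sent = self._clock()
```

A client loop would call `delay_before_next()`, sleep for that long, send the request, and then call `mark_sent()`. Even if the server answers immediately every time, the request rate stays bounded by `min_interval_s`.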
Same page @mattklein123. Modifying the first solution to (3) later on should be trivial. Thank you for the details. I'll work on the PR. |
This PR implements a mechanism to detect when xDS update requests go over a pre-specified limit.
Issue: #2169
What it does:
- Detects when xDS update requests go over the limit.
- Logs a warning when the limit is exceeded.
- Limits warnings to one log every 5 seconds.
Risk Level: Low
Testing: unit test, manual testing.
Signed-off-by: Gabriel <gsagula@gmail.com>
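As a rough illustration of the behaviour described in this PR summary (detect over-limit updates, warn, and cap warnings at one every 5 seconds), the following sketch uses a sliding window of timestamps. It is not the code from the PR: `UpdateRateMonitor`, `max_updates`, `window_s`, and the logger name are assumptions, and only the 5-second warning interval comes from the description above.

```python
import logging
import time
from collections import deque
from typing import Callable, Deque, Optional

logger = logging.getLogger("xds_monitor")  # hypothetical logger name


class UpdateRateMonitor:
    """Flags update bursts and rate limits the resulting warnings.

    Hypothetical helper for illustration; not the PR's implementation.
    """

    def __init__(
        self,
        max_updates: int,
        window_s: float,
        warn_interval_s: float = 5.0,  # from the PR text: one log per 5 s
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_updates = max_updates  # assumed limit per window
        self.window_s = window_s  # assumed window length
        self.warn_interval_s = warn_interval_s
        self._clock = clock
        self._timestamps: Deque[float] = deque()
        self._last_warn: Optional[float] = None
        self.warnings_emitted = 0

    def record_update(self) -> bool:
        """Record one update; return True if the limit is currently exceeded."""
        now = self._clock()
        self._timestamps.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] > self.window_s:
            self._timestamps.popleft()
        if len(self._timestamps) <= self.max_updates:
            return False
        # Over the limit: warn, but at most once per warn_interval_s.
        if self._last_warn is None or now - self._last_warn >= self.warn_interval_s:
            logger.warning(
                "xDS updates over limit: %d in the last %.1fs",
                len(self._timestamps),
                self.window_s,
            )
            self._last_warn = now
            self.warnings_emitted += 1
        return True
```

Detection and logging are deliberately separate here: `record_update()` keeps reporting the over-limit state on every call, while the log line itself is throttled, so a spinning server cannot flood the logs.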
Fixed in #2783. |
Diff: c96f711...60a13f3 Mostly pulling in for #20527 Signed-off-by: JP Simard <jp@jpsim.com>
As discussed in envoyproxy/data-plane-api#328, spinning updates are a reasonably common issue that management server implementors encounter. While we've improved documentation, an ability to detect, warn and/or ratelimit might also be helpful.