
Detect and warn/ratelimit when xDS updates are spinning #2169

Closed
htuch opened this issue Dec 7, 2017 · 10 comments
Assignees
Labels
api/v2 enhancement Feature requests. Not bugs or questions.

Comments

@htuch
Member

htuch commented Dec 7, 2017

As discussed in envoyproxy/data-plane-api#328, spinning updates are a reasonably common issue that management server implementors encounter. While we've improved documentation, an ability to detect, warn and/or ratelimit might also be helpful.

@gsagula
Member

gsagula commented Feb 23, 2018

I spoke briefly with @htuch about this issue. It looks like we have three options here:

  1. warn -> never fail updates (ENVOY_LOG). Works well for well-behaved systems, but hides the broken behavior of problematic ones.

  2. rate-limit -> fails updates and warns (ENVOY_LOG). Works well against misbehaved systems, but risks penalizing well-behaved ones.

  3. error -> fails updates and sends error details in the DiscoveryRequest.

I'm not sure which of these three would be sufficient. Thoughts?
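A minimal sketch of the detection step common to all three options — counting DiscoveryResponse arrivals in a sliding window and flagging "spinning" when the rate is excessive — might look roughly like this (class and method names are illustrative, not Envoy's actual API):

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>

// Hypothetical sketch: record each xDS update arrival and report when more
// than `max_updates` land inside the sliding `window`.
class UpdateSpinDetector {
public:
  UpdateSpinDetector(size_t max_updates, std::chrono::milliseconds window)
      : max_updates_(max_updates), window_(window) {}

  // Record an update at time `now`; returns true if the rate is excessive.
  bool onUpdate(std::chrono::steady_clock::time_point now) {
    arrivals_.push_back(now);
    // Drop arrivals that have aged out of the window.
    while (!arrivals_.empty() && now - arrivals_.front() > window_) {
      arrivals_.pop_front();
    }
    return arrivals_.size() > max_updates_;
  }

private:
  const size_t max_updates_;
  const std::chrono::milliseconds window_;
  std::deque<std::chrono::steady_clock::time_point> arrivals_;
};
```

Under option (1) a positive result would only trigger a warning; under (2) and (3) it would also fail the update.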

@htuch htuch removed the help wanted Needs help! label Feb 23, 2018
@htuch
Member Author

htuch commented Feb 23, 2018

I think we should do (1) for sure. It's probably legit to do (2) as well, but we should make sure the rate limit only kicks in when we're certain it's not just a well-behaved system doing some rapid updates.

Maybe start with a PR for (1) and we can iterate?

@gsagula
Member

gsagula commented Feb 23, 2018

Sounds good to me. I agree that it's better to keep it simple for now and add the rest later if necessary.

@mattklein123
Member

@gsagula I don't fully follow the options presented. Do you mind rephrasing/clarifying to make sure we are all on the same page? Thank you!

@gsagula
Member

gsagula commented Feb 24, 2018

Sorry, explaining things isn't my strong suit. I'm working on it though :) Rephrased:

  1. warning -> when a high rate of DiscoveryResponses is detected, Envoy logs a warning (ENVOY_LOG) but does not fail the updates. The upside of this approach is that systems doing legitimate rapid updates are never penalized by the rate-limit algorithm. The downside is that it tends to make the broken behaviour of problematic systems less obvious.

  2. rate-limit -> when a high rate of DiscoveryResponses is detected, Envoy not only fails the updates but also logs a warning. The side effect, as opposed to (1), is that well-behaved systems might be penalized for doing rapid updates.

  3. error -> similar to approach (2), except that when Envoy fails the updates, an error is attached to the DiscoveryRequest, which allows the management server to mitigate the problem.

e.g.,

  error_detail->set_code(Grpc::Status::GrpcStatus::RateLimit);
  error_detail->set_message(e.what());

@mattklein123 Let me know if that makes sense. Quick question based on your comment in envoyproxy/data-plane-api#328: should we rate limit discovery requests or discovery responses? Does it matter?

Thanks!

@jsedgwick

@mattklein123 @gsagula

Q: What about modifying approach 2 to rate limit (or outright fail) only consecutive identical updates? Then well-behaved systems won't be punished.

Side Q: Hypothetically, what happens if an otherwise well-behaved xDS system sends valid (non-duplicate) updates at too high a rate? Would a well-known rate limit help there as well, even if less aggressive than one against spinning updates?

@gsagula
Member

gsagula commented Feb 24, 2018

@jsedgwick I thought about that too. It looks like it is possible to track consecutive identical responses. In that case, wouldn't it be better to go with the 3rd option, since updates are going to fail anyway?
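Tracking consecutive identical responses could look roughly like this (a hypothetical helper, not Envoy's implementation — a real version would compare the full payload rather than rely on a hash alone):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Sketch: detect back-to-back identical DiscoveryResponses by hashing the
// serialized payload and comparing it against the previous response's hash.
class DuplicateResponseTracker {
public:
  // Returns true when `serialized_response` matches (by hash) the
  // immediately preceding response.
  bool isConsecutiveDuplicate(const std::string& serialized_response) {
    const size_t h = std::hash<std::string>{}(serialized_response);
    const bool duplicate = has_previous_ && h == previous_hash_;
    has_previous_ = true;
    previous_hash_ = h;
    return duplicate;
  }

private:
  bool has_previous_{false};
  size_t previous_hash_{0};
};
```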

@mattklein123
Member

@gsagula thanks for the detailed explanation. Much clearer. I agree with @htuch that we should start with (1), though I'm doubtful it will help that much since people will likely not look at the logs; we can see how far it gets us. When you implement (1), assuming we log at warn level, we probably need to rate limit the log output to one log per unit of time even if we don't rate limit anything else, so think about that as well. It's probably worthwhile to introduce logging macros that can also rate limit if needed.

If we go towards (2) and (3), I think we should definitely do (3) and send an error message to the xDS server. For your other question, I think we should effectively rate limit the xDS requests that we make. This will avoid spinning if the server just returns another response immediately, but we could do something like fail a quick response, and then stall before sending the next request.

Per @jsedgwick I think we can consider also more intelligent things like checking for duplicates, adding configurable rate limits, etc., but it's probably worth just starting with (1) and seeing how far we get. Maybe it will be enough.
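The "one log per unit of time" throttling mentioned above could be sketched like this (illustrative only; Envoy's actual logging macros work differently):

```cpp
#include <cassert>
#include <chrono>

// Sketch: allow a warning to be emitted at most once per `min_interval`,
// so a spinning update loop produces a steady trickle of logs rather than
// a flood.
class ThrottledWarning {
public:
  explicit ThrottledWarning(std::chrono::seconds min_interval)
      : min_interval_(min_interval) {}

  // Returns true if the caller should actually emit the log line now.
  bool shouldLog(std::chrono::steady_clock::time_point now) {
    if (!has_logged_ || now - last_log_ >= min_interval_) {
      has_logged_ = true;
      last_log_ = now;
      return true;
    }
    return false;
  }

private:
  const std::chrono::seconds min_interval_;
  bool has_logged_{false};
  std::chrono::steady_clock::time_point last_log_;
};
```

The eventual PR for this issue used a 5-second interval between warnings, which this sketch would express as `ThrottledWarning(std::chrono::seconds(5))`.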

@gsagula
Member

gsagula commented Feb 26, 2018

Same page @mattklein123. Modifying the first solution to (3) later on should be trivial. Thank you for the details. I'll work on the PR.

htuch pushed a commit that referenced this issue Apr 15, 2018
This PR implements a mechanism to detect when xDS update requests go over a pre-specified limit.

Issue: #2169

What it does:

Detects when xDS update requests go over the limit.
Issues one warn-level log when the limit is exceeded.
Limits warnings to one log every 5 seconds.
Risk Level: Low

Testing: unit test, manual testing.

Signed-off-by: Gabriel <gsagula@gmail.com>
@htuch
Member Author

htuch commented Apr 16, 2018

Fixed in #2783.

@htuch htuch closed this as completed Apr 16, 2018
jpsim added a commit that referenced this issue Nov 28, 2022
Diff: c96f711...60a13f3

Mostly pulling in for #20527

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit that referenced this issue Nov 29, 2022
Diff: c96f711...60a13f3

Mostly pulling in for #20527

Signed-off-by: JP Simard <jp@jpsim.com>