
Detect and warn/ratelimit when xDS updates are spinning #2169

Closed
htuch opened this issue Dec 7, 2017 · 10 comments
Assignees
Labels
api/v2 enhancement Feature requests. Not bugs or questions.

Comments

@htuch
Member

htuch commented Dec 7, 2017

As discussed in envoyproxy/data-plane-api#328, spinning updates are a reasonably common issue that management server implementors encounter. While we've improved documentation, an ability to detect, warn and/or ratelimit might also be helpful.

@gsagula
Member

gsagula commented Feb 23, 2018

I spoke briefly with @htuch about this issue. It looks like we have three options here:

  1. warn -> never fail updates (ENVOY_LOG). Works well for well-behaved systems, but hides the broken behavior of problematic ones.

  2. rate-limit -> fails updates and warns (ENVOY_LOG). Works well against misbehaved systems, but risks penalizing well-behaved ones.

  3. error -> fails updates and sends error details in the DiscoveryRequest.

I'm not sure which of these three would be sufficient. Thoughts?
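A minimal sketch of the detection step common to all three options — counting DiscoveryResponse arrivals in a sliding window and flagging "spinning" when the rate is excessive — might look roughly like this (class and method names are illustrative, not Envoy's actual API):

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>

// Hypothetical sketch: record each xDS update arrival and report when more
// than `max_updates` land inside the sliding `window`.
class UpdateSpinDetector {
public:
  UpdateSpinDetector(size_t max_updates, std::chrono::milliseconds window)
      : max_updates_(max_updates), window_(window) {}

  // Record an update at time `now`; returns true if the rate is excessive.
  bool onUpdate(std::chrono::steady_clock::time_point now) {
    arrivals_.push_back(now);
    // Drop arrivals that have aged out of the window.
    while (!arrivals_.empty() && now - arrivals_.front() > window_) {
      arrivals_.pop_front();
    }
    return arrivals_.size() > max_updates_;
  }

private:
  const size_t max_updates_;
  const std::chrono::milliseconds window_;
  std::deque<std::chrono::steady_clock::time_point> arrivals_;
};
```

Under option (1) a positive result would only trigger a warning; under (2) and (3) it would also fail the update.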

@htuch htuch removed the help wanted Needs help! label Feb 23, 2018
@htuch
Member Author

htuch commented Feb 23, 2018

I think we should do (1) for sure. It's probably legit to do (2) as well, but we should make sure the rate limit only kicks in when we're certain it's not just a well-behaved system doing some rapid updates.

Maybe start with a PR for (1) and we can iterate?

@gsagula
Member

gsagula commented Feb 23, 2018

Sounds good to me. I agree that it's better to keep it simple for now and add the rest later if necessary.

@mattklein123
Member

@gsagula I don't fully follow the options presented. Do you mind rephrasing/clarifying to make sure we are all on the same page? Thank you!

@gsagula
Member

gsagula commented Feb 24, 2018

Sorry, explaining things isn't my strong suit. I'm working on it though :) Rephrased:

  1. warning -> when a high rate of DiscoveryResponses is detected, Envoy logs a warning (ENVOY_LOG) but does not fail the updates. The upside of this approach is that systems doing legitimate rapid updates are never penalized by the rate-limit algorithm. The downside is that it tends to make the broken behaviour of problematic systems less obvious.

  2. rate-limit -> when a high rate of DiscoveryResponses is detected, Envoy not only fails the updates but also logs a warning. The side effect, as opposed to (1), is that well-behaved systems might be penalized for doing rapid updates.

  3. error -> similar to approach (2), except that when Envoy fails the updates, an error is attached to the DiscoveryRequest, which allows the management server to mitigate the problem.

e.g.,

  error_detail->set_code(Grpc::Status::GrpcStatus::RateLimit);
  error_detail->set_message(e.what());

@mattklein123 Let me know if that makes sense. Quick question based on your comment in envoyproxy/data-plane-api#328: should we rate limit discovery requests or discovery responses? Does it matter?

Thanks!

@jsedgwick

@mattklein123 @gsagula

Q: What about modifying approach 2 to rate limit (or outright fail) only consecutive identical updates? Then well-behaved systems won't be punished.

Side Q: Hypothetically, what happens if an otherwise well-behaved xDS system sends valid (non-duplicate) updates at too high a rate? Would a well-known rate limit help there as well, even if less aggressive than one against spinning updates?

@gsagula
Member

gsagula commented Feb 24, 2018

@jsedgwick I thought about that too. It looks like it is possible to track consecutive identical responses. In that case, wouldn't it be better to go with the 3rd option, since updates are going to fail anyway?
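Tracking consecutive identical responses could look roughly like this (a hypothetical helper, not Envoy's implementation — a real version would compare the full payload rather than rely on a hash alone):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Sketch: detect back-to-back identical DiscoveryResponses by hashing the
// serialized payload and comparing it against the previous response's hash.
class DuplicateResponseTracker {
public:
  // Returns true when `serialized_response` matches (by hash) the
  // immediately preceding response.
  bool isConsecutiveDuplicate(const std::string& serialized_response) {
    const size_t h = std::hash<std::string>{}(serialized_response);
    const bool duplicate = has_previous_ && h == previous_hash_;
    has_previous_ = true;
    previous_hash_ = h;
    return duplicate;
  }

private:
  bool has_previous_{false};
  size_t previous_hash_{0};
};
```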

@mattklein123
Member

@gsagula thanks for the detailed explanation. Much clearer. I agree with @htuch that we should start with (1), though I'm doubtful it will help that much since people will likely not look at the logs; we can see how far it gets us. When you implement (1), assuming we log at warn level, we probably need to rate limit the log output to one log per unit of time even if we don't rate limit anything else, so think about that as well. It's probably worthwhile to introduce logging macros that can also rate limit if needed.

If we go towards (2) and (3), I think we should definitely do (3) and send an error message to the xDS server. For your other question, I think we should effectively rate limit the xDS requests that we make. This will avoid spinning if the server just returns another response immediately, but we could do something like fail a quick response, and then stall before sending the next request.

Per @jsedgwick I think we can consider also more intelligent things like checking for duplicates, adding configurable rate limits, etc., but it's probably worth just starting with (1) and seeing how far we get. Maybe it will be enough.
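The "one log per unit of time" throttling mentioned above could be sketched like this (illustrative only; Envoy's actual logging macros work differently):

```cpp
#include <cassert>
#include <chrono>

// Sketch: allow a warning to be emitted at most once per `min_interval`,
// so a spinning update loop produces a steady trickle of logs rather than
// a flood.
class ThrottledWarning {
public:
  explicit ThrottledWarning(std::chrono::seconds min_interval)
      : min_interval_(min_interval) {}

  // Returns true if the caller should actually emit the log line now.
  bool shouldLog(std::chrono::steady_clock::time_point now) {
    if (!has_logged_ || now - last_log_ >= min_interval_) {
      has_logged_ = true;
      last_log_ = now;
      return true;
    }
    return false;
  }

private:
  const std::chrono::seconds min_interval_;
  bool has_logged_{false};
  std::chrono::steady_clock::time_point last_log_;
};
```

The eventual PR for this issue used a 5-second interval between warnings, which this sketch would express as `ThrottledWarning(std::chrono::seconds(5))`.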

@gsagula
Member

gsagula commented Feb 26, 2018

Same page @mattklein123. Modifying the first solution to (3) later on should be trivial. Thank you for the details. I'll work on the PR.

htuch pushed a commit that referenced this issue Apr 15, 2018
This PR implements a mechanism to detect when xDS update requests go over a pre-specified limit.

Issue: #2169

What it does:

Detects when xDS update requests go over the limit.
Issues one warn-level log when the limit is exceeded.
Limits warnings to one log every 5 seconds.
Risk Level: Low

Testing: unit test, manual testing.

Signed-off-by: Gabriel <gsagula@gmail.com>
@htuch
Member Author

htuch commented Apr 16, 2018

Fixed in #2783.

@htuch htuch closed this as completed Apr 16, 2018
jpsim added a commit that referenced this issue Nov 28, 2022
Diff: c96f711...60a13f3

Mostly pulling in for #20527

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit that referenced this issue Nov 29, 2022
Diff: c96f711...60a13f3

Mostly pulling in for #20527

Signed-off-by: JP Simard <jp@jpsim.com>