Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added rate_limit support to avoid sending too many requests to Consul Agent #1066

Closed
wants to merge 4 commits into from

Conversation

pierresouchay
Copy link

This PR adds support for:

  • waiting at least between 2 downloads when watching a file controlled by rate_limit.min_delay_between_updates
  • add a random delay between fetching data with param rate_limit.random_backoff

This should avoid allow limiting the bandwidth used by consul-template on large clusters with lots of services when watching many services (and X-Consul-Index being updated very frequently)

cf #1065

@pierresouchay
Copy link
Author

Any news on this ?
Since hashicorp/consul#3890 has now been fixed, it should greatly improve performance for large clusters

@pearkes
Copy link
Contributor

pearkes commented Mar 19, 2018

Given consul-template is built on watches, we wonder if the suggested model should be to run it with -once on a schedule or based on an external signal. consul-template was designed to be a thin and concise wrapper around watches, and are concerned that adding complexity between the two will cause issues for other use cases.

Are you relying on de-duplication mode, which is disabled under -once? Or is there any other reason you couldn't use -once and accomplish the same thing as rate limiting?

@pierresouchay
Copy link
Author

@pearkes -once would have a worse behaviour on large clusters for some templates, because each time you would run -once, you would have to re-download everything

If we want to build a large template (for instance, watch all services having tag "http", for each, create a list of endpoint to create a HaProxy / nginx configuration file) => -once is not an option => well it is, but then having the full file rendered would take a lot of time and would not allow to:

  • have babysitting of processes
  • allow to send reload signals

I agree this is not optimal, but it simply renders consul-template not usable for large clusters. On our, when using it, the DL speed between consul-template and its Consul Agent was around 800Mb/sec (and around the same between Consul Agent and Consul Server)

Since hashicorp/consul#3890 has now been merged, the situation is probably far better (except for /catalog/services and friends).

With this implementation https://github.com/criteo/consul-templaterb/ , for the same cluster, we succeeded to have around 5Mbits/sec instead of 800Mbits/sec, after optimization hashicorp/consul#3934 => we are around 2Mbits/sec while watching at all services.

I definitely think we could improve a bit the current implementation.

I'll try with consul-template to see what is the current bandwidth after hashicorp/consul#3934 has been applied

@guidoiaquinti
Copy link

Little bump on this as hashicorp/consul#3934 has been closed

@sodabrew
Copy link

I am also interested in this work, given that wait has been broken since 0.19.1 per my analysis here: #1043 (comment)

Without the wait block, my site's Consul servers were badly pummeled when I tried updating to consul-template 0.19.5 from our previous 0.16 version (I kept the dedup stanza, but removed the wait values -- bad plan).

@pierresouchay
Copy link
Author

@sodabrew be my guest ;)

That's the main reason why I developed https://github.com/criteo/consul-templaterb it reduces the load on servers and bandwidth with orders of magnitude on large clusters while allowing far more features: real language, loops, parametrization, json, XML and even a fully customizable US UI with great performance on very large clusters: https://github.com/criteo/consul-templaterb/tree/master/samples/consul-ui

@eikenb
Copy link
Contributor

eikenb commented Aug 20, 2019

What's the downside to having this enabled?

I ask as given the description and general sentiment makes it sounds like this is a simple win. Why make it configurable? Why not just add a sensible limit all the time? I don't believe in making things configurable just because... there needs to be real, tangible trade-offs that people need to manage for something to justify the complexity of being configurable.

Thanks.

@pierresouchay
Copy link
Author

I would say none. The benefits have been reduced since we made patches in Consul itself, however, it is still a win for pathological cases

@eikenb
Copy link
Contributor

eikenb commented Aug 21, 2019

@sodabrew, @guidoiaquinti, @gmr, and anyone else who cares. Would you agree with @pierresouchay that there is no downside to this? If you agree, I propose we just make this the new behavior and skip the config options.

@gmr
Copy link
Contributor

gmr commented Aug 22, 2019

I'd suggest that changing the behavior as non-optional should be made in major version changes, not minor or revision. Don't get me wrong, I like the new options, but someone might be relying on current behavior and taken by surprise if it were to change in a point release.

@pierresouchay
Copy link
Author

@gmr the default behavior is probably not very wise (and has been changed in Consul in version 1.0.7+ for services )

@gmr
Copy link
Contributor

gmr commented Aug 22, 2019

I guess if it's documented and clear that it's a change in behavior (for the better), then go for it?

@pierresouchay
Copy link
Author

Ark... now conflicting... :(

@eikenb
Copy link
Contributor

eikenb commented Aug 28, 2019

I'm still leaning towards just making this the default behavior and the fact that this has conflicts just reinforces that as it highlights just how much code this has to touch to make it configurable. If not configurable it looks like it'd only mean changing 1 file (view.go).

What I'm thinking basically boils down to taking the logic from RateLimitFunc, put it a function in view.go and call it in the same place as in the patch. I'd guess with the current default values in place (set as vars to allow changing it for tests).

The current default values are:

  • rate_limit.min_delay_between_updates: 100 milliseconds
  • rate_limit.random_backoff: 33 milliseconds

Are these good default values?

@eikenb
Copy link
Contributor

eikenb commented Aug 28, 2019

Assuming we go the non-configurable route. @pierresouchay, is that something you'd want to implement (maybe as a separate PR for comparison) or something I'd just take care of?

@pierresouchay
Copy link
Author

I would be glad to implement this, but won't have the time to in next 5 weeks, so if you need it sooner, feel free to implement.

Regards

@eikenb
Copy link
Contributor

eikenb commented Aug 30, 2019

I have a simplified version of this implemented that only has the little bit of code that actually does the rate limiting in watch/view.go. The current implementation has the base rate limiting value (MinDelayBetweenUpdates in the patch) as a module level variable so it can be 0'd out for tests as needed.

I was looking over the use/implementation in consul-templateb and it really looks like there might be some variety of values that are useful for the delay... so I was re-thinking the possibility of a simple command-line only option to set just the delay. This would let you disable it (by setting it to 0) as well as setting it more based on need. I'd keep it on by default, with the default value as is (100ms). Having it cli only would keep the changes down as well.

Any thoughts on this?

@pierresouchay
Copy link
Author

@eikenb At the time consul-templaterb was written, the index (especially on services, prior to version 1.0.6 IIRC) was not well behaving (so watches were always waken up when any change in raft did occur). In consul-templaterb, there is an automatic penalty for endpoints being notified too often (see https://github.com/criteo/consul-templaterb/blob/master/lib/consul/async/consul_endpoint.rb#L237 ), meaning that endpoint not changing will have a bigger delay than endpoints really being modified. One of our last PR to improve things (look at most or my PR and those from @aestek) is this one: hashicorp/consul#4810 ... IIRC, this one is very helpful (starting with Consul 1.4.1) since if you watch a service being non-existent, it will avoid you to hammer the cluster for changes on a large cluster, hence all of the optimizations pushed in consul-templaterb.

Most of those optimizations are not really necessary anymore with recent versions since we worked a lot on reducing updates on anti-entropy changes (we have at least 5 PR merged regarding this).

Furthermore, all endpoints in of consul in consul-templaterb are handled the same way, including when there is no watch (eg: https://github.com/criteo/consul-templaterb/blob/master/bin/consul-templaterb#L69 ), i used the same mechanism to configure refresh on endpoint changing value all the time (metrics, /agent/self...) and for which I don't want being notified too often (eg: datacenters that sorts according RTT beween DCs)

This allowed us at the time, on large clusters, to reduce bandwidth from 800Mb/s to 8kb/s.

This is probably not that important today and you probably don't need that kind of granularity anymore since many optimization have been bought to Consul itself.

In the real life, most templating systems don't need to refresh data too often anyway (I mean 640ms is probably enough for anybody 😄 ), so having a safeguard to avoid refreshing every ms the cluster is a good thing and will avoid to DDOS the cluster.

In any case, I think this is a very good thing to integrate this finally in consul-template, thank you :-)

@eikenb
Copy link
Contributor

eikenb commented Aug 31, 2019

@pierresouchay .. Thanks so much for the information dump. Really helped me understand the background of this and why it probably doesn't need to be configurable at all. So that's what I'll start with.

I'll push my version of it up as a new PR soon.

eikenb added a commit that referenced this pull request Sep 9, 2019
Rate limit all API calls to Consul to ~1/100ms to keep consul-template
from flooding consul-agent with requests. This takes the idea from #1066
and simplifies it by making it work this way by default instead of
making it a configurable option.

Closes #1066
eikenb added a commit that referenced this pull request Sep 10, 2019
Rate limit all API calls to Consul to ~1/100ms to keep consul-template
from flooding consul-agent with requests. This takes the idea from #1066
and simplifies it by making it work this way by default instead of
making it a configurable option.

Closes #1066
@eikenb eikenb closed this in #1279 Sep 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants