Added rate_limit support to avoid sending too many requests to Consul Agent #1066
Conversation
Rate limit support: hashicorp#1066
Any news on this?
Given consul-template is built on watches, we wonder if the suggested model should be to run it with -once. Are you relying on de-duplication mode, which is disabled under -once?
@pearkes -once would behave worse on large clusters for some templates, because each time you run -once, you have to re-download everything. If we want to build a large template (for instance, watch all services having the tag "http" and, for each, create a list of endpoints to generate a HAProxy / nginx configuration file), then -once is not an option. Well, it is, but rendering the full file would take a lot of time and would not allow to:
I agree this is not optimal, but it simply renders consul-template unusable for large clusters. On ours, when using it, the download rate between consul-template and its Consul agent was around 800 Mbit/s (and around the same between the Consul agent and the Consul servers). Since hashicorp/consul#3890 has now been merged, the situation is probably far better (except for /catalog/services and friends). With this implementation, https://github.com/criteo/consul-templaterb/ , for the same cluster, we managed to get around 5 Mbit/s instead of 800 Mbit/s; after the optimization in hashicorp/consul#3934, we are around 2 Mbit/s while watching all services. I definitely think we could improve the current implementation a bit. I'll try with consul-template to see what the current bandwidth is after hashicorp/consul#3934 has been applied
Little bump on this, as hashicorp/consul#3934 has been closed
I am also interested in this work, given that ... Without the ...
@sodabrew be my guest ;) That's the main reason why I developed https://github.com/criteo/consul-templaterb : it reduces the load on servers and bandwidth by orders of magnitude on large clusters while allowing far more features: a real language, loops, parametrization, JSON, XML, and even a fully customizable UI with great performance on very large clusters: https://github.com/criteo/consul-templaterb/tree/master/samples/consul-ui
What's the downside to having this enabled? I ask because the description and general sentiment make it sound like this is a simple win. Why make it configurable? Why not just add a sensible limit all the time? I don't believe in making things configurable just because... there need to be real, tangible trade-offs that people need to manage for something to justify the complexity of being configurable. Thanks.
I would say none. The benefits have been reduced since we made patches in Consul itself; however, it is still a win for pathological cases
@sodabrew, @guidoiaquinti, @gmr, and anyone else who cares. Would you agree with @pierresouchay that there is no downside to this? If you agree, I propose we just make this the new behavior and skip the config options.
I'd suggest that making this behavior change non-optional should happen in a major version change, not a minor or revision release. Don't get me wrong, I like the new options, but someone might be relying on the current behavior and be taken by surprise if it were to change in a point release.
@gmr the default behavior is probably not very wise (and has been changed in Consul in version 1.0.7+ for services)
I guess if it's documented and clear that it's a change in behavior (for the better), then go for it?
Ark... now conflicting... :(
I'm still leaning towards just making this the default behavior, and the fact that this has conflicts just reinforces that, as it highlights how much code this has to touch to make it configurable. If not configurable, it looks like it'd only mean changing one file (view.go). What I'm thinking basically boils down to taking the logic from RateLimitFunc, putting it in a function in view.go, and calling it in the same place as in the patch. I'd guess with the current default values in place (set as vars to allow changing them for tests). The current default values are:
Are these good default values?
Assuming we go the non-configurable route: @pierresouchay, is that something you'd want to implement (maybe as a separate PR for comparison), or something I'd just take care of?
I would be glad to implement this, but won't have the time in the next 5 weeks, so if you need it sooner, feel free to implement it. Regards
I have a simplified version of this implemented that only has the little bit of code that actually does the rate limiting in watch/view.go. The current implementation has the base rate limiting value (MinDelayBetweenUpdates in the patch) as a module-level variable so it can be zeroed out for tests as needed.

I was looking over the use/implementation in consul-templaterb, and it really looks like there might be some variety of values that are useful for the delay... so I was re-thinking the possibility of a simple command-line-only option to set just the delay. This would let you disable it (by setting it to 0) as well as set it based on need. I'd keep it on by default, with the default value as is (100ms). Having it CLI-only would keep the changes down as well. Any thoughts on this?
@eikenb At the time consul-templaterb was written, the index (especially on services, prior to version 1.0.6 IIRC) was not well behaved, so watches were woken up whenever any change in Raft occurred. In consul-templaterb, there is an automatic penalty for endpoints being notified too often (see https://github.com/criteo/consul-templaterb/blob/master/lib/consul/async/consul_endpoint.rb#L237 ), meaning that endpoints that do not change get a bigger delay than endpoints really being modified.

One of our last PRs to improve things (look at most of my PRs and those from @aestek) is this one: hashicorp/consul#4810 ... IIRC, this one is very helpful (starting with Consul 1.4.1), since if you watch a non-existent service, it avoids hammering the cluster for changes on a large cluster; hence all of the optimizations pushed into consul-templaterb. Most of those optimizations are not really necessary anymore with recent versions, since we worked a lot on reducing updates caused by anti-entropy changes (we have at least 5 PRs merged regarding this).

Furthermore, all Consul endpoints in consul-templaterb are handled the same way, including when there is no watch (e.g. https://github.com/criteo/consul-templaterb/blob/master/bin/consul-templaterb#L69 ); I used the same mechanism to configure the refresh of endpoints whose value changes all the time (metrics, /agent/self...) and for which I don't want to be notified too often (e.g. datacenters, which sort according to RTT between DCs).

This allowed us at the time, on large clusters, to reduce bandwidth from 800 Mbit/s to 8 kbit/s. This is probably not that important today, and you probably don't need that kind of granularity anymore, since many optimizations have been brought to Consul itself.
In real life, most templating systems don't need to refresh data very often anyway (I mean, 640ms is probably enough for anybody 😄), so having a safeguard against refreshing the cluster every millisecond is a good thing and will avoid DDoSing the cluster. In any case, I think it is a very good thing to finally integrate this into consul-template, thank you :-)
@pierresouchay .. Thanks so much for the information dump. It really helped me understand the background of this and why it probably doesn't need to be configurable at all. So that's what I'll start with. I'll push my version of it up as a new PR soon.
This PR adds support for:
rate_limit.min_delay_between_updates
rate_limit.random_backoff
This should allow limiting the bandwidth used by consul-template on large clusters with lots of services when watching many services (and X-Consul-Index being updated very frequently)
cf #1065
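For illustration, a configuration using the two option names listed in this PR description might look like the sketch below. The stanza placement, value types, and the 500ms figure are assumptions for illustration; only the option names come from this PR.

```hcl
# Hypothetical config sketch; exact syntax may differ in the final implementation.
rate_limit {
  # Minimum time to wait between two updates sent to the renderer
  # for the same watched endpoint.
  min_delay_between_updates = "100ms"

  # Additional random delay added on top of the minimum, to spread
  # many consul-template instances apart.
  random_backoff = "500ms"
}
```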