Allow setting to change default consistency value for HTTP requests #3911
Just for the record, here is the effect of setting the default consistency value to stale on one of our Consul clusters: the 3 top lines are the "user" CPU time of our 3 Consul servers; the 9 bottom lines around 0 are the "system", softirq, and wait CPU times on those same servers. All the user CPU time is consumed by the consul server process (nothing else significant is running on the 3 Consul servers). The inflection point corresponds to the time when we applied the patch (to disable consistent reads by default).
After the patch:
Note that most of the excessive load was due to one single app (Prometheus) performing default queries. While we suspected this issue quickly, it took us weeks to validate it because we had no easy way to confirm our hypothesis was the right one. Prometheus will eventually get patched (see prometheus/prometheus#3814), and in the meantime we are running our own fork of it, but having an option for administrators to protect the Consul leader by default would be a great step: it would have allowed us to quickly mitigate this issue without recompiling and running a forked version.
Thanks for sharing all these details. Like you say, it's really a problem for the client to solve: there is already a way this can work fine, and any Consul integration should probably offer flexibility around staleness unless that absolutely breaks the guarantees required. I can maybe see it as a pragmatic step where there are lots of bad clients you don't directly control. Even then, a per-agent setting is a potential foot-gun: it masks behaviour for some services but not others, it might be forgotten, and then the wrong conclusions about Consul's underlying safety guarantees might be drawn. In some ways it would be cleaner to fix cases like this outside of the agent, which may also be serving other clients. You could for instance set up a proxy that rewrites the query string with a trivial nginx config or something. Harder than a config param, I agree, but more explicit and less room for confusion IMO, especially if it's only a stop-gap until the clients can be patched. What do you think? I can see some value, but I'm not sure it warrants the complex changes to the consistency API you described.
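For illustration, a minimal sketch of that proxy idea, assuming the real agent's HTTP port has been moved to 8501 so nginx can take over 8500; the port and layout are assumptions, not a tested recipe:

```nginx
# Sketch: force every request onto the stale path by rewriting the query
# string before proxying to the real Consul agent.
# "map" (which lives at the http{} level) builds the outgoing query string:
# just "stale" when the client sent none, otherwise the original arguments
# with "&stale" appended.
map $args $stale_args {
    ""      "stale";
    default "${args}&stale";
}

server {
    listen 8500;                           # where clients expect the agent

    location / {
        set $args $stale_args;             # overwrite the query string
        proxy_pass http://127.0.0.1:8501;  # the relocated Consul agent
    }
}
```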
Hello Paul @banks,

This can definitely be achieved through an indirection, i.e. nginx or another proxy, but it is not very practical: we have more than 5k nodes in each DC, some running Windows, some Unix, so we would have to run a proxy compatible with each operating system, move Consul off port 8500, run the proxy on port 8500, and forward requests to the Consul agent. While doable, it is a bit challenging.

Most people here are using the local agent for discovery. Some are using Python, Ruby, C#, Java; it is of course possible to ensure all those libs are good citizens, but sometimes a user is just performing a simple call on the agent. I wrote a small Consul API client a while ago and made the same mistake: even for a developer now well familiar with Consul, forgetting to add the stale argument is easy.

While DNS allows specifying stale reads by default (https://www.consul.io/docs/agent/options.html#allow_stale), this kind of use case is very frequent for us (client-side load balancing...), and having a way to control it on the HTTP side would be a great plus. Maybe my proposal is not enough and we could envision setting the default value per path, for instance /health/* and /catalog/* -> default := stale while keeping /kv/* -> leader, or other combinations (see the sketch after this comment). For instance, every Consul agent listing datacenters using /catalog/datacenters with a consistent read looks like a waste of resources (since most of us do not add datacenters every day). It probably does not matter for small clusters, but having control over such default behaviour would be really helpful for large-scale Consul deployments.

We had a really hard time with this issue, and are still running forks of both Consul and Prometheus in order to keep our services running properly, but I am pretty sure we are not the only ones running both Prometheus and Consul at large scale. Whether HashiCorp one day makes stale the default or not is a detail; what matters is introducing a parameter to switch back to the current default (what I propose as default_consistency_mode).

Kind Regards
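Purely illustrative of the per-path idea above; no such option exists in Consul, and every key name here is hypothetical:

```json
{
  "default_consistency_by_path": {
    "/v1/health/": "stale",
    "/v1/catalog/": "stale",
    "/v1/kv/": "leader"
  }
}
```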
Thanks for the info. I can see the benefits in your case. It's not very clear to me how many users are in a similar position, with both extreme scale and relatively little control of clients, but it's a valid concern to want to improve on somehow. Good point that we effectively do the same thing with "allow_stale" on DNS, although in general DNS is often used with TTLs and so maybe carries different user expectations about strong consistency. Speaking personally, I wonder what the most minimal change we could make would be that would satisfy your case? Preferably without deprecating the consistency API, renaming things, etc. I understand that it would make it impossible to get the current default behaviour again, but that might not be such a big issue; I see it as a somewhat last-resort setting that most people won't touch and that will come with a fair handful of gotchas however we implement it. Not changing the external API would make it much easier to get merged IMO, as it's just another config flag - we can document the tradeoffs and let operators decide. Would that help you?
Also, linking for others following the thread: we already have a coarse-grained per-agent rate limiter that can be used to prevent specific agents from issuing too many queries: https://www.consul.io/docs/agent/options.html#limits I realise that's a different thing to changing the default consistency, but it's related. Fundamentally we can't prevent badly behaved clients from DDoSing servers though - if you run untrusted clients, that's something you'll need to factor in and maybe solve with your own abstraction on top of Consul that does per-customer rate limiting or similar. Ultimately I think a simple config flag with no other changes, documented as "use at your own risk", is maybe something the team would consider, but more invasive changes seem hard to justify: it seems kind of edge-casey to us, and actively malicious clients can always contravene the default setting anyway.
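For reference, that per-agent limiter is set in the agent configuration; a minimal sketch under the assumption the keys match the linked docs for your version (the values here are arbitrary):

```json
{
  "limits": {
    "rpc_rate": 100,
    "rpc_max_burst": 100
  }
}
```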
Thanks @banks and @pierresouchay for this conversation, it is very pleasant to follow. My 2 cents:
In any medium-sized company like ours (~400 R&D people), the control that can be exerted on clients is very limited. The variety of languages and the newcomers onboarded every day make every effort to document or control clients daunting. Stability of the clusters falls onto the shoulders of both clients (not making too many requests) and service owners (us, protecting all clients). Consul has limited isolation features between clients; per-agent limits are a very good step in this direction but not enough for complete control, since all requests are not created equal (stale / non-stale for instance) and an agent can be shared between several clients running on the same machine (in a Mesos or k8s cluster for instance). The feature suggested by @pierresouchay makes sense to me to help service owners exert better control over the guarantees provided to users (the default consistency of queries).
Following the conversation in hashicorp#3911, this patch allows changing the default behavior of Consul regarding consistency
Implemented in #4004
For several months, our Consul server leaders have been suffering very hard (100% CPU continuously). For context, we have 7 connected Consul clusters with up to 5k nodes per cluster; globally, we have around 1.3k different kinds of services in each datacenter. Each Consul cluster has 3 servers (we also tried running 5 Consul servers per cluster without any change in the load of the leader), and only the leader was affected by the issue (CPU usage was 2x or 3x higher on the leaders than on the followers).
This high CPU usage also caused lots of latency issues (registering services took up to 4 seconds) and even distributed outages: when the Consul leader holding all the ACLs was too heavily loaded, other datacenters could not fetch their ACLs, which caused outages in remote datacenters. High CPU usage could also cause flapping between leaders (up to several elections every minute).
We finally figured out what was happening: we had installed several Prometheus instances in all our datacenters, all using Consul discovery to discover services both in their local DC and in other DCs. Each Prometheus instance was performing lots of non-stale discovery requests in all datacenters. Thus all our Prometheus instances were sending requests only to the leader of each datacenter, effectively DDoSing our clusters.
One of my colleagues opened a PR on Prometheus to fix this (prometheus/prometheus#3814 - the patch basically avoids listing all services all the time by filtering on tags when retrieving the list of services, and adds the &stale argument to each service query), and we are also trying to improve things on our side (#3899) regarding the heavy usage of watches on lots of services.
But since both Consul and Prometheus are very important components of our infrastructure and we could not afford to lose either of them, we had to build our own version of Consul with an in-house patch that completely disables non-stale requests on the agents running Prometheus.
Deploying our patched Consul version (with non-stale reads disabled) had an immediate effect: our cluster is now healthy, and registrations and reads take a few milliseconds instead of a few seconds. But it highlights a fair point: many library writers are not aware of, or do not care about, adding the &stale argument when performing requests to their agent, and this can very quickly kill Consul performance. We understand that changing the semantics of Consul regarding non-stale reads is an issue since it breaks the API contract, but it would be nice to add the following agent option: "default_consistency_mode": leader|stale|consistent. The default value, if not set, would be "leader" to ensure compatibility. I am OK to create a PR to implement this, what do you think?
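To make the distinction concrete, here is what the two kinds of reads look like against a local agent (the service name is illustrative):

```sh
# Default read: forwarded to the cluster leader (strongly consistent).
curl http://127.0.0.1:8500/v1/health/service/web

# Stale read: any server may answer, taking load off the leader.
curl 'http://127.0.0.1:8500/v1/health/service/web?stale'
```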
It would really help fix issues like the one we had, and would allow administrators to protect their clusters and specify a good policy regarding reads.
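A sketch of what the proposed agent configuration could look like; note this is the proposal from this issue, not a shipped option (the setting that was eventually implemented may differ, see #4004):

```json
{
  "default_consistency_mode": "stale"
}
```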