Simple rate limiting for agent rpc calls. #3140
Closed
Hi,
This PR is related to an OOM outage that we encountered in production.
Details can be found here: https://groups.google.com/forum/#!topic/consul-tool/YYnhFeC1qi4
After some discussion, we got feedback from @slackpad that you are interested in some basic form of rate limiting for Consul agents to prevent abusive clients from causing the cluster to become unstable.
In this PR, only agents running in client mode are limited. For simplicity, rate limiting covers all RPC calls, regardless of whether they come from DNS, HTTP, or sync.
Rate limiting is based on https://godoc.org/golang.org/x/time/rate.
Rate limiting is configured by RPCRate and RPCMaxBurst. We recommend setting these values no lower than the expected number of services registered locally on a single Consul agent. By default, RPCRate is set to infinite and RPCMaxBurst is ignored.
There are also two new metrics:
consul.client.rpc.rate
: rate of RPC calls before rate limiting; can be used with a threshold to trigger monitoring events when an agent is getting close to its limit
consul.client.rpc.exceeded.rate
: how often the client is rate limited; a good candidate for triggering monitoring events when an agent is actually being rate limited
Sample configuration:
{"rpc_rate": 100, "rpc_max_burst": 100}