Return 429 response on HTTP max connection limit #13621
Instead of silently closing the connection, return a `429 Too Many Requests` HTTP response with a helpful error message to aid debugging when the connection limit is unintentionally reached.
I found that Consul returns a 429 response on max-connection limit (hashicorp/consul#8221) and uses a `10 * time.Millisecond` write deadline when sending the response.
Hi @wjordan! This PR effectively reverts a PR made by my colleague @shoenig in f0dff3f, where we weren't happy with how the default 429 handler interacted with our handshake.
I'm working on rate limiting in #13480 which will send a 429 code as you'd expect. I can make sure we're introducing that where we can for connection tracking as well.
I'm going to close this PR, but thanks so much for contributing!
Thanks for pointing me to the previous discussion. The concern in #9608 (review) seems to be that writing 429 responses to the connection could block and tie up server resources. It seems the DoS concern is mostly theoretical (I was unable to reproduce any DoS in my own local testing), and I think it's based on some unclear assumptions about what might cause a write to block.

Beyond this, rejecting this PR in Nomad raises an issue about Consul's current behavior: if this behavior was considered a DoS issue in Nomad, does that mean it's currently a DoS issue for Consul? At the very least I'd like to see the various HashiCorp products agree on some consistent behavior across them, whatever it may be.
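For illustration, a minimal sketch of a deadline-bounded 429 write (the `reject` helper and the `net.Pipe` demo are illustrative, not this PR's actual code):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// reject writes a bare-bones 429 and closes the connection. The write
// deadline bounds how long a stalled client can tie up the server, which
// is why a blocking send is unlikely to be a usable DoS vector.
func reject(conn net.Conn) {
	conn.SetWriteDeadline(time.Now().Add(10 * time.Millisecond))
	conn.Write([]byte("HTTP/1.1 429 Too Many Requests\r\nConnection: close\r\n\r\n"))
	conn.Close()
}

func main() {
	// net.Pipe gives us a worst-case peer that never reads.
	server, _ := net.Pipe()
	start := time.Now()
	reject(server)
	fmt.Println("stalled client rejected in", time.Since(start)) // ~10ms
}
```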
Hey @wjordan, I suspect you're right that blocking sends aren't an issue, but I'm also not sure that's actually the problem. If we simply drop the connection we don't need to do the TLS handshake (but we also can't send the 429). But if we accept the connection and then immediately close it after sending the 429, we've already done an awful lot of work for the TLS handshake, and that's the primary thing the connection rate limiting is intended to protect against. Once you've successfully connected it doesn't really matter how many connections you have; you can DoS the cluster by sending a ton of work to it (which is what #13480 is for).

I'm going to ping my colleagues @shoenig and @schmichael here, as they both worked on the original implementation and the PR this PR reverts. They might have more thoughts.
Agreed.
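To make the handshake tradeoff concrete, a rough sketch (with `overLimit` as a hypothetical stand-in for per-IP connection tracking):

```go
package main

import (
	"net"
	"net/http"
)

// overLimit is a hypothetical stand-in for per-IP connection tracking.
func overLimit(conn net.Conn) bool { return false }

func main() {
	srv := &http.Server{
		Addr: ":4646",
		// StateNew fires right after accept, before the TLS handshake or
		// any request parsing, so closing here is cheap -- but the client
		// only ever sees a reset. Sending a 429 instead requires finishing
		// the handshake first, the very work the limit tries to bound.
		ConnState: func(conn net.Conn, state http.ConnState) {
			if state == http.StateNew && overLimit(conn) {
				conn.Close()
			}
		},
	}
	srv.ListenAndServe()
}
```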
The rate limit prevents the TLS handshakes necessary to write the HTTP responses from consuming too many server resources. Adds a "Too many concurrent connections" log warning when the connection limit is reached.
Count of HTTP connections exceeding concurrency limit.
Please see wjordan/nomad#http_max_conn_limit_429 for followup commits (3372c07, 9768e42) that should address the concerns raised, by rate-limiting the 429 responses and adding a (not rate-limited) counter for tracking the total number of limit-exceeding connection attempts. Please let me know if this is headed in a helpful direction and I will open a new PR.
This PR

tl;dr -> 👍

Hi @wjordan! Thanks for sticking with this! Security-sensitive issues are always tricky, and obviously we've gone back and forth on it ourselves. In the absence of unique factors, Nomad follows Consul's lead on such matters, so I think we can accept this PR and match their behavior. It increases risk somewhat, but the current behavior causes significant pain for legitimate users and is very difficult to debug. I'm willing to take the risk for the sake of the common case.

Followups

As far as the followup commits are concerned, I think the metric would be great. 👍 I like your approach to logging as it limits the rate to 10/s. AFAICT from reading https://pkg.go.dev/golang.org/x/time/rate#NewLimiter, the 100 burst limit will not take effect because we only call `Allow`. I think we can just set the burst to 1.
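To make the limiter semantics concrete, a small sketch of the `x/time/rate` pattern under discussion (the 10/s and 100 values come from the thread; everything else is illustrative):

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// Refills at 10 tokens/sec with a bucket capacity ("burst") of 100.
	// Each Allow call consumes one token, so a full bucket lets up to
	// ~100 events through back-to-back before the 10/s refill rate
	// becomes the effective limit.
	limiter := rate.NewLimiter(rate.Limit(10), 100)

	allowed := 0
	for i := 0; i < 200; i++ {
		if limiter.Allow() {
			allowed++ // here: write the 429 and/or log the warning
		}
	}
	fmt.Println("allowed immediately:", allowed) // ~100 from a full bucket
}
```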
Thanks @schmichael!
@wjordan I've kicked CI to re-run the test flakes, but it looked like lint-go was still failing on `goimports -d`. I've run that and added a changelog entry. Once CI is green again we'll get this merged. Thanks for the PR and being patient with us figuring this out! 😀
This has been merged and will ship in the next version of Nomad!
🎉 Thanks for seeing this through! I understand the extra scrutiny required for security-sensitive issues and appreciate the extra attention. Note that I had pushed the followup commits (adding the rate-limiting and the logging/metric) to this PR, so they were included in the merge commit. It sounded like @schmichael had requested some slight changes (or at least some further discussion) on the followups, so I just wanted to make sure the full merge was intended.
That would make sense for the log entries, but a rate-limit too low could make 429 responses inconsistent if multiple processes are trying to establish connections. Maybe adding a separate 1/s rate-limit for log entries would make sense, even if it adds a bit of extra complexity?
Oh, sorry, I should have addressed those specifically and forgot to:
Arguably having the log line there lets someone spam logs, but in typical production environments where you've got journald going, it'll drop excessive logging anyways. We can always come back and add a rate limit here if we feel more strongly about it later.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
The `limits.http_max_conns_per_client` configuration option sets a limit on the maximum number of concurrent TCP connections per client IP address (see `nomad/website/content/docs/configuration/index.mdx`, lines 197 to 200 at 0b3ba5c).
The current implementation silently closes the TCP connection, resulting in opaque 'unexpected EOF' or 'connection reset' errors that many users fail to correctly diagnose; this is likely the root cause of various reported Nomad issues.
Instead of silently closing the connection, this PR returns a `429 Too Many Requests` HTTP response with a helpful error message to aid debugging when the connection limit is unintentionally reached. This should help diagnose/fix some instances of #12273 and #8718, where the fix is simply to increase the connection limit from the default of 100.

The default of 100 is easily hit when doing anything with blocking queries, since each blocking query requires a separate HTTP client connection. One example from my use case: the Nomad Autoscaler monitors each job's status using a blocking query, which means that any autoscaler agent monitoring more than 100 jobs will exceed this default limit in normal operation, and the limit will need to be increased.
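As an illustration of how quickly blocking queries consume this budget, a sketch using the Nomad API client (the job ID and wait parameters are placeholders):

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}

	// Each long-poll like this holds one HTTP connection open for up to
	// WaitTime, so an agent watching more than 100 jobs from a single IP
	// exceeds the default http_max_conns_per_client in normal operation.
	q := &api.QueryOptions{WaitIndex: 1, WaitTime: 5 * time.Minute}
	job, meta, err := client.Jobs().Info("example", q)
	if err != nil {
		panic(err)
	}
	fmt.Println(*job.ID, "last index:", meta.LastIndex)
}
```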
One tradeoff in this PR is that the connection is kept open slightly longer while the agent writes out the 429 response. The `HTTPConnStateFuncWithDefault429Handler` function accepts a write-deadline argument, which I've set to ~~`0` (no deadline)~~ (updated: `10 * time.Millisecond`, to match existing behavior in Consul) in this PR; this could alternatively be made configurable through an additional config option, if desired.
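A minimal sketch of what this wiring looks like with `hashicorp/go-connlimit` (assuming the API referenced above; the values are illustrative):

```go
package main

import (
	"net/http"
	"time"

	connlimit "github.com/hashicorp/go-connlimit"
)

func main() {
	limiter := connlimit.NewLimiter(connlimit.Config{
		MaxConnsPerClientIP: 100, // Nomad's default http_max_conns_per_client
	})

	srv := &http.Server{
		Addr: ":4646",
		// Instead of silently closing over-limit connections, write a 429
		// with a 10ms write deadline, matching Consul's behavior.
		ConnState: limiter.HTTPConnStateFuncWithDefault429Handler(10 * time.Millisecond),
	}
	srv.ListenAndServe()
}
```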