Rate limit the crate publish endpoint #1596
Conversation
This adds a restriction to the number of crates that can be published from a single IP address in a single time period. This is done per IP instead of per user, since that's what we can do with nginx alone. We may want to do additional rate limiting per token to force spammers to register additional accounts, and not just change their IP.

This will use what's called the "leaky bucket" strategy for rate limiting. Basically every user has a bucket of tokens they use to publish crates. They start with 10 tokens available. Every time they hit this endpoint, they use one of the tokens. We give them a new token each minute. What this means is that you can upload 1 crate per minute, but we allow you to upload up to 10 in a short period of time before we start enforcing the rate limit.

When someone does hit the rate limit, they will receive a 429 response (which is TOO MANY REQUESTS, though since it's a newer status code I'm not sure if nginx actually knows that's the text description). We could also allow it to instead just slow down the requests, refusing to process them until a token is available (queueing a max of 10 requests).

This reserves 10 megabytes of memory for the IP table, which means we can hold about 80,000 IPs. When the table is full, it will attempt to drop the oldest record, and if that doesn't give enough space, it'll give a 503. Keep in mind this is all in memory, not shared between our servers. This means that it is possible (but not guaranteed) that someone can upload 20 crates, and then send 2 requests per minute.

On Heroku and any system that involves proxies, the remote_addr of the request becomes useless, as it will be the proxy forwarding your request rather than the client itself. Most proxies will set or append an `X-Forwarded-For` header, putting whatever remote_addr they received at the end of the header.
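The token-bucket scheme described above can be sketched in a few lines of Rust. This is an illustrative model, not the actual crates.io or nginx implementation; the `Bucket` type and its methods are hypothetical names.

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch of the "leaky bucket" scheme described above:
// start with 10 tokens, spend one per publish, refill one per minute.
struct Bucket {
    tokens: f64,
    capacity: f64,
    refill_every: Duration,
    last_refill: Instant,
}

impl Bucket {
    fn new(capacity: f64, refill_every: Duration, now: Instant) -> Self {
        Bucket { tokens: capacity, capacity, refill_every, last_refill: now }
    }

    // Returns true if the request is allowed, false if it should get a 429.
    // Simplified: fractional refill progress is discarded on each refill.
    fn try_take(&mut self, now: Instant) -> bool {
        let elapsed = now.duration_since(self.last_refill);
        let new_tokens = elapsed.as_secs_f64() / self.refill_every.as_secs_f64();
        if new_tokens >= 1.0 {
            self.tokens = (self.tokens + new_tokens.floor()).min(self.capacity);
            self.last_refill = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let start = Instant::now();
    let mut bucket = Bucket::new(10.0, Duration::from_secs(60), start);
    // The first 10 requests in a burst all succeed...
    for _ in 0..10 {
        assert!(bucket.try_take(start));
    }
    // ...the 11th is rejected (this is where the 429 comes from)...
    assert!(!bucket.try_take(start));
    // ...but one minute later a token has been refilled.
    assert!(bucket.try_take(start + Duration::from_secs(60)));
    println!("ok");
}
```

This is why a burst of 10 uploads works but the 11th within the same minute does not.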
So if there are 3 proxies, the header will look like `real_ip, proxy1, proxy2`. However, if we're not careful this makes us vulnerable to IP spoofing. A lot of implementations just look at the first value in the header. All the proxies just append to the header though (they don't know if they're getting traffic from the origin or a proxy), so you end up with `spoofed_ip, real_ip, proxy1, proxy2`.

nginx, Rails, and many other implementations avoid this by requiring an allowlist of trusted IPs. With this configuration, the realip module only kicks in if remote_addr is a trusted proxy, and then it will replace it with the last non-trusted IP from the header.
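The trusted-proxy rule amounts to walking the `X-Forwarded-For` chain from the right and taking the first address that is not a known proxy. A minimal sketch of that logic (illustrative, not the actual nginx or crates.io code; all addresses are made up):

```rust
use std::collections::HashSet;

// Sketch of the trusted-proxy rule described above: if the connection came
// from one of our proxies, walk the X-Forwarded-For chain right-to-left and
// return the last (rightmost) address that is NOT a trusted proxy.
fn client_ip<'a>(
    remote_addr: &'a str,
    forwarded_for: &'a str,
    trusted: &HashSet<&str>,
) -> &'a str {
    if !trusted.contains(remote_addr) {
        // Direct connection from an untrusted host: the header could say
        // anything, so ignore it entirely.
        return remote_addr;
    }
    forwarded_for
        .split(',')
        .map(str::trim)
        .rev()
        .find(|ip| !trusted.contains(ip))
        .unwrap_or(remote_addr)
}

fn main() {
    let trusted: HashSet<&str> = ["10.0.0.1", "10.0.0.2"].into_iter().collect();
    // Honest chain: the client IP is the last non-trusted entry.
    assert_eq!(
        client_ip("10.0.0.2", "203.0.113.7, 10.0.0.1", &trusted),
        "203.0.113.7"
    );
    // Spoofing attempt: the forged leftmost entry is never reached.
    assert_eq!(
        client_ip("10.0.0.2", "6.6.6.6, 203.0.113.7, 10.0.0.1", &trusted),
        "203.0.113.7"
    );
    println!("ok");
}
```

Implementations that naively take the leftmost value would report `6.6.6.6` in the second case, which is exactly the spoofing hole the allowlist closes.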
I've given this some local testing to make sure everything is working properly (as best as I'm able to locally). I've confirmed that it is in fact setting the real client IP. We'll still want to test this on staging before it gets deployed though.
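For reference, a minimal nginx sketch of the setup described above. The directive names are real (`ngx_http_realip_module` and `ngx_http_limit_req_module`), but the addresses, zone name, and location pattern here are illustrative, not the exact production config:

```nginx
# Only trust X-Forwarded-For when the connection comes from our proxy tier
# (address range is illustrative).
set_real_ip_from 10.0.0.0/8;
real_ip_header X-Forwarded-For;
real_ip_recursive on;   # walk past every trusted proxy in the chain

# Leaky-bucket limit keyed on the (now corrected) client IP:
# a 10 MB zone holds roughly 80,000 IPs, refilling 1 request per minute.
limit_req_zone $binary_remote_addr zone=publish:10m rate=1r/m;

location /api/v1/crates/new {
    # burst=10 allows 10 quick publishes before rejections start;
    # nodelay rejects the excess instead of queueing it.
    limit_req zone=publish burst=10 nodelay;
    limit_req_status 429;
    proxy_pass http://app;
}
```

With `real_ip_recursive on`, nginx replaces remote_addr with the last non-trusted entry in the header, which is what makes the per-IP key meaningful behind Heroku's router.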
This was not intended to be deleted
Adjacent to this code: as per our convo in the triage meeting, I think it would be beneficial to document this on the policies page!
Note on above: I agree, but the wording is tricky and IMO should not block landing this.
I always prefer to use PRs as a forcing function for docs; very happy to help wordsmith here if you'd like (or in PMs).
I've tested this in my staging area. The IP address shows correctly in the logs, and after making 10 requests to the publish endpoint I started getting responses with a 429 status. This was on the heroku-16 stack. The buildpack pulls in nginx-1.9.5 on that stack. I'm not sure what stack we are on in production, but it looks like Heroku updated the buildpack to nginx-1.14.1 on heroku-18, which is nice.
We talked a while back about policies, but it sorta stalled out. I've pushed up what I think would make the most sense here. I've put it under "Package Ownership" since this doesn't really have anything to do with crawlers. The wording "to claim ownership of a large number of packages" is deliberate, as it's the only wording I could come up with that separates the cases we care about from "we use a CI tool or a script to publish our suite of 10 crates", which we don't want to prevent.
I'm r+ on the policy language, and this PR overall. |
@bors r+ |
📌 Commit ac4d090 has been approved by
☀️ Test successful - checks-travis
In rust-lang/crates.io#1596 a rate limit was added to the crate publishing endpoint (10 crates/min). As stated in that PR, the strategy allows uploading 10 crates up front before the rate starts to be applied. If the rate is exceeded, crates.io returns an HTTP 429 response. Cargo tries to publish all of the workspace's crates as fast as possible (as we should expect), but when there are more than 10 crates and it tries to publish the 11th, a minute usually hasn't passed yet, and therefore we get an HTTP 429. This caused some users to experience confusing, poorly explained error messages, as seen in rust-lang/crates.io#1643. The limit got increased to 30 crates/min, but who knows. So this closes rust-lang#6714 while a better solution is not found.
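The arithmetic behind that failure can be made concrete with a small sketch. This is a hypothetical pacing model, not Cargo's actual code: with a burst of 10 and one token refilled per minute, any publish past the 10th must wait a full refill period or it will hit a 429.

```rust
use std::time::Duration;

// Hypothetical pacing sketch (not Cargo's implementation): how long a
// client must wait before publish number `request_index` (0-based) to
// stay under a burst-of-`burst`, one-token-per-`refill` rate limit.
fn delay_before(request_index: usize, burst: usize, refill: Duration) -> Duration {
    if request_index < burst {
        // Still inside the initial burst allowance: no wait needed.
        Duration::ZERO
    } else {
        // Past the burst: one full refill period per extra request.
        refill
    }
}

fn main() {
    let refill = Duration::from_secs(60);
    // Publishes 1 through 10 go out immediately...
    assert_eq!(delay_before(0, 10, refill), Duration::ZERO);
    assert_eq!(delay_before(9, 10, refill), Duration::ZERO);
    // ...but the 11th must wait a full minute to avoid a 429.
    assert_eq!(delay_before(10, 10, refill), Duration::from_secs(60));
    println!("ok");
}
```

Cargo publishing back-to-back effectively uses a zero delay everywhere, which is why the 11th crate in a workspace reliably failed before the limit was raised.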
The configuration for this nginx module was originally added in rust-lang#1596, in combination with an IP-based `limit_req` rate limit. The rate limit was later moved to the application layer (see rust-lang#6875). Since we no longer use the `limit_req` rate limit in the nginx config, it looks like we don't need the `real_ip` config anymore either.