
Rate limit the crate publish endpoint #1596

Merged: 3 commits merged into rust-lang:master on Feb 14, 2019

Conversation

@sgrif (Contributor) commented Jan 16, 2019

This adds a restriction to the number of crates that can be published
from a single IP address in a single time period. This is done per IP
instead of per-user, since that's what we can do with nginx alone. We
may want to do additional rate limiting per token to force spammers to
register additional accounts, and not just change their IP.

This will use what's called the "leaky bucket" strategy for rate
limiting. Basically every user has a bucket of tokens they use to
publish crates. They start with 10 tokens available. Every time they hit
this endpoint, they use one of the tokens. We give them a new token each
minute.

What this means is that you can upload 1 crate per minute, but we allow
you to upload up to 10 in a short period of time before we start
enforcing the rate limit.
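The token accounting described above can be sketched as follows. This is a hypothetical Rust illustration of the algorithm only; the actual enforcement in this PR is done by nginx's `limit_req` module, not by application code:

```rust
use std::time::{Duration, Instant};

/// Minimal leaky-bucket sketch: start with `burst` tokens and earn one
/// new token every `rate` interval, never exceeding `burst`.
struct LeakyBucket {
    tokens: f64,
    burst: f64,
    rate: Duration, // time to earn one token
    last_refill: Instant,
}

impl LeakyBucket {
    fn new(burst: u32, rate: Duration) -> Self {
        LeakyBucket {
            tokens: burst as f64,
            burst: burst as f64,
            rate,
            last_refill: Instant::now(),
        }
    }

    /// Returns true if a token was available (request allowed).
    fn try_acquire(&mut self) -> bool {
        // Credit tokens earned since the last call, capped at the burst size.
        let now = Instant::now();
        let earned = now.duration_since(self.last_refill).as_secs_f64()
            / self.rate.as_secs_f64();
        self.tokens = (self.tokens + earned).min(self.burst);
        self.last_refill = now;

        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false // caller should respond with 429
        }
    }
}

fn main() {
    // 10-token burst, one new token per minute, as in the PR description.
    let mut bucket = LeakyBucket::new(10, Duration::from_secs(60));
    // The first 10 publishes succeed immediately; the 11th is rejected.
    let allowed = (0..11).filter(|_| bucket.try_acquire()).count();
    println!("{allowed}"); // prints 10
}
```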

When someone does hit the rate limit, they will receive a 429 response
(Too Many Requests, though since it's a newer status code I'm not sure
nginx actually knows that text description). We could also allow it to
instead slow down the requests, refusing to process them until a token
is available (queueing a max of 10 requests).

This reserves 10 megabytes of memory for the IP table, which means we can
hold about 80,000 IPs. When the table is full, it will attempt to drop
the oldest record, and if that doesn't free enough space, it'll give a
503. Keep in mind this is all in memory, not shared between our servers.
Since each server keeps its own bucket, it is possible (but not
guaranteed) that someone can upload 20 crates at once, and then send 2
requests per minute.
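In nginx terms, the bucket lives in a `limit_req_zone` shared-memory zone. A sketch of the relevant directives follows; the values mirror the description above, but the endpoint path and exact layout are illustrative rather than a verbatim copy of this PR's config:

```nginx
# 10 MB zone keyed on client IP holds roughly 80,000 tracked addresses.
# rate=1r/m earns one token per minute; burst=10 is the bucket size.
limit_req_zone $binary_remote_addr zone=publish:10m rate=1r/m;

server {
    location = /api/v1/crates/new {
        limit_req zone=publish burst=10 nodelay;
        limit_req_status 429;  # returned once the burst is exhausted
    }
}
```

With `nodelay`, burst requests are served immediately rather than queued; dropping it would give the "slow down the requests" alternative mentioned above.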

On Heroku and any system that involves proxies, the remote_addr of the
request becomes useless, as it will be the proxy forwarding your request
rather than the client itself. Most proxies will set or append an
X-Forwarded-For header, putting whatever remote_addr they received at
the end of the header. So if there are 3 proxies, the header will
look like

real_ip, proxy1, proxy2

However, if we're not careful this makes us vulnerable to IP spoofing. A
lot of implementations just look at the first value in the header. All
the proxies just append the header though (they don't know if they're
getting traffic from the origin or a proxy), so you end up with

spoofed_ip, real_ip, proxy1, proxy2

nginx, Rails, and many other implementations avoid this by requiring an
allowlist of trusted IPs. With this configuration, the realip only kicks
in if remote_addr is a trusted proxy, and then it will replace it with
the last non-trusted IP from the header.
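The nginx side of that allowlist looks roughly like this (a sketch with placeholder proxy addresses, not the exact production values; `real_ip_recursive on` is what makes nginx walk past every trusted hop instead of only the last one):

```nginx
# Only traffic arriving from these addresses may rewrite the client IP.
set_real_ip_from 10.0.0.0/8;      # placeholder trusted proxy range

real_ip_header    X-Forwarded-For;
real_ip_recursive on;  # take the last address NOT in set_real_ip_from

# Given "spoofed_ip, real_ip, proxy1, proxy2" with proxy1/proxy2 trusted,
# $remote_addr becomes real_ip and spoofed_ip is ignored.
```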

@sgrif (Contributor, Author) commented Jan 16, 2019

I've given this some local testing to make sure everything is working properly (as best as I'm able to locally). I've confirmed that it is in fact setting the X-Real-Ip header to what it's expected to be, that the publish endpoint is rate limited, that all other endpoints are not rate limited, and that the changes to how the various proxy headers are set still work as before.

We'll still want to test this on staging before it gets deployed though

This was not intended to be deleted
@ashleygwilliams (Member) left a comment
adjacent to this code- as per our convo in the triage meeting i think it would be beneficial to document this on the policies page!

@sgrif (Contributor, Author) commented Jan 17, 2019

Note on above: I agree, but the wording is tricky and IMO should not block landing this

@ashleygwilliams (Member)

i always prefer to use PRs as a forcing function for docs- very happy to help wordsmith here if you'd like (or in PMs)

@jtgeibel (Member)

I've tested this in my staging area. The IP address shows correctly in the logs and after making 10 requests to the publish endpoint I started getting responses with a 429 status.

This was on the heroku-16 stack. The buildpack pulls in nginx-1.9.5 on that stack. I'm not sure what stack we are on in production, but it looks like Heroku updated the buildpack to nginx-1.14.1 on heroku-18, which is nice.

@sgrif (Contributor, Author) commented Jan 28, 2019

We talked a while back about policies, but it sorta stalled out. I've pushed up what I think would make the most sense here. I've put it under "Package Ownership" since this doesn't really have anything to do with crawlers. The wording "to claim ownership of a large number of packages" is deliberate, as it's the only wording I could come up with that separates the cases we care about from "we use a CI tool or a script to publish our suite of 10 crates" which we don't want to prevent.

@jtgeibel (Member) commented Feb 2, 2019

I'm r+ on the policy language, and this PR overall.

@sgrif (Contributor, Author) commented Feb 14, 2019

@bors r+

@bors (Contributor) commented Feb 14, 2019

📌 Commit ac4d090 has been approved by sgrif

bors added a commit that referenced this pull request Feb 14, 2019
Rate limit the crate publish endpoint

@bors (Contributor) commented Feb 14, 2019

⌛ Testing commit ac4d090 with merge 1aa4b1d...

@bors (Contributor) commented Feb 14, 2019

☀️ Test successful - checks-travis
Approved by: sgrif
Pushing 1aa4b1d to master...

@bors bors merged commit ac4d090 into rust-lang:master Feb 14, 2019
@sgrif sgrif deleted the sg-realip branch March 9, 2019 01:34
CPerezz added a commit to CPerezz/cargo that referenced this pull request Jan 31, 2021
rust-lang/crates.io#1596 added a rate limit to the crate publish
endpoint (10 crates/min). As described in that PR, the strategy allows
10 crates to be uploaded up front before the rate starts to be applied.
If the rate is exceeded, crates.io returns a 429 HTTP response.

Cargo tries to publish all of the workspace/crate components as fast
as possible (as we should expect), but when we have more than 10
components and try to publish the 11th, a minute usually hasn't
passed yet, so we get an HTTP 429.

This caused some users to see confusing, poorly explained error
messages, as seen in rust-lang/crates.io#1643.

The limit has since been increased to 30 crates/min. But who knows.

So this closes rust-lang#6714 until a better solution is found.
Turbo87 added a commit to Turbo87/crates.io that referenced this pull request Oct 17, 2023
The configuration for this nginx module was originally added in rust-lang#1596, in combination with an IP-based `limit_req` rate limit. That rate limit has since been moved to the application layer (see rust-lang#6875).

Since we no longer use the `limit_req` rate limit in the nginx config, it looks like we don't need the `real_ip` config anymore either.