Publish an official crawler policy #1535


Merged: 1 commit into rust-lang:master on Nov 3, 2018

Conversation

@sgrif (Contributor) commented Oct 23, 2018

This is a formalization of a policy that we've been informally enforcing
for some time now. The policy basically boils down to:

- Just use the index if you can.
- If you can't, contact us to see if we can help in a way that doesn't
  require crawling.
- If you do crawl, limit yourself to 1 request per second (see the sketch
  at the end of this comment).
- You also have to provide a user agent that actually identifies
  your crawler and includes contact information.
- We may still block you if you impact the integrity of the
  service.

I chose not to explicitly call out inflating a single crate's download
numbers as something that's forbidden, as it felt like doing that would
be an instance of ["Don't shove beans up your
nose"](https://en.wikipedia.org/wiki/Wikipedia:Don%27t_stuff_beans_up_your_nose).
That falls under the umbrella of "impacting the integrity of the
service", though, so this policy does give us an explicit reason to block
anyone engaging in that behavior.
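
For the user-agent and rate-limit points above, here is a minimal sketch of what a compliant crawler might look like in Rust. It assumes the `reqwest` crate; the crawler name, contact address, and target URL are hypothetical placeholders, not part of the policy itself:

```rust
use std::{thread, time::Duration};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Identify the crawler and include contact information in the user agent
    // (the name and address here are made up for the example).
    let client = reqwest::blocking::Client::builder()
        .user_agent("my-crates-crawler/1.0 (contact: ops@example.com)")
        .build()?;

    // Hypothetical set of pages to fetch.
    let urls = ["https://crates.io/api/v1/crates/serde"];

    for url in urls {
        let resp = client.get(url).send()?;
        println!("{} -> {}", url, resp.status());
        // Stay at or below 1 request per second.
        thread::sleep(Duration::from_secs(1));
    }
    Ok(())
}
```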

@sgrif (Contributor, Author) commented Oct 23, 2018

r? @rust-lang/crates-io

@sgrif requested a review from ashleygwilliams on October 23, 2018 22:16
@sgrif force-pushed the sg-crawler-policy branch from cfd80a0 to a35fea9 on October 23, 2018 22:16
@ishitatsuyuki (Contributor):

> If you do crawl, limit yourself to 1 request per second

I'd prefer something better than a hardcoded number. The idea I have now is "don't make concurrent requests".

GitHub's abuse policy may be a good reference. Some highlights:

- They recommend making requests serially for a single user ID. (Maybe this is related to how they shard data?)
- They do recommend slowing requests to a specified rate, but only for mutations.
- They emphasize honoring the Retry-After header, which is an HTTP standard (see the sketch after this list).
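
A hedged sketch of that last point, again assuming `reqwest`: wait out a server-provided `Retry-After` before retrying. Only the delay-in-seconds form of the header is handled here; the HTTP-date form and a cap on retries are omitted for brevity:

```rust
use std::{thread, time::Duration};

// On 429/503, sleep for the server-provided Retry-After and try again.
fn get_with_backoff(
    client: &reqwest::blocking::Client,
    url: &str,
) -> reqwest::Result<reqwest::blocking::Response> {
    loop {
        let resp = client.get(url).send()?;
        if resp.status() == 429 || resp.status() == 503 {
            // Retry-After as a number of seconds; default to 1s if the header
            // is absent or in the HTTP-date form this sketch doesn't parse.
            let secs = resp
                .headers()
                .get(reqwest::header::RETRY_AFTER)
                .and_then(|v| v.to_str().ok())
                .and_then(|s| s.parse::<u64>().ok())
                .unwrap_or(1);
            thread::sleep(Duration::from_secs(secs));
            continue;
        }
        return Ok(resp);
    }
}
```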

@sgrif (Contributor, Author) commented Oct 24, 2018

"Here are some guidelines to avoid hitting our abuse rate limits" isn't really comparable to a general crawler policy. GitHub is an entirely different kind of application, which expects most of its traffic to come from API clients. This policy is heavily based on a similar policy from NPM. The limit for bots is intentionally low, as we would much prefer that they use the index or reach out to us with their needs.

@ishitatsuyuki (Contributor):

I agree with you that we're different from GitHub.

However, I think the reason people crawl is that the index doesn't contain enough data to run a mirror (like crates.rs) or aggregate stats. I will be working on #630 so we can eliminate most of the need for crawling. This is also where we currently differ from NPM: NPM runs a CouchDB instance so that people can download and stream the full registry data anytime they want, and therefore don't need to hit the API.

Though, given that we only have 20k crates, 1 request per second seems reasonable for now. We probably want to revisit it when we implement generic rate limits and/or start to provide database dumps, but otherwise I have no objection to this as a general bot policy.
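
To make the "just use the index" option concrete: the index is a plain git repository (https://github.com/rust-lang/crates.io-index) whose files contain one JSON line per published version, so consumers can read it without touching the API at all. A minimal sketch, assuming a local clone and the `serde_json` crate:

```rust
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumes `git clone https://github.com/rust-lang/crates.io-index` has
    // already been run. Index paths are derived from the crate name, e.g.
    // "se/rd/serde" for names of length four or more.
    let contents = fs::read_to_string("crates.io-index/se/rd/serde")?;

    // Each line is a standalone JSON document describing one published version.
    for line in contents.lines() {
        let version: serde_json::Value = serde_json::from_str(line)?;
        println!("{} {}", version["name"], version["vers"]);
    }
    Ok(())
}
```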

@sgrif (Contributor, Author) commented Oct 24, 2018

Yes, I agree that we should aim to provide better solutions to their needs. That doesn't change the fact that we have limited resources, and we need to prioritize user traffic over bots.

@joshtriplett (Member):

I would request that, rather than giving a specific rate limit, we instead tell people that the server may throttle excessive requests from a single user, and that they should not attempt to work around that throttling.

@sgrif (Contributor, Author) commented Oct 24, 2018

@joshtriplett The reason that we request a rate limit in the policy is that the limit we want to set for bots is much lower than the limit we can actually reasonably set programmatically.
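
For illustration, a programmatic limit of the kind referred to here is typically something like a token bucket. A hedged sketch follows; the capacity and refill rate are invented for the example and are not what crates.io enforces:

```rust
use std::time::Instant;

// A minimal token-bucket rate limiter: each request spends one token,
// and tokens refill at a fixed rate up to a burst capacity.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        TokenBucket {
            capacity,
            tokens: capacity,
            refill_per_sec,
            last_refill: Instant::now(),
        }
    }

    // Returns true if the request is allowed, false if it should be throttled.
    fn allow(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.last_refill = now;
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Invented numbers: bursts of up to 10 requests, refilling at 5/sec.
    let mut bucket = TokenBucket::new(10.0, 5.0);
    for i in 0..12 {
        let verdict = if bucket.allow() { "allowed" } else { "throttled" };
        println!("request {}: {}", i, verdict);
    }
}
```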

@sgrif (Contributor, Author) commented Nov 3, 2018

bors: r+

bors-voyager bot added a commit that referenced this pull request Nov 3, 2018
1535: Publish an official crawler policy r=sgrif a=sgrif


Co-authored-by: Sean Griffin <sean@seantheprogrammer.com>
@bors-voyager bot commented Nov 3, 2018

Build succeeded

@bors-voyager bot merged commit a35fea9 into rust-lang:master on Nov 3, 2018
@sgrif deleted the sg-crawler-policy branch on March 9, 2019 01:33