Publish an official crawler policy #1535


Merged: 1 commit into rust-lang:master on Nov 3, 2018

Conversation

@sgrif (Contributor) commented Oct 23, 2018

This is a formalization of a policy that we've been informally enforcing
for some time now. The policy basically boils down to:

- Just use the index if you can.
- If you can't, contact us to see if we can help in a way that doesn't
  require crawling.
- If you do crawl, limit yourself to 1 request per second (see the sketch
  at the end of this comment).
- You also have to provide a user agent that actually identifies
  your crawler and includes contact information.
- We may still block you if you impact the integrity of the
  service.

I chose not to explicitly call out inflating a single crate's download
numbers as something that's forbidden, as it felt like doing that would
be an instance of ["Don't shove beans up your
nose"](https://en.wikipedia.org/wiki/Wikipedia:Don%27t_stuff_beans_up_your_nose).
That falls under the umbrella of "impacting the integrity of the
service", though, so this policy does give us an explicit reason to block
anyone engaging in that behavior.
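
For the user-agent and rate-limit points above, here is a minimal sketch of what a compliant crawler might look like in Rust. It assumes the `reqwest` crate; the crawler name, contact address, and target URL are hypothetical placeholders, not part of the policy itself:

```rust
use std::{thread, time::Duration};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Identify the crawler and include contact information in the user agent
    // (the name and address here are made up for the example).
    let client = reqwest::blocking::Client::builder()
        .user_agent("my-crates-crawler/1.0 (contact: ops@example.com)")
        .build()?;

    // Hypothetical set of pages to fetch.
    let urls = ["https://crates.io/api/v1/crates/serde"];

    for url in urls {
        let resp = client.get(url).send()?;
        println!("{} -> {}", url, resp.status());
        // Stay at or below 1 request per second.
        thread::sleep(Duration::from_secs(1));
    }
    Ok(())
}
```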

@sgrif (Contributor, Author) commented Oct 23, 2018

r? @rust-lang/crates-io

@sgrif requested a review from ashleygwilliams on October 23, 2018 22:16
@sgrif force-pushed the sg-crawler-policy branch from cfd80a0 to a35fea9 on October 23, 2018 22:16
@ishitatsuyuki (Contributor):

> If you do crawl, limit yourself to 1 request per second

I'd prefer something better than a hardcoded number. The idea I have now is "don't make concurrent requests".

GitHub's abuse policy may be a good reference. Some highlights:

- They recommend making requests serially for a single user ID. (Maybe this is related to how they shard data?)
- They do recommend slowing requests to a specified rate, but only for mutations.
- They emphasize honoring the Retry-After header, which is an HTTP standard (see the sketch after this list).
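
A hedged sketch of that last point, again assuming `reqwest`: wait out a server-provided `Retry-After` before retrying. Only the delay-in-seconds form of the header is handled here; the HTTP-date form and a cap on retries are omitted for brevity:

```rust
use std::{thread, time::Duration};

// On 429/503, sleep for the server-provided Retry-After and try again.
fn get_with_backoff(
    client: &reqwest::blocking::Client,
    url: &str,
) -> reqwest::Result<reqwest::blocking::Response> {
    loop {
        let resp = client.get(url).send()?;
        if resp.status() == 429 || resp.status() == 503 {
            // Retry-After as a number of seconds; default to 1s if the header
            // is absent or in the HTTP-date form this sketch doesn't parse.
            let secs = resp
                .headers()
                .get(reqwest::header::RETRY_AFTER)
                .and_then(|v| v.to_str().ok())
                .and_then(|s| s.parse::<u64>().ok())
                .unwrap_or(1);
            thread::sleep(Duration::from_secs(secs));
            continue;
        }
        return Ok(resp);
    }
}
```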

@sgrif (Contributor, Author) commented Oct 24, 2018

"Here are some guidelines to avoid hitting our abuse rate limits" isn't really comparable to a general crawler policy. GitHub is an entirely different kind of application, which expects most of its traffic to come from API clients. This policy is heavily based on a similar policy from NPM. The limit for bots is intentionally low, as we would much prefer that they use the index or reach out to us with their needs.

@ishitatsuyuki (Contributor):

I agree with you that we're different from GitHub.

However, I think the reason people crawl is that the index doesn't contain enough data to run a mirror (like crates.rs) or aggregate stats. I will be working on #630 so we can eliminate most of the need for crawling. This is also where we currently differ from NPM: NPM runs a CouchDB instance so that people can download and stream the full registry data anytime they want, and therefore don't need to hit the API.

Though, given that we only have 20k crates, 1 request per second seems reasonable for now. We probably want to revisit it when we implement generic rate limits and/or start to provide database dumps, but otherwise I have no objection to this as a general bot policy.
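
To make the "just use the index" option concrete: the index is a plain git repository (https://github.com/rust-lang/crates.io-index) whose files contain one JSON line per published version, so consumers can read it without touching the API at all. A minimal sketch, assuming a local clone and the `serde_json` crate:

```rust
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumes `git clone https://github.com/rust-lang/crates.io-index` has
    // already been run. Index paths are derived from the crate name, e.g.
    // "se/rd/serde" for names of length four or more.
    let contents = fs::read_to_string("crates.io-index/se/rd/serde")?;

    // Each line is a standalone JSON document describing one published version.
    for line in contents.lines() {
        let version: serde_json::Value = serde_json::from_str(line)?;
        println!("{} {}", version["name"], version["vers"]);
    }
    Ok(())
}
```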

@sgrif (Contributor, Author) commented Oct 24, 2018

Yes, I agree that we should aim to provide better solutions to their needs. That doesn't change the fact that we have limited resources, and we need to prioritize user traffic over bots.

@joshtriplett (Member):

I would request that, rather than giving a specific rate limit, we instead tell people that the server may throttle excessive requests from a single user, and that they should not attempt to work around that throttling.

@sgrif (Contributor, Author) commented Oct 24, 2018

@joshtriplett The reason that we request a rate limit in the policy is that the limit we want to set for bots is much lower than the limit we can actually reasonably set programmatically.
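
For illustration, a programmatic limit of the kind referred to here is typically something like a token bucket. A hedged sketch follows; the capacity and refill rate are invented for the example and are not what crates.io enforces:

```rust
use std::time::Instant;

// A minimal token-bucket rate limiter: each request spends one token,
// and tokens refill at a fixed rate up to a burst capacity.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        TokenBucket {
            capacity,
            tokens: capacity,
            refill_per_sec,
            last_refill: Instant::now(),
        }
    }

    // Returns true if the request is allowed, false if it should be throttled.
    fn allow(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.last_refill = now;
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Invented numbers: bursts of up to 10 requests, refilling at 5/sec.
    let mut bucket = TokenBucket::new(10.0, 5.0);
    for i in 0..12 {
        let verdict = if bucket.allow() { "allowed" } else { "throttled" };
        println!("request {}: {}", i, verdict);
    }
}
```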

@sgrif (Contributor, Author) commented Nov 3, 2018

bors: r+

bors-voyager bot added a commit that referenced this pull request Nov 3, 2018
1535: Publish an official crawler policy r=sgrif a=sgrif


Co-authored-by: Sean Griffin <sean@seantheprogrammer.com>
@bors-voyager bot commented Nov 3, 2018

Build succeeded

@bors-voyager bot merged commit a35fea9 into rust-lang:master on Nov 3, 2018
@sgrif deleted the sg-crawler-policy branch on March 9, 2019 01:33