Publish an official crawler policy #1535
Conversation
r? @rust-lang/crates-io
This is a formalization of a policy that we've been informally enforcing for some time now. The policy basically boils down to:

- Just use the index if you can.
- Contact us to see if we can help in a way that doesn't require crawling if you can't.
- If you do crawl, limit to 1 request per second.
- You also have to provide a user agent, which should actually identify your crawler and have contact information.
- We may still block you if you cause an impact on the integrity of the service.

I chose not to explicitly call out inflating a single crate's download numbers as something that's forbidden, as it felt like doing that would be an instance of ["Don't shove beans up your nose"](https://en.wikipedia.org/wiki/Wikipedia:Don%27t_stuff_beans_up_your_nose). That falls under the umbrella of "impacting the integrity of the service", though, so this policy does give us an explicit reason to block anyone engaging in that behavior.
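In practice, a crawler that follows this policy only needs two things: a descriptive `User-Agent` and a delay between requests. The sketch below is just an illustration, assuming the `reqwest` crate; the crawler name, contact address, and example URLs are placeholders, not anything the policy mandates.

```rust
// Minimal sketch of a crawler that follows the policy above. The crawler
// name, contact info, and URLs are hypothetical; adjust for your own project.
use std::{thread, time::Duration};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Identify the crawler and include contact information in the User-Agent.
    let client = reqwest::blocking::Client::builder()
        .user_agent("my_crawler/0.1.0 (https://example.com/my_crawler; ops@example.com)")
        .build()?;

    let urls = [
        "https://crates.io/api/v1/crates/serde",
        "https://crates.io/api/v1/crates/rand",
    ];

    for url in &urls {
        let body = client.get(*url).send()?.error_for_status()?.text()?;
        println!("fetched {} ({} bytes)", url, body.len());

        // Stay at or below 1 request per second.
        thread::sleep(Duration::from_secs(1));
    }

    Ok(())
}
```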
Force-pushed from cfd80a0 to a35fea9.
I'd prefer something better than a hardcoded number. The idea I have now is to say "don't make concurrent requests". GitHub's abuse policy may be a good reference. Some highlights are:
"Here are some guidelines to avoid hitting our abuse rate limits" isn't really comparable to a general crawler policy. GitHub is an entirely different kind of application, which expects most of its traffic to come from API clients. The policy is heavily based on a similar policy from NPM. The limit for bots is intentionally low, as we would much prefer that they use the index or reach out to us with their needs.
I agree with you that we're different from GitHub. However, I think the reason people are crawling is that the index doesn't contain enough data to run a mirror (like crates.rs) or to aggregate stats. I will be working on #630 so we can eliminate most of the need for crawling. This is also how we currently differ from NPM: NPM runs a CouchDB instance so that people can download and stream the full registry data anytime they want, so there's no need to hit the API. That said, given that we only have 20k crates, 1 request per second seems reasonable for now. We probably want to revisit it when we implement generic rate limits and/or start to provide database dumps, but otherwise I have no objection to this as a general bot policy.
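As a concrete illustration of "just use the index if you can": crate names, version numbers, and yank status are already available from a plain git clone of crates.io-index, with one JSON object per published version per line. The sketch below is only an example under those assumptions (local checkout path, `serde`/`serde_json` for parsing, and a small subset of the fields), not official tooling.

```rust
// Sketch: read crate metadata from a local clone of crates.io-index instead
// of crawling the API. Path layout and field names follow the index format;
// the checkout location is a placeholder.
use serde::Deserialize;
use std::fs;

#[derive(Debug, Deserialize)]
struct IndexEntry {
    name: String,
    vers: String,
    yanked: bool,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // e.g. after `git clone https://github.com/rust-lang/crates.io-index`
    let contents = fs::read_to_string("crates.io-index/se/rd/serde")?;

    // One newline-delimited JSON object per published version.
    for line in contents.lines() {
        let entry: IndexEntry = serde_json::from_str(line)?;
        println!("{} {} (yanked: {})", entry.name, entry.vers, entry.yanked);
    }

    Ok(())
}
```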
Yes, I agree that we should aim to provide better solutions to their needs. That doesn't change the fact that we have limited resources, and we need to prioritize user traffic over bots.
I would request that, rather than giving a specific rate limit, we instead tell people that the server may throttle excessive requests from a single user, and that they should not attempt to work around that throttling.
@joshtriplett The reason that we request a rate limit in the policy is that the limit we want to set for bots is much lower than the limit we can actually reasonably set programmatically.
bors: r+
1535: Publish an official crawler policy r=sgrif a=sgrif

Co-authored-by: Sean Griffin <sean@seantheprogrammer.com>
Build succeeded