core(is-crawlable): make sure that page is not blocked by robots.txt file #4548
Conversation
return response.text()
  .then(content => ({status: response.status, content}));
})
.catch(_ => ({status: null, content: null}));
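For context, the snippet above is only the tail of a fetch chain. A minimal, hypothetical sketch of the whole helper (the function name and the way the robots.txt URL is built are assumptions for illustration, not the PR's actual code) could look like this, run in the page context where fetch is available:

// Hypothetical sketch: fetch robots.txt from the page's origin and normalize
// the result to {status, content}. On a network error both fields are nulled,
// so a missing/unreachable robots.txt can be told apart from one that exists.
function getRobotsTxtContent() {
  return fetch(new URL('/robots.txt', location.href).href)
    .then(response => {
      return response.text()
        .then(content => ({status: response.status, content}));
    })
    .catch(_ => ({status: null, content: null}));
}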
Second part of #4356 is a new audit that checks if the robots.txt file is valid. That audit should fail if the fetch returns HTTP 500+, and that's why we collect the status here.
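A hypothetical sketch of that follow-up check (not the actual audit code, just the rule described above) might be:

// Hypothetical sketch of how the collected status could be used: an HTTP 500+
// response means robots.txt could not be served, so crawlers may treat the
// whole site as blocked and the audit should fail.
function isRobotsTxtServedCorrectly(robotsTxt) {
  // status === null means the fetch itself failed (no robots.txt reachable),
  // which this sketch treats as acceptable.
  if (robotsTxt.status === null) return true;
  return robotsTxt.status < 500;
}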
I love the PR. Just one question:
Right now we don't really expose where the blocking directive came from: meta tag, response header, or robots.txt. Feels like knowing that, in the case of a failure, would be really useful. Wdyt?
Also, looking at the robots-parser test suite, do you think there are any additional cases they should add? I'm looking at:
https://github.com/ChristopherAkroyd/robots-txt-parser/tree/master/test/test-data
https://github.com/python/cpython/blob/master/Lib/test/test_robotparser.py
https://github.com/meanpath/robots/tree/master/test
@paulirish we do show where the blocking directive comes from (header, meta, robots.txt) - check out the screenshot at the top. It'd be great to show which robots.txt directive is blocking, but
Regarding tests - I quickly scanned the test suites that you linked and
First part of #4356.
Extends the is-crawlable audit so that it also verifies the robots.txt file (in addition to the meta tag and header).
This change adds a new dependency, robots-parser. It's a lightweight, well-tested robots.txt parser with no dependencies.
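A minimal sketch of how robots-parser can be applied to this check (the URLs and rules below are illustrative, and the audit's actual integration may differ):

// Sketch: use robots-parser to decide whether a page URL is blocked by the
// fetched robots.txt content.
const robotsParser = require('robots-parser');

const robotsTxtUrl = 'https://example.com/robots.txt';
const robotsTxtContent = 'User-agent: *\nDisallow: /private/';

const robots = robotsParser(robotsTxtUrl, robotsTxtContent);

// isAllowed returns false when a Disallow rule matches the URL for the given
// user agent, so the audit can flag the page as blocked by robots.txt.
console.log(robots.isAllowed('https://example.com/private/page.html', 'Googlebot')); // false
console.log(robots.isAllowed('https://example.com/public/page.html', 'Googlebot'));  // true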