new_audit(blocked-from-indexing): page is blocked from indexing #3657
maybe I reviewed too early 😄
seems like there's some weirdness around the user agent bit?
  return false;
}

const date = Date.parse(parts[1]);
seems like we would want to do a parts.slice(1).join(':'), no? Otherwise we're losing the time and timezone
Oh, good catch! Tests didn't catch that because Date.parse is very forgiving.
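A minimal sketch of the truncation discussed above, assuming a directive value modeled on the examples in this thread (the string itself is hypothetical). It only compares the two strings and does not call Date.parse, since how leniently a truncated date is parsed depends on the engine:

```javascript
// Hypothetical directive value, modeled on the examples in this thread.
const directive = 'unavailable_after: 25 Jun 2027 15:00:00 PST';
const parts = directive.split(':');

// parts[1] stops at the first colon inside the time, so the minutes,
// seconds and timezone are dropped.
const truncated = parts[1].trim();

// Rejoining everything after the key restores the full date string.
const full = parts.slice(1).join(':').trim();

console.log(truncated); // 25 Jun 2027 15
console.log(full);      // 25 Jun 2027 15:00:00 PST
```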
 */
function hasUA(directives) {
  const parts = directives.split(':');
  return parts.length > 1 && parts[0] !== UNAVAILABLE_AFTER;
should it be looking for a GOOGLEBOT_USER_AGENT const maybe?
let's also rename this then, if it returns false when a UA is specified how about hasNoUserAgent or something
}

mainResource.responseHeaders
  .filter(h => h.name.toLowerCase() === ROBOTS_HEADER && !hasUA(h.value) &&
so to be clear, we're looking for the robots header that has a user agent specified and is specified to block?
seems like this might miss a few cases, or maybe I'm misunderstanding
can there not be multiple directives in the header?
We are looking for a robots header that doesn't have a UA specified (we don't support UA-specific headers) and that blocks indexing.
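The selection rule described in this reply could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual code: the constant names follow the snippets quoted in this review, but their values, the helper names hasUserAgent and hasBlockingDirective, the set of blocking values, and the sample headers are all assumed, and unavailable_after date handling is left out.

```javascript
// Constant names follow the snippets quoted in this review; their values
// here are assumptions.
const ROBOTS_HEADER = 'x-robots-tag';
const UNAVAILABLE_AFTER = 'unavailable_after';
// Assumed set of blocking values; unavailable_after dates are left out.
const BLOCKING_DIRECTIVES = new Set(['noindex', 'none']);

/** Returns true if the header value is scoped to a user agent. */
function hasUserAgent(value) {
  const parts = value.split(':');
  return parts.length > 1 && parts[0].trim().toLowerCase() !== UNAVAILABLE_AFTER;
}

/** Returns true if any comma-separated directive blocks indexing. */
function hasBlockingDirective(value) {
  return value.split(',')
    .map(d => d.trim().toLowerCase())
    .some(d => BLOCKING_DIRECTIVES.has(d));
}

// Hypothetical main-resource headers.
const responseHeaders = [
  {name: 'X-Robots-Tag', value: 'googlebot: noindex'},
  {name: 'x-robots-tag', value: 'noindex, nofollow'},
  {name: 'content-type', value: 'text/html'},
];

// Keep only robots headers that are not UA-specific and that block indexing.
const blocking = responseHeaders.filter(h =>
  h.name.toLowerCase() === ROBOTS_HEADER &&
  !hasUserAgent(h.value) &&
  hasBlockingDirective(h.value));

console.log(blocking.map(h => h.value)); // [ 'noindex, nofollow' ]
```

The googlebot-scoped header is skipped even though it says noindex, which matches the "we don't support UA-specific headers" behavior described above.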
@patrickhulce Thank you for the review! With all the tests I'm pretty sure it's working as intended, but I made the […] We want to ignore all user agent specific tags (we are only looking for […]). I left an additional comment. Please let me know if that makes sense to you.
ah ok makes a lot more sense now sorry for my confusion :)
}

/**
 * Returns false if robots header specifies user agent (e.g. `googlebot: noindex`)
Ah, I see. Is this comment a typo then? Returns *true* if robots header specifies a user agent?
Right, I forgot to update both comments. Thanks!
}

/**
 * Returns false if any of provided directives blocks page from being indexed
same here, doesn't this return true when any of the directives blocks the page from being indexed?
it('ignores UA specific directives', () => {
  const mainResource = {
    responseHeaders: [
      {name: 'x-robots-tag', value: 'googlebot: unavailable_after: 25 Jun 2007 15:00:00 PST'},
ah ok, I was confused about how multiple user agent + a default would be expressed. I didn't realize it'd be duplicate headers rather than a csv
would you mind adding a default value here that's valid just for future readers
i.e.
responseHeaders: [
{name: 'x-robots-tag', value: 'googlebot: unavailable_after: 25 Jun 2007 15:00:00 PST'},
{name: 'x-robots-tag', value: 'unavailable_after: 25 Jun 2027 15:00:00 PST'},
]
just one suggestion, otherwise LGTM 👍
 * @returns {boolean}
 */
function isUnavailable(directive) {
  const parts = directive.split(':');
Sometimes I find it easier to use array destructuring in cases like this:
const [key, value] = directive.split(':');
Yeah, I agree that'd be much more elegant, but in this case it won't work:
const [key, value] = 'unavailable_after: 12 Jun 2017 12:30:00'.split(':');
value in this case would be 12 Jun 2017 12 instead of 12 Jun 2017 12:30:00. I could do:
const [key, ...value] = 'unavailable_after: 12 Jun 2017 12:30:00'.split(':');
But then value is an array and I still have to .join(':') it, so it's not much different from the current solution :(
Would split(':', 1) resolve your concern?
TIL about the second parameter of .split!
This gives me access to unavailable_after but doesn't give me the date:
const [key, value] = 'unavailable_after: 12 Jun 2017 12:30:00'.split(':', 1);
value will be empty. Am I missing something here? 🤔
Yeah you're right, it doesn't do what I thought it would. Carry on!
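For future readers, the three variants discussed in this thread can be compared side by side. The sketch below uses the directive string from the comments above; the indexOf-based variant at the end is not something proposed in the thread, only one possible alternative that avoids rejoining.

```javascript
const directive = 'unavailable_after: 12 Jun 2017 12:30:00';

// A limit of 1 caps the result at one element, so the second
// destructured name is undefined.
const [limitedKey, limitedValue] = directive.split(':', 1);

// Rest destructuring collects the remaining pieces as an array,
// which still has to be joined back together.
const [key, ...rest] = directive.split(':');
const value = rest.join(':').trim();

// Alternative (not from the thread): cut once at the first colon.
const idx = directive.indexOf(':');
const keyByIndex = directive.slice(0, idx);
const valueByIndex = directive.slice(idx + 1).trim();

console.log(limitedKey, limitedValue); // unavailable_after undefined
console.log(value);                    // 12 Jun 2017 12:30:00
console.log(valueByIndex);             // 12 Jun 2017 12:30:00
```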
@patrickhulce I've addressed your comments 👍 PTAL
5bb9d4f to d1c2667
LGTM!
static get meta() {
  return {
    name: 'is-crawlable',
    description: 'Page isn’t blocked from indexing',
I'd prefer to word this more affirmatively (i.e. "Page can be indexed" or "Page is indexable"), but I'm guessing we can't because it'd be misleading and there are many other ways a page could be prevented from indexing?
@brendankenny there are smokehouse server changes FYI if you wanted to review :)
whoops, hit submit too soon. Just looking at the server changes, a suggestion and a request :)
extraHeaders = Array.isArray(extraHeaders) ? extraHeaders : [extraHeaders];

extraHeaders.forEach(header => {
  const parts = header.split(':');
I actually would prefer const [key, ...value], but that's just down to preference.
You could also do header.split(/:(.+)/); which should give the correct split (captured groups also appear in the resulting array)
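A quick sketch of what the regex split suggested above returns, using a hypothetical header string. Because the captured group is spliced into the result and the match runs to the end of the string, the array ends with an empty string, and a header with nothing after the colon does not match at all:

```javascript
// Hypothetical extra-header string in "name: value" form.
const header = 'x-robots-tag: unavailable_after: 25 Jun 2027 15:00:00 PST';

// The regex matches the first colon and captures everything after it;
// the captured group becomes the second element of the result.
const pieces = header.split(/:(.+)/);
const [name, rawValue] = pieces;

console.log(pieces.length);   // 3 (the last element is an empty string)
console.log(name);            // x-robots-tag
console.log(rawValue.trim()); // unavailable_after: 25 Jun 2027 15:00:00 PST

// Edge case: with nothing after the colon, (.+) cannot match,
// so the string comes back unsplit.
const noValue = 'x-robots-tag:'.split(/:(.+)/);
console.log(noValue.length);  // 1
```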
extraHeaders.forEach(header => {
  const parts = header.split(':');
  headers[parts[0]] = parts.slice(1).join(':');
this might be complete overkill, but can we make a set of allowed headers and only add to headers if found in there? We block hidden files and anything outside of the working directory, but you never know...
One can't be too careful!
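One way the suggested safelist could look, as a sketch only: the set contents, the name HEADER_SAFELIST, and the wrapper function parseExtraHeaders are hypothetical, since the header names the PR ended up allowing are not shown in this thread.

```javascript
// Hypothetical safelist; the headers actually allowed in the PR are not
// shown in this thread.
const HEADER_SAFELIST = new Set(['x-robots-tag']);

/**
 * Turns "name: value" strings into a headers object, silently dropping
 * any header whose name is not on the safelist.
 */
function parseExtraHeaders(extraHeaders) {
  const list = Array.isArray(extraHeaders) ? extraHeaders : [extraHeaders];
  const headers = {};

  list.forEach(header => {
    const parts = header.split(':');
    const name = parts[0].trim().toLowerCase();
    if (!HEADER_SAFELIST.has(name)) return;
    // Rejoin so that colons inside the value (e.g. in dates) survive.
    headers[name] = parts.slice(1).join(':').trim();
  });

  return headers;
}

console.log(parseExtraHeaders(['x-robots-tag: noindex', 'set-cookie: a=b']));
// { 'x-robots-tag': 'noindex' }
```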
@brendankenny header safelist added PTAL
thanks for your patience! LGTM
📃 🚫 🤖 🚼
@brendankenny thanks for merging 🙌
Closes #3182
Failing:
Passing: