Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new_audit(blocked-from-indexing): page is blocked from indexing #3657

Merged
merged 9 commits into from
Nov 8, 2017

Conversation

kdzwinel
Copy link
Collaborator

Closes #3182

Failing:
screen shot 2017-10-26 at 02 40 53

Passing:
screen shot 2017-10-26 at 02 42 27

Copy link
Collaborator

@patrickhulce patrickhulce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I reviewed too early 😄

seems like there's some weirdness around the user agent bit?

return false;
}

const date = Date.parse(parts[1]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like we would want to do a parts.slice(1).join(':'), no? losing the time and timezone

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good catch! Tests didn't catch that because Date.parse is very forgiving.

*/
function hasUA(directives) {
const parts = directives.split(':');
return parts.length > 1 && parts[0] !== UNAVAILABLE_AFTER;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be looking for a GOOGLEBOT_USER_AGENT const maybe?

let's also rename this then, if it returns false when a UA is specified how about hasNoUserAgent or something

}

mainResource.responseHeaders
.filter(h => h.name.toLowerCase() === ROBOTS_HEADER && !hasUA(h.value) &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so to be clear, we're looking for the robots header that has a user agent specified and is specified to block?

seems like this might miss a few cases maybe I'm misunderstanding

can there not be multiple directives in the header?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are looking for robots header that doesn't have UA specified (we don't support UA specific headers) and is blocking indexing

@kdzwinel
Copy link
Collaborator Author

kdzwinel commented Oct 26, 2017

@patrickhulce Thank you for a review! With all the tests I'm pretty sure it's working as intended, but I made the UA part very confusing - sorry for that.

We want to ignore all user agent specific tags (we are only looking for <meta type="robots") and headers (all directives prefixed with somebot:). There is one edge case for that last one though unavailable_after: can be a first directive and may be confused for a bot name, that's why I have !== UNAVAILABLE_AFTER check.

I left additional comment. Please let me know if that makes sense to you.

Copy link
Collaborator

@patrickhulce patrickhulce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok makes a lot more sense now sorry for my confusion :)

}

/**
* Returns false if robots header specifies user agent (e.g. `googlebot: noindex`)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see is this comment a typo then? Returns *true* if robots header specifies a user agent?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I forgot to update both comments. Thanks!

}

/**
* Returns false if any of provided directives blocks page from being indexed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, doesn't this return true when any of the directives blocks the page from being indexed?

it('ignores UA specific directives', () => {
const mainResource = {
responseHeaders: [
{name: 'x-robots-tag', value: 'googlebot: unavailable_after: 25 Jun 2007 15:00:00 PST'},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok, I was confused about how multiple user agent + a default would be expressed. I didn't realize it'd be duplicate headers rather than a csv

would you mind adding a default value here that's valid just for future readers
i.e.

responseHeaders: [
  {name: 'x-robots-tag', value: 'googlebot: unavailable_after: 25 Jun 2007 15:00:00 PST'},
  {name: 'x-robots-tag', value: 'unavailable_after: 25 Jun 2027 15:00:00 PST'},
]

Copy link
Member

@rviscomi rviscomi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just one suggestion, otherwise LGTM 👍

* @returns {boolean}
*/
function isUnavailable(directive) {
const parts = directive.split(':');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes I find it easier to use array deconstruction in cases like this:

const [key, value] = directive.split(':');

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree that'd be much more elegant, but in this case it won't work:

const [key, value] = 'unavailable_after: 12 Jun 2017 12:30:00'.split(':');

value in this case would be 12 Jun 2017 12 instead of 12 Jun 2017 12:30:00. I could do:

const [key, ...value] = 'unavailable_after: 12 Jun 2017 12:30:00'.split(':');

But then, value is an array and I still have to .join(':') it, so not much different from a current solution :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would split(':', 1) resolve your concern?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL about second parameter of .split!
This gives me access to unavailable_after but doesn't give me the date:

const [key, value] = 'unavailable_after: 12 Jun 2017 12:30:00'.split(':', 1);

value will be empty. Am I missing something here? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you're right, it doesn't do what I thought it would. Carry on!

@kdzwinel
Copy link
Collaborator Author

@patrickhulce I've addressed your comments 👍 PTAL

Copy link
Collaborator

@patrickhulce patrickhulce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

static get meta() {
return {
name: 'is-crawlable',
description: 'Page isn’t blocked from indexing',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to word this more affirmatively (i.e. Page can be indexed or Page is indexable), but I'm guessing we can't because it'd be misleading and there are many other ways a page could be prevented from indexing?

@patrickhulce
Copy link
Collaborator

@brendankenny there are smokehouse server changes FYI if you wanted to review :)

Copy link
Member

@brendankenny brendankenny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, hit submit too soon. Just looking at the server changes, a suggestion and a request :)

extraHeaders = Array.isArray(extraHeaders) ? extraHeaders : [extraHeaders];

extraHeaders.forEach(header => {
const parts = header.split(':');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually would prefer const [key, ...value], but that's just down to preference.

You could also do header.split(/:(.+)/);, which should give the correct split (captured groups also appear in the resulting array)


extraHeaders.forEach(header => {
const parts = header.split(':');
headers[parts[0]] = parts.slice(1).join(':');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be complete overkill, but can we make a set of allowed headers and only add to headers if found in there? We block hidden files and anything outside of the working directory, but you never know...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One can't be too careful!

@kdzwinel
Copy link
Collaborator Author

@brendankenny header safelist added PTAL

Copy link
Member

@brendankenny brendankenny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your patience! LGTM
📃 🚫 🤖 🚼

@brendankenny brendankenny merged commit fb2cb02 into GoogleChrome:master Nov 8, 2017
@kdzwinel
Copy link
Collaborator Author

kdzwinel commented Nov 8, 2017

@brendankenny thanks for merging 🙌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants