new_audit(blocked-from-indexing): page is blocked from indexing #3657

kdzwinel · 2017-10-26T01:00:09Z

Closes #3182

Failing:

Passing:

patrickhulce

maybe I reviewed too early 😄

seems like there's some weirdness around the user agent bit?

patrickhulce · 2017-10-26T01:37:27Z

lighthouse-core/audits/seo/is-crawlable.js

+    return false;
+  }
+
+  const date = Date.parse(parts[1]);


seems like we would want to do a parts.slice(1).join(':'), no? Otherwise we're losing the time and timezone.

Oh, good catch! Tests didn't catch that because Date.parse is very forgiving.
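
To illustrate the point above, here is a minimal sketch of why `parts[1]` drops the time and how rejoining the remaining parts keeps it. The helper name `parseUnavailableAfter` is hypothetical and not taken from the PR; only `UNAVAILABLE_AFTER` and the `slice(1).join(':')` idea come from the review.

```javascript
// Constant name mirrors the one mentioned in the review; the value is an assumption.
const UNAVAILABLE_AFTER = 'unavailable_after';

const directive = 'unavailable_after: 25 Jun 2007 15:00:00 GMT';
const parts = directive.split(':');

// Splitting on every colon also cuts the time apart.
const truncated = parts[1]; // ' 25 Jun 2007 15'
// Rejoining everything after the first colon restores it.
const full = parts.slice(1).join(':'); // ' 25 Jun 2007 15:00:00 GMT'

// Hypothetical helper: returns a timestamp, or NaN if this is not an
// unavailable_after directive or the date cannot be parsed.
function parseUnavailableAfter(value) {
  const pieces = value.split(':');
  if (pieces.length <= 1 || pieces[0].trim().toLowerCase() !== UNAVAILABLE_AFTER) {
    return NaN;
  }
  return Date.parse(pieces.slice(1).join(':').trim());
}

console.log(JSON.stringify(truncated), JSON.stringify(full));
console.log(parseUnavailableAfter(directive));
```

Because `Date.parse` accepts the truncated string as well, a test that only checks for a valid date would not notice the lost time, which matches the observation in the reply.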

patrickhulce · 2017-10-26T01:39:21Z

lighthouse-core/audits/seo/is-crawlable.js

+ */
+function hasUA(directives) {
+  const parts = directives.split(':');
+  return parts.length > 1 && parts[0] !== UNAVAILABLE_AFTER;


should it be looking for a GOOGLEBOT_USER_AGENT const maybe?

let's also rename this then, if it returns false when a UA is specified how about hasNoUserAgent or something

patrickhulce · 2017-10-26T01:48:06Z

lighthouse-core/audits/seo/is-crawlable.js

+        }
+
+        mainResource.responseHeaders
+          .filter(h => h.name.toLowerCase() === ROBOTS_HEADER && !hasUA(h.value) &&


so to be clear, we're looking for the robots header that has a user agent specified and is specified to block?

seems like this might miss a few cases, or maybe I'm misunderstanding

can there not be multiple directives in the header?

we are looking for robots header that doesn't have UA specified (we don't support UA specific headers) and is blocking indexing

kdzwinel · 2017-10-26T10:33:31Z

@patrickhulce Thank you for the review! With all the tests I'm pretty sure it's working as intended, but I made the UA part very confusing - sorry for that.

We want to ignore all user-agent-specific tags (we are only looking for <meta name="robots">) and headers (all directives prefixed with somebot:). There is one edge case for the latter, though: unavailable_after: can be the first directive and may be confused with a bot name, which is why I have the !== UNAVAILABLE_AFTER check.

I left an additional comment. Please let me know if that makes sense to you.
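
The check described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: the name `hasUserAgent` is hypothetical (the diff calls it `hasUA`), and the trimming and lowercasing are additions for robustness.

```javascript
// Constant name mirrors the one in the diff; the value is an assumption.
const UNAVAILABLE_AFTER = 'unavailable_after';

// Returns true when the header value is prefixed with a bot name,
// e.g. `googlebot: noindex`. A leading `unavailable_after:` also contains
// a colon, so it has to be excluded explicitly.
function hasUserAgent(headerValue) {
  const parts = headerValue.split(':');
  return parts.length > 1 && parts[0].trim().toLowerCase() !== UNAVAILABLE_AFTER;
}

console.log(hasUserAgent('googlebot: noindex'));
console.log(hasUserAgent('noindex, nofollow'));
console.log(hasUserAgent('unavailable_after: 25 Jun 2007 15:00:00 PST'));
```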

patrickhulce

ah ok makes a lot more sense now sorry for my confusion :)

patrickhulce · 2017-10-26T17:06:03Z

lighthouse-core/audits/seo/is-crawlable.js

+}
+
+/**
+ * Returns false if robots header specifies user agent (e.g. `googlebot: noindex`)


Ah, I see. Is this comment a typo then? Returns *true* if robots header specifies a user agent?

Right, I forgot to update both comments. Thanks!

patrickhulce · 2017-10-26T17:12:12Z

lighthouse-core/audits/seo/is-crawlable.js

+}
+
+/**
+ * Returns false if any of provided directives blocks page from being indexed


same here, doesn't this return true when any of the directives blocks the page from being indexed?

patrickhulce · 2017-10-26T17:15:52Z

lighthouse-core/test/audits/seo/is-crawlable-test.js

+  it('ignores UA specific directives', () => {
+    const mainResource = {
+      responseHeaders: [
+        {name: 'x-robots-tag', value: 'googlebot: unavailable_after: 25 Jun 2007 15:00:00 PST'},


ah ok, I was confused about how multiple user agent + a default would be expressed. I didn't realize it'd be duplicate headers rather than a csv

would you mind adding a default value here that's valid just for future readers
i.e.

responseHeaders: [
  {name: 'x-robots-tag', value: 'googlebot: unavailable_after: 25 Jun 2007 15:00:00 PST'},
  {name: 'x-robots-tag', value: 'unavailable_after: 25 Jun 2027 15:00:00 PST'},
]
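
As a rough sketch of how duplicate headers like these might be evaluated: bot-specific headers are skipped, and each remaining header is checked directive by directive. Everything here is an assumption for illustration. `blockingHeaders`, `isBlockingDirective`, and the `BLOCKING_DIRECTIVES` set (`noindex`, `none`) are hypothetical names and values, not the audit's actual implementation.

```javascript
const ROBOTS_HEADER = 'x-robots-tag';
const UNAVAILABLE_AFTER = 'unavailable_after';
// Assumed set of directives that block indexing outright.
const BLOCKING_DIRECTIVES = new Set(['noindex', 'none']);

// True when the header value is prefixed with a bot name (e.g. `googlebot: ...`).
function hasUserAgent(value) {
  const parts = value.split(':');
  return parts.length > 1 && parts[0].trim().toLowerCase() !== UNAVAILABLE_AFTER;
}

// True when a single directive blocks indexing at time `now` (ms since epoch).
function isBlockingDirective(directive, now) {
  const [rawKey, ...rest] = directive.split(':');
  const key = rawKey.trim().toLowerCase();
  if (rest.length === 0) return BLOCKING_DIRECTIVES.has(key);
  if (key !== UNAVAILABLE_AFTER) return false;
  const date = Date.parse(rest.join(':').trim());
  return !Number.isNaN(date) && date < now;
}

// Returns the non-bot-specific robots headers that block indexing.
// Note: splitting on commas would break date formats that contain commas.
function blockingHeaders(responseHeaders, now = Date.now()) {
  return responseHeaders
    .filter(h => h.name.toLowerCase() === ROBOTS_HEADER && !hasUserAgent(h.value))
    .filter(h => h.value.split(',').some(d => isBlockingDirective(d, now)));
}

const responseHeaders = [
  {name: 'x-robots-tag', value: 'googlebot: unavailable_after: 25 Jun 2007 15:00:00 PST'},
  {name: 'x-robots-tag', value: 'unavailable_after: 25 Jun 2027 15:00:00 PST'},
];
const reviewDate = Date.UTC(2017, 9, 26);
console.log(blockingHeaders(responseHeaders, reviewDate).length);
```

With the two headers from the suggestion, the googlebot header is ignored and the default header's date lies after the reference date, so nothing is reported as blocking.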

rviscomi

just one suggestion, otherwise LGTM 👍

rviscomi · 2017-10-26T19:57:36Z

lighthouse-core/audits/seo/is-crawlable.js

+ * @returns {boolean}
+ */
+function isUnavailable(directive) {
+  const parts = directive.split(':');


Sometimes I find it easier to use array destructuring in cases like this:

const [key, value] = directive.split(':');

Yeah, I agree that'd be much more elegant, but in this case it won't work:

const [key, value] = 'unavailable_after: 12 Jun 2017 12:30:00'.split(':');

value in this case would be 12 Jun 2017 12 instead of 12 Jun 2017 12:30:00. I could do:

const [key, ...value] = 'unavailable_after: 12 Jun 2017 12:30:00'.split(':');

But then, value is an array and I still have to .join(':') it, so it's not much different from the current solution :(

Would split(':', 1) resolve your concern?

TIL about the second parameter of .split!
This gives me access to unavailable_after but doesn't give me the date:

const [key, value] = 'unavailable_after: 12 Jun 2017 12:30:00'.split(':', 1);

value will be undefined. Am I missing something here? 🤔

Yeah you're right, it doesn't do what I thought it would. Carry on!
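
For reference, the variants discussed in this thread behave as follows. This is only a side-by-side sketch; the `indexOf`-based version at the end is an extra alternative not proposed in the thread, and all variable names are illustrative.

```javascript
const directive = 'unavailable_after: 12 Jun 2017 12:30:00';

// Plain destructuring: the value is cut at the second colon.
const [keyA, valueA] = directive.split(':');
// Rest destructuring: the pieces still have to be joined back together.
const [keyB, ...restB] = directive.split(':');
const valueB = restB.join(':');
// A limit of 1 keeps only the first piece, so there is no value at all.
const [keyC, valueC] = directive.split(':', 1);

// Alternative (not from the thread): cut once at the first colon.
const firstColon = directive.indexOf(':');
const keyD = directive.slice(0, firstColon);
const valueD = directive.slice(firstColon + 1).trim();

console.log({valueA, valueB, valueC, valueD});
```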

kdzwinel · 2017-10-28T19:15:52Z

@patrickhulce I've addressed your comments 👍 PTAL

patrickhulce

LGTM!

patrickhulce · 2017-10-30T16:33:06Z

lighthouse-core/audits/seo/is-crawlable.js

+  static get meta() {
+    return {
+      name: 'is-crawlable',
+      description: 'Page isn’t blocked from indexing',


I'd prefer to word this more affirmatively (i.e. Page can be indexed or Page is indexable), but I'm guessing we can't because it'd be misleading and there are many other ways a page could be prevented from indexing?

patrickhulce · 2017-10-30T16:34:13Z

@brendankenny there are smokehouse server changes FYI if you wanted to review :)

brendankenny

whoops, hit submit too soon. Just looking at the server changes, a suggestion and a request :)

brendankenny · 2017-10-30T23:35:15Z

lighthouse-cli/test/fixtures/static-server.js

+        extraHeaders = Array.isArray(extraHeaders) ? extraHeaders : [extraHeaders];
+
+        extraHeaders.forEach(header => {
+          const parts = header.split(':');


I actually would prefer const [key, ...value], but that's just down to preference.

You could also do header.split(/:(.+)/);, which should give the correct split (captured groups also appear in the resulting array)
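
A quick sketch of the regex variant, to show what the resulting array looks like. The wrapper `splitHeader` is a hypothetical helper for illustration only. Note that the array ends with an empty string after the captured group, which destructuring simply ignores.

```javascript
const header = 'x-robots-tag: unavailable_after: 25 Jun 2007 15:00:00 PST';

// The regex matches the first colon plus everything after it; the captured
// group (the value) is included in the result, followed by an empty string.
const pieces = header.split(/:(.+)/);

// Hypothetical helper built on top of that split.
function splitHeader(rawHeader) {
  const [name, value] = rawHeader.split(/:(.+)/);
  return {name: name.trim(), value: value === undefined ? '' : value.trim()};
}

console.log(pieces.length);
console.log(splitHeader(header));
```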

brendankenny · 2017-10-30T23:36:35Z

lighthouse-cli/test/fixtures/static-server.js

+
+        extraHeaders.forEach(header => {
+          const parts = header.split(':');
+          headers[parts[0]] = parts.slice(1).join(':');


this might be complete overkill, but can we make a set of allowed headers and only add to headers if found in there? We block hidden files and anything outside of the working directory, but you never know...

One can't be too careful!
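
One way the requested safelist could look, as a minimal sketch. The set contents, the name `HEADER_SAFELIST`, and the function `applyExtraHeaders` are assumptions made for illustration; the actual list added to static-server.js is not shown in this conversation.

```javascript
// Hypothetical safelist; the real one may contain different header names.
const HEADER_SAFELIST = new Set(['x-robots-tag', 'link']);

// Copies `name: value` strings into `headers`, skipping any header whose
// name is not on the safelist.
function applyExtraHeaders(headers, extraHeaders) {
  const list = Array.isArray(extraHeaders) ? extraHeaders : [extraHeaders];
  list.forEach(header => {
    const [name, ...valueParts] = header.split(':');
    const trimmedName = name.trim();
    if (!HEADER_SAFELIST.has(trimmedName.toLowerCase())) {
      return;
    }
    headers[trimmedName] = valueParts.join(':').trim();
  });
  return headers;
}

console.log(applyExtraHeaders({}, ['Set-Cookie: a=b', 'X-Robots-Tag: none']));
```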

kdzwinel · 2017-10-31T19:58:39Z

@brendankenny header safelist added PTAL

brendankenny

thanks for your patience! LGTM
📃 🚫 🤖 🚼

kdzwinel · 2017-11-08T18:32:28Z

@brendankenny thanks for merging 🙌

kdzwinel requested review from brendankenny, patrickhulce and paulirish as code owners October 26, 2017 01:00

patrickhulce reviewed Oct 26, 2017

rviscomi reviewed Oct 26, 2017

patrickhulce added the waiting4committer label Oct 27, 2017

kdzwinel added 7 commits October 29, 2017 20:23

WIP

1336519

All w/o unavailable_after support

8274b33

unavailable_after support, more tests

5de29e8

Better smoke test, changes to the static-server

4deffb8

Added missing SEO audits to default config

0a85233

Fix date parsing, improve UA detection, add more tests

e46bbb6

Fix comments and add additional explanation in one of the tests

d1c2667

kdzwinel force-pushed the seo-blocked-indexing branch from 5bb9d4f to d1c2667 Compare October 29, 2017 19:28

patrickhulce added waiting4reviewer and removed waiting4committer labels Oct 30, 2017

patrickhulce approved these changes Oct 30, 2017

patrickhulce assigned brendankenny Oct 30, 2017

brendankenny reviewed Oct 30, 2017

brendankenny added waiting4committer and removed waiting4reviewer labels Oct 31, 2017

Adding a header safelist to the static-server.

77470fd

Make search for robots metatag case-insensitive

f3ca775

simha24 approved these changes Nov 5, 2017

devtools-bot added the waiting4reviewer label Nov 6, 2017

devtools-bot removed the waiting4committer label Nov 6, 2017

kdzwinel mentioned this pull request Nov 7, 2017

Fork the Chrome extension to demo SEO audits #3656

Closed

brendankenny approved these changes Nov 8, 2017

brendankenny merged commit fb2cb02 into GoogleChrome:master Nov 8, 2017

kdzwinel deleted the seo-blocked-indexing branch November 8, 2017 18:32

christhompson pushed a commit to christhompson/lighthouse that referenced this pull request Nov 28, 2017

new_audit(blocked-from-indexing): page is blocked from indexing (Goog…

d4d01dd

…leChrome#3657)

dependencies bot mentioned this pull request Dec 17, 2017

Update lighthouse in / from 2.5.0 to 2.7.0 chauncey-garrett/dotfiles#57

Open

paulirish removed the waiting4reviewer label Mar 6, 2018
