
Implement X-Robots-Tag headers (avoid dev/stage websites "leaking" into SERPs) #804

Closed
maniqui opened this issue Oct 10, 2011 · 12 comments

maniqui commented Oct 10, 2011

Short version: for public websites under development, disallowing crawlers in robots.txt isn't enough.
See:
http://yoast.com/prevent-site-being-indexed/
http://code.google.com/web/controlcrawlindex/docs/robots_meta_tag.html
Matt Cutts (from Google Webmaster Central) on this topic:
http://www.youtube.com/watch?v=KBdEwpRQRD0

Long version:

Yes, compliant robots won't crawl it, but that doesn't necessarily mean that the website (well, its URL) won't get indexed.
Matt Cutts explains it fairly well in the linked video, but here's the short version: if your website gets linked from somewhere else (other websites?) out there, Google may then list your website's URL in SERPs, even if Google respects your robots.txt.

At least once or twice I've seen development websites get "leaked" into Google's SERPs (probably my own fault for being careless and not doing the robots.txt disallow dance).
Why? I can't say exactly.
Maybe I, another developer, or the client "talked" about the dev website (dev.example.com) in some email exchange via GMail, and thus the Big Google Brother "got to know" about the existence of dev.example.com...

So, once your development website gets leaked into SERPs, you probably want to remove it from there.
Doing the robots.txt disallow dance may not help at that point, as your website has already leaked.
Thus, X-Robots-Tag headers may be a good approach to solve the situation, IMO.
What's more, this approach should not be used in tandem with disallowing crawlers via robots.txt: if crawling is disallowed, the headers will never be discovered.

From Google's docs on Control Crawling and Indexing:

Combining crawling with indexing / serving directives
Robots meta tags and X-Robots-Tag HTTP headers are discovered when a URL is crawled. If a page is disallowed from crawling through the robots.txt file, then any information about indexing or serving directives will not be found and will therefore be ignored. If indexing or serving directives must be followed, the URLs containing those directives cannot be disallowed from crawling.

So, here is the proposed snippet to add to .htaccess.
Please feel free to improve the wording I've used to explain this.
In any case, if it doesn't fit the goals/philosophy of H5BP, I hope others can still benefit from this knowledge.

# -----------------------------------------------------------------------------------------
# Disable URL indexing by crawlers (FOR DEVELOPMENT/STAGE)
# -----------------------------------------------------------------------------------------

# Avoid search engines (Google, Yahoo, etc.) indexing the website's content
# http://yoast.com/prevent-site-being-indexed/
# http://code.google.com/web/controlcrawlindex/docs/robots_meta_tag.html
# Matt Cutts (from Google Webmaster Central) on this topic:
# http://www.youtube.com/watch?v=KBdEwpRQRD0

# IMPORTANT: serving these headers is recommended only for
# development/stage websites (or for live websites that don't
# want to be indexed). This will prevent the website from
# being indexed in SERPs (search engine result pages).
# This is a better approach than using robots.txt
# to disallow robots from crawling your website,
# because disallowing crawling doesn't necessarily
# mean that your website won't get indexed (see the links above).

# <IfModule mod_headers.c>
#   Header set X-Robots-Tag "noindex, nofollow, noarchive"
#   <FilesMatch "\.(doc|pdf|png|jpe?g|gif)$">
#     Header set X-Robots-Tag "noindex, noarchive, nosnippet"
#   </FilesMatch>
# </IfModule>
@chuanxshi
Member

It may not be needed in most development cases, but it would be good if this snippet could be added to the docs or wiki as a 'nice to have' for some use cases.

@nimbupani
Member

@maniqui could you add this to the wiki in the make it better section?


maniqui commented Nov 7, 2011

@nimbupani Done. Under the .htaccess section, I've added a link to this issue. Now I'll close it.

maniqui closed this as completed on Nov 7, 2011
@nimbupani
Member

Thanks Julián!


tanwill commented Jan 30, 2017

<FilesMatch ".(doc|pdf|png|jpe?g|gif)$">

Sorry, how does this discriminate between development and live sites?

I am looking for a way to discriminate between the two using a conditional statement, so that the X-Robots-Tag header doesn't have to be manually taken out of the .htaccess file when deploying to live.

Will this work?

@roblarsen
Member

@tanwill The snippet you quote doesn't discriminate between development and live sites.

The FilesMatch directive limits the scope of the enclosed directives to files that match the pattern. The outer Header directive sets the header for every response, and the enclosed FilesMatch block then sets a different header for the subset of files that match.

Also, off the top of my head I don't think there's any way to do what you want to do. But, I could be wrong. Stack Overflow might be a good place to search to see if there's some way to do that.
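
For illustration, here is a restatement of the snippet quoted above with comments marking what each directive applies to (nothing here is new beyond the comments):

<IfModule mod_headers.c>
  # Applies to every response served under this .htaccess
  Header set X-Robots-Tag "noindex, nofollow, noarchive"
  # For files matching the pattern, this directive overrides the value above
  <FilesMatch "\.(doc|pdf|png|jpe?g|gif)$">
    Header set X-Robots-Tag "noindex, noarchive, nosnippet"
  </FilesMatch>
</IfModule>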


tanwill commented Jan 30, 2017

Thank you, @roblarsen.


tanwill commented Jan 30, 2017

I just found a way to do this, @roblarsen.

Instead of editing the .htaccess file, just edit header.php. Put the conditional there:

<?php
    // 'DEV SITE' and 'LIVE SITE' are placeholders for the actual
    // hostnames, e.g. 'dev.example.com' and 'www.example.com'.
    if ($_SERVER['SERVER_NAME'] === 'DEV SITE') {
        header('X-Robots-Tag: noindex, nofollow');
    } elseif ($_SERVER['SERVER_NAME'] === 'LIVE SITE') {
        header('X-Robots-Tag: index, archive');
    }
?>
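
One caveat with this approach: PHP's header() only affects responses generated by PHP, so static assets (images, PDFs, etc.) served directly by the web server won't carry the X-Robots-Tag header.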

@roblarsen
Member

Ah, yeah, of course you can do that sort of logic in PHP. I was referring to doing it in this context, in .htaccess. Glad to see that you got your problem sorted out.


tanwill commented Jan 30, 2017

Thanks. Yeah, it's a workaround. I think you're right in that there may not be a non-convoluted way to do this in .htaccess.


agrohs commented May 20, 2018

I would tweak the regex ever so slightly to capture a few additional file types (.woff/.woff2/.ttf for fonts; .ppt/.pptx, .xls/.xlsx, .doc/.docx, .dot/.dotx for documents; and .htm as well as .html) and use the following pattern in the FilesMatch (see the sketch below):

woff2?|ttf|xlsx?|pptx?|do(c|t)x?|svg|xml|css|js|php|html?|pdf|png|jpe?g|gif
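
A sketch of how the proposed snippet might look with that extended pattern substituted into the FilesMatch (the pattern is used exactly as given above, including the extensions not listed in the prose such as svg, xml, css, js, and php):

<IfModule mod_headers.c>
  Header set X-Robots-Tag "noindex, nofollow, noarchive"
  <FilesMatch "\.(woff2?|ttf|xlsx?|pptx?|do(c|t)x?|svg|xml|css|js|php|html?|pdf|png|jpe?g|gif)$">
    Header set X-Robots-Tag "noindex, noarchive, nosnippet"
  </FilesMatch>
</IfModule>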


jonom commented Feb 14, 2020

You could try something like this for a .htaccess-only solution, adjusting the regex to suit your needs. This example would block indexing for site.com and beta.site.com but allow indexing of www.site.com.

# Only allow crawling of 'www.' urls
SetEnvIfNoCase Host "^www" crawlable
Header set X-Robots-Tag "noindex, nofollow" env=!crawlable
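
If you want this to fail gracefully when the required modules aren't loaded (SetEnvIfNoCase comes from mod_setenvif and Header from mod_headers), a guarded version might look like this:

<IfModule mod_setenvif.c>
  <IfModule mod_headers.c>
    # Only allow crawling of 'www.' URLs; everything else gets noindex
    SetEnvIfNoCase Host "^www" crawlable
    Header set X-Robots-Tag "noindex, nofollow" env=!crawlable
  </IfModule>
</IfModule>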
