
Implement X-Robots-Tag headers (avoid dev/stage websites "leaking" into SERPs) #804

Closed
maniqui opened this issue Oct 10, 2011 · 12 comments

maniqui commented Oct 10, 2011

Short version: for public websites under development, disallowing crawlers in robots.txt isn't enough.
See:
http://yoast.com/prevent-site-being-indexed/
http://code.google.com/web/controlcrawlindex/docs/robots_meta_tag.html
Matt Cutts (from Google Webmaster Central) on this topic:
http://www.youtube.com/watch?v=KBdEwpRQRD0

Long version:

Yes, compliant robots won't crawl it, but that doesn't necessarily mean that the website (well, its URL) won't get indexed.
Matt Cutts explains it fairly well in the linked video, but here's the short version: if your website gets linked from somewhere else (other websites?) out there, Google may then list your website's URL in SERPs, even if Google respects your robots.txt.

At least once or twice I've seen development websites get "leaked" into Google's SERPs (probably my own fault for being careless and not doing the robots.txt disallow dance).
Why? I can't say exactly.
Maybe I, another developer, or the client "talked" about the dev website (dev.example.com) in some email exchange via GMail, and thus the Big Google Brother "got to know" about the existence of dev.example.com...

So, once your development website gets leaked into SERPs, you probably want to remove it from there.
Doing the robots.txt disallow dance may not help at that point, as your website has already leaked.
Thus, X-Robots-Tag headers may be a good approach to solve the situation, IMO.
What's more, this approach should not be used in tandem with disallowing crawlers via robots.txt: if crawling is disallowed, the headers will never be discovered.

From Google's docs on Control Crawling and Indexing:

Combining crawling with indexing / serving directives
Robots meta tags and X-Robots-Tag HTTP headers are discovered when a URL is crawled. If a page is disallowed from crawling through the robots.txt file, then any information about indexing or serving directives will not be found and will therefore be ignored. If indexing or serving directives must be followed, the URLs containing those directives cannot be disallowed from crawling.

So, here is the proposed snippet to add to .htaccess.
Please feel free to improve the wording I've used to explain this.
In any case, if it doesn't fit the goals/philosophy of H5BP, I hope others can still benefit from this knowledge.

# -----------------------------------------------------------------------------------------
# Disable URL indexing by crawlers (FOR DEVELOPMENT/STAGE)
# -----------------------------------------------------------------------------------------

# Avoid search engines (Google, Yahoo, etc.) indexing the website's content
# http://yoast.com/prevent-site-being-indexed/
# http://code.google.com/web/controlcrawlindex/docs/robots_meta_tag.html
# Matt Cutts (from Google Webmaster Central) on this topic:
# http://www.youtube.com/watch?v=KBdEwpRQRD0

# IMPORTANT: serving these headers is recommended only for
# development/stage websites (or for live websites that don't
# want to be indexed). This will prevent the website from
# being indexed in SERPs (search engine result pages).
# This is a better approach than using robots.txt
# to disallow robots from crawling your website,
# because disallowing crawling doesn't necessarily
# mean that your website won't get indexed (see the links above).

# <IfModule mod_headers.c>
#   Header set X-Robots-Tag "noindex, nofollow, noarchive"
#   <FilesMatch "\.(doc|pdf|png|jpe?g|gif)$">
#     Header set X-Robots-Tag "noindex, noarchive, nosnippet"
#   </FilesMatch>
# </IfModule>
@chuanxshi
Member

It may not be needed in most development cases, but it would be good if this snippet could be added to the docs or wiki as a 'nice to have' for some use cases.

@nimbupani
Member

@maniqui could you add this to the wiki in the make it better section?


maniqui commented Nov 7, 2011

@nimbupani Done. Under the .htaccess section, I've added a link to this issue. Now I'll close it.

maniqui closed this as completed on Nov 7, 2011
@nimbupani
Member

Thanks Julián!


tanwill commented Jan 30, 2017

<FilesMatch ".(doc|pdf|png|jpe?g|gif)$">

Sorry, how does this discriminate between development and live sites?

I am looking for a way to discriminate between the two using a conditional statement, so that the X-Robots-Tag header doesn't have to be manually taken out of the .htaccess file when deploying to live.

Will this work?

@roblarsen
Member

@tanwill The snippet you quote doesn't discriminate between development and live sites.

The FilesMatch directive limits the scope of the enclosed directives to files that match the pattern. The outer Header directive sets the header for every response, and the enclosed FilesMatch block then sets a different header for the subset of files that match.

Also, off the top of my head I don't think there's any way to do what you want to do. But, I could be wrong. Stack Overflow might be a good place to search to see if there's some way to do that.
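
For illustration, here is a restatement of the snippet quoted above with comments marking what each directive applies to (nothing here is new beyond the comments):

<IfModule mod_headers.c>
  # Applies to every response served under this .htaccess
  Header set X-Robots-Tag "noindex, nofollow, noarchive"
  # For files matching the pattern, this directive overrides the value above
  <FilesMatch "\.(doc|pdf|png|jpe?g|gif)$">
    Header set X-Robots-Tag "noindex, noarchive, nosnippet"
  </FilesMatch>
</IfModule>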


tanwill commented Jan 30, 2017

Thank you, @roblarsen.


tanwill commented Jan 30, 2017

I just found a way to do this, @roblarsen.

Instead of editing the .htaccess file, just edit header.php. Put the conditional there:

<?php
    // 'DEV SITE' and 'LIVE SITE' are placeholders for the actual
    // hostnames, e.g. 'dev.example.com' and 'www.example.com'.
    if ($_SERVER['SERVER_NAME'] === 'DEV SITE') {
        header('X-Robots-Tag: noindex, nofollow');
    } elseif ($_SERVER['SERVER_NAME'] === 'LIVE SITE') {
        header('X-Robots-Tag: index, archive');
    }
?>
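
One caveat with this approach: PHP's header() only affects responses generated by PHP, so static assets (images, PDFs, etc.) served directly by the web server won't carry the X-Robots-Tag header.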

@roblarsen
Member

Ah, yeah, of course you can do that sort of logic in PHP. I was referring to doing it in this context, in .htaccess. Glad to see that you got your problem sorted out.


tanwill commented Jan 30, 2017

Thanks. Yeah, it's a workaround. I think you're right in that there may not be a non-convoluted way to do this in .htaccess.


agrohs commented May 20, 2018

I would tweak the regex ever so slightly to capture a few additional file types (.woff/.woff2/.ttf for fonts; .ppt/.pptx, .xls/.xlsx, .doc/.docx, .dot/.dotx for documents; and .htm as well as .html) and use the following pattern in the FilesMatch (see the sketch below):

woff2?|ttf|xlsx?|pptx?|do(c|t)x?|svg|xml|css|js|php|html?|pdf|png|jpe?g|gif
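
A sketch of how the proposed snippet might look with that extended pattern substituted into the FilesMatch (the pattern is used exactly as given above, including the extensions not listed in the prose such as svg, xml, css, js, and php):

<IfModule mod_headers.c>
  Header set X-Robots-Tag "noindex, nofollow, noarchive"
  <FilesMatch "\.(woff2?|ttf|xlsx?|pptx?|do(c|t)x?|svg|xml|css|js|php|html?|pdf|png|jpe?g|gif)$">
    Header set X-Robots-Tag "noindex, noarchive, nosnippet"
  </FilesMatch>
</IfModule>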


jonom commented Feb 14, 2020

You could try something like this for a .htaccess-only solution, adjusting the regex to suit your needs. This example would block indexing for site.com and beta.site.com but allow indexing of www.site.com.

# Only allow crawling of 'www.' urls
SetEnvIfNoCase Host "^www" crawlable
Header set X-Robots-Tag "noindex, nofollow" env=!crawlable
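
If you want this to fail gracefully when the required modules aren't loaded (SetEnvIfNoCase comes from mod_setenvif and Header from mod_headers), a guarded version might look like this:

<IfModule mod_setenvif.c>
  <IfModule mod_headers.c>
    # Only allow crawling of 'www.' URLs; everything else gets noindex
    SetEnvIfNoCase Host "^www" crawlable
    Header set X-Robots-Tag "noindex, nofollow" env=!crawlable
  </IfModule>
</IfModule>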
