From 9c44ffcb957cad7924f94ffa613d513ba7b46530 Mon Sep 17 00:00:00 2001 From: Brian Thomas Smith Date: Tue, 3 Dec 2024 18:56:41 +0100 Subject: [PATCH 01/18] feat(http): Add X-Robots-Tag header --- .../en-us/web/html/element/meta/name/index.md | 2 +- .../web/http/headers/x-robots-tag/index.md | 175 ++++++++++++++++++ 2 files changed, 176 insertions(+), 1 deletion(-) create mode 100644 files/en-us/web/http/headers/x-robots-tag/index.md diff --git a/files/en-us/web/html/element/meta/name/index.md b/files/en-us/web/html/element/meta/name/index.md index 3d27923a43ce778..f7d3fa3bc656871 100644 --- a/files/en-us/web/html/element/meta/name/index.md +++ b/files/en-us/web/html/element/meta/name/index.md @@ -244,7 +244,7 @@ The [WHATWG Wiki MetaExtensions page](https://wiki.whatwg.org/wiki/MetaExtension > - The `robots` `` tag and `robots.txt` file serve different purposes: `robots.txt` controls the crawling of pages, and does not affect indexing or other behavior controlled by `robots` meta. A page that can't be crawled may still be indexed if it is referenced by another document. > - If you want to remove a page, `noindex` will work, but only after the robot visits the page again. Ensure that the `robots.txt` file is not preventing revisits. > - Some values are mutually exclusive, like `index` and `noindex`, or `follow` and `nofollow`. In these cases the robot's behavior is undefined and may vary between them. - > - Some crawler robots, like Google, Yahoo and Bing, support the same values for the HTTP header `X-Robots-Tag`; this allows non-HTML documents like images to use these rules. + > - Some crawler robots, like Google, Yahoo and Bing, support the same values for the HTTP header {{HTTPHeader("X-Robots-Tag")}}; this allows non-HTML documents like images to use these rules. 
diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md new file mode 100644 index 000000000000000..233a2d8f511110b --- /dev/null +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -0,0 +1,175 @@ +--- +title: X-Robots-Tag +slug: Web/HTTP/Headers/X-Robots-Tag +page-type: http-header +status: + - non-standard +--- + +{{HTTPSidebar}} + +The **`X-Robots-Tag`** {{Glossary("response header")}} is a de-facto standard header for requesting how {{glossary("Crawler", "crawlers")}} should index URLs. +Search-related crawlers use the rules from the `X-Robots-Tag` header to adjust how to present web pages or other resources in search results. + +Indexing rules defined via `` tags and `X-Robots-Tag` headers are discovered when a URL is crawled. +Specifying rules in a HTTP header is appropriate for non-HTML documents like images, PDFs, or other media. + +> [!NOTE] +> Only cooperative robots follow these rules and a crawler still needs to access the page in order to read these rules (see [Interaction with robots.txt](#interaction_with_robots.txt)). +> A {{Glossary("robots.txt")}} file is more appropriate to restrict or prevent bandwidth consumption by crawlers. + + + + + + + + + + + + +
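+<table class="properties">
+  <tbody>
+    <tr>
+      <th scope="row">Header type</th>
+      <td>{{Glossary("Response header")}}</td>
+    </tr>
+    <tr>
+      <th scope="row">{{Glossary("Forbidden header name")}}</th>
+      <td>No</td>
+    </tr>
+  </tbody>
+</table>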
+ +## Syntax + +```http +X-Robots-Tag: +``` + +## Directives + +- `` + + - : A comma-separated list of rules for indexing the resource at the current URL. + Any of the following rules may be used: + + - `all` + - : No restrictions for indexing or serving in search results. + This rule is the default value and has no effect if explicitly listed. + - `noindex` + - : Do not show this page, media, or resource in search results. + If you don't specify this rule, the page, media, or resource may be indexed and shown in search results. + - `nofollow` + - : Do not follow the links on this page. + If you don't specify this rule, search engines may use the links on the page to discover those linked pages. + - `none` + - : Equivalent to `noindex, nofollow`. + - `nosnippet` + - : Do not show a text snippet or video preview in the search results for this page. + A static image thumbnail (if available) may still be visible. + If you don't specify this rule, search engines may generate a text snippet and video preview based on information found on the page. + To exclude certain sections of your content from appearing in search result snippets, use the `data-nosnippet` HTML attribute. + - `indexifembedded` + - : A search engine is allowed to index the content of a page if it's embedded in another page through iframes or similar HTML tags, in spite of a `noindex` rule. + `indexifembedded` only has an effect if it's accompanied by `noindex`. + - `max-snippet: ` + - : Use a maximum of `` characters as a textual snippet for this search result. + Ignored if no valid `` is specified. + - `max-image-preview: ` + + - : The maximum size of an image preview for this page in a search results. + If you don't specify the `max-image-preview` rule, search engines may show an image preview of the default size. + If you don't want search engines to use larger thumbnail images, specify a `max-image-preview` value of `standard` or `none`. 
+ + Accepted `` values: + + - `none` + - : No image preview is to be shown. + - `standard` + - : A default image preview may be shown. + - `large` + - : A larger image preview, up to the width of the viewport, may be shown. + + - `max-video-preview: ` + - : Use a maximum of `` seconds as a video snippet for videos on this page in search results. + If you don't specify the `max-video-preview` rule, search engines may show a video snippet in search results, and a search engines decide how long a preview may be. + Ignored if no valid `` is specified. + Special values are as follows: + - `0` + - : At most, a static image may be used, in accordance to the `max-image-preview` setting. + - `-1` + - : No video length limit. + - `notranslate` + - : Don't offer translation of this page in search results. + If you don't specify this rule, search engines may provide a translation of the title link and snippet of a result for results that aren't in the language of the search query. + - `noimageindex` + - : Do not index images on this page. + If you don't specify this value, images on the page may be indexed and shown in search results. + - `unavailable_after: ` + + - : Requests not to show this page in search results after the specified ``. + Ignored if no valid `` is specified. + A date must be specified in a format such as {{RFC("822")}}, {{RFC("850")}}, or ISO 8601. + + By default there is no expiration date for content. + If you don't specify this rule, this page may be shown in search results indefinitely. + Crawlers are expected to considerably decrease the crawl rate of the URL after the specified date and time. + +## Description + +Indexing rules via `` and `X-Robots-Tag` are discovered when a URL is crawled. +Most crawlers support any rule in a `X-Robots-Tag` HTTP header that can be used in a `` tag. + +In the case of conflicting robots rules, the more restrictive rule applies. 
+For example, if a page has both `max-snippet:50` and `nosnippet` rules, the `nosnippet` rule will apply. + +Some values are mutually exclusive, like `index` and `noindex`, or `follow` and `nofollow`. +In these cases the crawler's behavior is undefined and may vary. + +> [!NOTE] +> It's possible that `X-Robots-Tag` rules may not be treated the same by all search engines. + +### Interaction with robots.txt + +If a page is disallowed from crawling through a `robots.txt` file, then any information about indexing or serving rules specified using `` or the `X-Robots-Tag` HTTP header will not be detected and will therefore be ignored. + +A page that can't be crawled may still be indexed if it is referenced by another document. +If you want to remove a page from search indexes, `X-Robots-Tag: noindex` will typically work, but a robot must first revisit the page to detect the `X-Robots-Tag` rule. + +## Examples + +### Using X-Robots-Tag + +The following `X-Robots-Tag` header adds `noindex`, asking crawlers not to show this page, media, or resource in search results: + +```http +HTTP/1.1 200 OK +Date: Tue, 03 Dec 2024 17:08:49 GMT +X-Robots-Tag: noindex +``` + +### Multiple headers + +The following response has two `X-Robots-Tag` headers, each with an indexing rule specified: + +```http +HTTP/1.1 200 OK +Date: Tue, 03 Dec 2024 17:08:49 GMT +X-Robots-Tag: noimageindex +X-Robots-Tag: unavailable_after: 25 Jun 2010 15:00:00 PST +``` + +### Specifying user agents + +It's possible to specify which user agent the rules should apply to. +The following example contains two `X-Robots-Tag` headers which ask that `googlebot` not follow the links on this page and that a fictional `BadBot` crawler not index the page or follow any links on it, either: + +```http +HTTP/1.1 200 OK +Date: Tue, 03 Dec 2024 17:08:49 GMT +X-Robots-Tag: googlebot: nofollow +X-Robots-Tag: BadBot: noindex, nofollow +``` + +## Specifications + +Not part of any current specification. 
+ +## See also + +- {{HTTPHeader("Forwarded")}} +- {{HTTPHeader("X-Forwarded-For")}} +- {{Glossary("Search engine")}} +- {{RFC("9309", "Robots Exclusion Protocol")}} +- [Using the X-Robots-Tag HTTP header](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag) on developers.google.com From 813fd769db684ac01f464889f0c2937fb32141bc Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Tue, 3 Dec 2024 19:02:53 +0100 Subject: [PATCH 02/18] Update files/en-us/web/http/headers/x-robots-tag/index.md --- files/en-us/web/http/headers/x-robots-tag/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index 233a2d8f511110b..4d5eb9f8f6870b6 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -147,7 +147,7 @@ The following response has two `X-Robots-Tag` headers, each with an indexing rul HTTP/1.1 200 OK Date: Tue, 03 Dec 2024 17:08:49 GMT X-Robots-Tag: noimageindex -X-Robots-Tag: unavailable_after: 25 Jun 2010 15:00:00 PST +X-Robots-Tag: unavailable_after: Wed, 03 Dec 2025 13:09:53 GMT ``` ### Specifying user agents From cf835d0b847bc5a10e9403e4221b0665c68ee60d Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Wed, 4 Dec 2024 11:22:17 +0100 Subject: [PATCH 03/18] Update files/en-us/web/http/headers/x-robots-tag/index.md Co-authored-by: Estelle Weyl --- files/en-us/web/http/headers/x-robots-tag/index.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index 4d5eb9f8f6870b6..cdac564c732c58f 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -69,10 +69,8 @@ X-Robots-Tag: - `max-image-preview: ` - : The maximum size of an image preview for this page in a search 
results. - If you don't specify the `max-image-preview` rule, search engines may show an image preview of the default size. - If you don't want search engines to use larger thumbnail images, specify a `max-image-preview` value of `standard` or `none`. - - Accepted `` values: + If omitted, search engines may show an image preview of the default size. + If you don't want search engines to use larger thumbnail images, specify a `max-image-preview` value of `standard` or `none`. Values include: - `none` - : No image preview is to be shown. From 766bf09f134b20c63a17d29dba5f4b35931d4037 Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Wed, 4 Dec 2024 11:23:08 +0100 Subject: [PATCH 04/18] Update files/en-us/web/http/headers/x-robots-tag/index.md Co-authored-by: Estelle Weyl --- files/en-us/web/http/headers/x-robots-tag/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index cdac564c732c58f..90fd0566d35b6e8 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -90,7 +90,7 @@ X-Robots-Tag: - : No video length limit. - `notranslate` - : Don't offer translation of this page in search results. - If you don't specify this rule, search engines may provide a translation of the title link and snippet of a result for results that aren't in the language of the search query. + If omitted, search engines may translate the search result title and snippet into the language of the search query. - `noimageindex` - : Do not index images on this page. If you don't specify this value, images on the page may be indexed and shown in search results. 
From 65e6ab7ecad092fb5eb53844d905c0bd64ca12f0 Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Wed, 4 Dec 2024 11:23:23 +0100 Subject: [PATCH 05/18] Update files/en-us/web/http/headers/x-robots-tag/index.md Co-authored-by: Estelle Weyl --- files/en-us/web/http/headers/x-robots-tag/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index 90fd0566d35b6e8..1c482b294b89e1d 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -93,7 +93,7 @@ X-Robots-Tag: If omitted, search engines may translate the search result title and snippet into the language of the search query. - `noimageindex` - : Do not index images on this page. - If you don't specify this value, images on the page may be indexed and shown in search results. + If omitted, images on the page may be indexed and shown in search results. - `unavailable_after: ` - : Requests not to show this page in search results after the specified ``. From 23a668f535e8638da6be7bfc3e1d6a57f1e3967e Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Wed, 4 Dec 2024 11:25:36 +0100 Subject: [PATCH 06/18] Apply suggestions from code review Co-authored-by: Estelle Weyl --- files/en-us/web/http/headers/x-robots-tag/index.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index 1c482b294b89e1d..abaebe3565c1e7a 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -107,7 +107,7 @@ X-Robots-Tag: ## Description Indexing rules via `` and `X-Robots-Tag` are discovered when a URL is crawled. -Most crawlers support any rule in a `X-Robots-Tag` HTTP header that can be used in a `` tag. 
+Most crawlers support rules in the `X-Robots-Tag` HTTP header that can be used in a `` tag. In the case of conflicting robots rules, the more restrictive rule applies. For example, if a page has both `max-snippet:50` and `nosnippet` rules, the `nosnippet` rule will apply. @@ -116,7 +116,7 @@ Some values are mutually exclusive, like `index` and `noindex`, or `follow` and In these cases the crawler's behavior is undefined and may vary. > [!NOTE] -> It's possible that `X-Robots-Tag` rules may not be treated the same by all search engines. +> The `X-Robots-Tag` rules may not be treated the same by all search engines. ### Interaction with robots.txt @@ -166,8 +166,6 @@ Not part of any current specification. ## See also -- {{HTTPHeader("Forwarded")}} -- {{HTTPHeader("X-Forwarded-For")}} - {{Glossary("Search engine")}} - {{RFC("9309", "Robots Exclusion Protocol")}} - [Using the X-Robots-Tag HTTP header](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag) on developers.google.com From 11307531a52dd8814f6e3bb235e440c5e6b80e53 Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Thu, 5 Dec 2024 17:29:23 +0100 Subject: [PATCH 07/18] Update files/en-us/web/http/headers/x-robots-tag/index.md Co-authored-by: Estelle Weyl --- files/en-us/web/http/headers/x-robots-tag/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index abaebe3565c1e7a..364bf55a18b7a75 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -8,7 +8,7 @@ status: {{HTTPSidebar}} -The **`X-Robots-Tag`** {{Glossary("response header")}} is a de-facto standard header for requesting how {{glossary("Crawler", "crawlers")}} should index URLs. +The **`X-Robots-Tag`** {{Glossary("response header")}} defines how {{glossary("Crawler", "crawlers")}} should index URLs. 
While not part of any specification, it is a de-facto standard method for communicating with search bots, web crawlers, and similar user agents. Search-related crawlers use the rules from the `X-Robots-Tag` header to adjust how to present web pages or other resources in search results. Indexing rules defined via `` tags and `X-Robots-Tag` headers are discovered when a URL is crawled. From 6c1bc485864f2b0040f17ad5d554a774d324d34f Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Thu, 5 Dec 2024 17:30:39 +0100 Subject: [PATCH 08/18] Update files/en-us/web/http/headers/x-robots-tag/index.md --- files/en-us/web/http/headers/x-robots-tag/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index 364bf55a18b7a75..98c105dd0284fbf 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -15,8 +15,8 @@ Indexing rules defined via `` tags and `X-Robots-Tag` header Specifying rules in a HTTP header is appropriate for non-HTML documents like images, PDFs, or other media. > [!NOTE] -> Only cooperative robots follow these rules and a crawler still needs to access the page in order to read these rules (see [Interaction with robots.txt](#interaction_with_robots.txt)). -> A {{Glossary("robots.txt")}} file is more appropriate to restrict or prevent bandwidth consumption by crawlers. +> Only cooperative robots follow these rules, and a crawler still needs to access the page to read them (see [Interaction with robots.txt](#interaction_with_robots.txt)). +> If you want to prevent bandwidth consumption by crawlers, a restrictive {{Glossary("robots.txt")}} file is more effective than indexing rules in the HTTP header or meta tags. 
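The note reworded in the patch above says that a crawler must be able to access a resource before it can read its indexing rules. A minimal sketch of that interaction, using Python's standard `urllib.robotparser`, is shown below. The `ROBOTS_TXT` content, the `ExampleBot` name, the helper `can_read_indexing_rules`, and the example.com URLs are all assumptions made up for illustration; real crawlers may evaluate `robots.txt` differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_read_indexing_rules(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a cooperative crawler may fetch `url`.

    Only a crawler that is allowed to fetch the URL can see an
    X-Robots-Tag header sent with the response.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/private/report.pdf",
        "https://www.example.com/public/report.pdf",
    ):
        allowed = can_read_indexing_rules(ROBOTS_TXT, "ExampleBot", url)
        print(url, "-> header readable" if allowed else "-> header never seen")
```

In this sketch, an `X-Robots-Tag: noindex` header on anything under `/private/` would never be read by a cooperative crawler, which is why blocking a path in `robots.txt` is not a reliable way to remove it from an index.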
From 725ffb7288f053bac3619ef9b9ad0b87e3c94208 Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Thu, 5 Dec 2024 17:39:58 +0100 Subject: [PATCH 09/18] Update files/en-us/web/http/headers/x-robots-tag/index.md Co-authored-by: Estelle Weyl --- files/en-us/web/http/headers/x-robots-tag/index.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index 98c105dd0284fbf..8d8b1ac358514dd 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -109,7 +109,8 @@ X-Robots-Tag: Indexing rules via `` and `X-Robots-Tag` are discovered when a URL is crawled. Most crawlers support rules in the `X-Robots-Tag` HTTP header that can be used in a `` tag. -In the case of conflicting robots rules, the more restrictive rule applies. +In the case of conflicting robot rules within the `X-Robots-Tag` or between the `X-Robots-Tag` HTTP header and the `` tag, the more restrictive rule applies. +Neither's rules will apply if [blocked from being read](#interaction_with_robotstxt) by a `robots.txt` file with a `noindex, nofollow` or `none`. For example, if a page has both `max-snippet:50` and `nosnippet` rules, the `nosnippet` rule will apply. Some values are mutually exclusive, like `index` and `noindex`, or `follow` and `nofollow`. 
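The conflict rule added in the patch above (the more restrictive rule applies) can be sketched for the snippet rules as follows. This is one possible reading, not a description of how any particular search engine resolves conflicts: the function name `effective_snippet_limit` and the convention that `0` means "no snippet" and `None` means "no limit requested" are assumptions for this illustration, and negative numbers are simply skipped.

```python
from typing import Optional


def effective_snippet_limit(rules: list[str]) -> Optional[int]:
    """Return the snippet length limit implied by a list of indexing rules.

    0 means no snippet at all; None means no limit was requested.
    """
    limit: Optional[int] = None
    for rule in rules:
        name, _, value = rule.partition(":")
        name = name.strip().lower()
        if name == "nosnippet":
            return 0  # most restrictive outcome, nothing can override it
        if name == "max-snippet":
            try:
                number = int(value.strip())
            except ValueError:
                continue  # ignored if no valid number is specified
            if number < 0:
                continue  # negative values are outside the scope of this sketch
            # Of several limits, keep the smallest (most restrictive) one.
            limit = number if limit is None else min(limit, number)
    return limit


if __name__ == "__main__":
    print(effective_snippet_limit(["max-snippet:50", "nosnippet"]))
```

The rules passed in could come from the header, the meta tag, or both, since the patch text treats conflicts between the two sources the same way as conflicts within one of them.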
From db796f87429dc57a8b1ffb4f379e8dffbffe06e0 Mon Sep 17 00:00:00 2001 From: Brian Thomas Smith Date: Fri, 6 Dec 2024 11:08:44 +0100 Subject: [PATCH 10/18] feat(http): X-Robots-Tag header, robots.txt --- files/en-us/glossary/robots.txt/index.md | 18 +- .../web/http/headers/x-robots-tag/index.md | 164 ++++++++++-------- 2 files changed, 102 insertions(+), 80 deletions(-) diff --git a/files/en-us/glossary/robots.txt/index.md b/files/en-us/glossary/robots.txt/index.md index f8c59e870d7dcf3..a67b7d419fdc713 100644 --- a/files/en-us/glossary/robots.txt/index.md +++ b/files/en-us/glossary/robots.txt/index.md @@ -6,13 +6,21 @@ page-type: glossary-definition {{GlossarySidebar}} -Robots.txt is a file which is usually placed in the root of any website. It decides whether {{Glossary("crawler", "crawlers")}} are permitted or forbidden access to the website. +A **robots.txt** is a file which is usually placed in the root of a website (for example, `https://www.example.com/robots.txt`). +It specifies whether {{Glossary("crawler", "crawlers")}} are allowed or disallowed from accessing an entire website or to certain resources on a website. +A restrictive `robots.txt` file can prevent bandwidth consumption by crawlers. -For example, the site admin can forbid crawlers to visit a certain folder (and all the files therein contained) or to crawl a specific file, usually to prevent those files being indexed by other search engines. +A site owner can forbid crawlers to detect a certain path (and all files in that path) or a specific file. +This is often done to prevent these resources from being indexed or served by search engines. + +If a crawler is allowed to access resources, you can define [indexing rules](/en-US/docs/Web/HTTP/Headers/X-Robots-Tag#directives) for those resources via `` tags and {{HTTPHeader("X-Robots-Tag")}} HTTP headers. 
+Search-related crawlers use these rules to determine how to index and serve resources in search results, or to adjust the crawl rate for specific resources over time. ## See also +- {{HTTPHeader("X-Robots-Tag")}} +- {{Glossary("Search engine")}} +- {{RFC("9309", "Robots Exclusion Protocol")}} +- [How Google interprets the robots.txt specification](https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt) on developers.google.com +- https://www.robotstxt.org - [Robots.txt](https://en.wikipedia.org/wiki/Robots.txt) on Wikipedia -- -- Standard specification: [RFC9309](https://www.rfc-editor.org/rfc/rfc9309.html) -- diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index 8d8b1ac358514dd..fd1ba6910ade797 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -8,15 +8,16 @@ status: {{HTTPSidebar}} -The **`X-Robots-Tag`** {{Glossary("response header")}} defines how {{glossary("Crawler", "crawlers")}} should index URLs. While not part of any specification, it is a de-facto standard method for communicating with search bots, web crawlers, and similar user agents. +The **`X-Robots-Tag`** {{Glossary("response header")}} defines how {{glossary("Crawler", "crawlers")}} should index URLs. +While not part of any specification, it is a de-facto standard method for communicating with search bots, web crawlers, and similar user agents. Search-related crawlers use the rules from the `X-Robots-Tag` header to adjust how to present web pages or other resources in search results. Indexing rules defined via `` tags and `X-Robots-Tag` headers are discovered when a URL is crawled. -Specifying rules in a HTTP header is appropriate for non-HTML documents like images, PDFs, or other media. +Specifying indexing rules in a HTTP header is useful for non-HTML documents like images, PDFs, or other media. 
> [!NOTE] -> Only cooperative robots follow these rules, and a crawler still needs to access the page to read them (see [Interaction with robots.txt](#interaction_with_robots.txt)). -> If you want to prevent bandwidth consumption by crawlers, a restrictive {{Glossary("robots.txt")}} file is more effective than indexing rules in the HTTP header or meta tags. +> Only cooperative robots follow these rules, and a crawler still needs to access the resource to read them (see [Interaction with robots.txt](#interaction_with_robots.txt)). +> If you want to prevent bandwidth consumption by crawlers, a restrictive {{Glossary("robots.txt")}} file is more effective than indexing rules as it blocks resources from being crawled entirely.
@@ -33,76 +34,82 @@ Specifying rules in a HTTP header is appropriate for non-HTML documents like ima ## Syntax +One or more indexing rules as a comma-separated list: + ```http -X-Robots-Tag: +X-Robots-Tag: +X-Robots-Tag: , , … ``` -## Directives +An optional `:` specifies the user agent that the subsequent rules should apply to: -- `` +```http +X-Robots-Tag: , : +X-Robots-Tag: : , , … +``` - - : A comma-separated list of rules for indexing the resource at the current URL. - Any of the following rules may be used: +See [Specifying user agents](#specifying_user_agents) for an example. - - `all` - - : No restrictions for indexing or serving in search results. - This rule is the default value and has no effect if explicitly listed. - - `noindex` - - : Do not show this page, media, or resource in search results. - If you don't specify this rule, the page, media, or resource may be indexed and shown in search results. - - `nofollow` - - : Do not follow the links on this page. - If you don't specify this rule, search engines may use the links on the page to discover those linked pages. +## Directives + +Any of the following indexing rules may be used: + +- `all` + - : No restrictions for indexing or serving in search results. + This rule is the default value and has no effect if explicitly listed. +- `noindex` + - : Do not show this page, media, or resource in search results. + If omitted, the page, media, or resource may be indexed and shown in search results. +- `nofollow` + - : Do not follow the links on this page. + If omitted, search engines may use the links on the page to discover those linked pages. +- `none` + - : Equivalent to `noindex, nofollow`. +- `nosnippet` + - : Do not show a text snippet or video preview in the search results for this page. + A static image thumbnail (if available) may still be visible. + If omitted, search engines may generate a text snippet and video preview based on information found on the page. 
+ To exclude certain sections of your content from appearing in search result snippets, use the [`data-nosnippet` HTML attribute](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#data-nosnippet-attr). +- `indexifembedded` + - : A search engine is allowed to index the content of a page if it's embedded in another page through iframes or similar HTML tags, in spite of a `noindex` rule. + `indexifembedded` only has an effect if it's accompanied by `noindex`. +- `max-snippet: ` + - : Use a maximum of `` characters as a textual snippet for this search result. + Ignored if no valid `` is specified. +- `max-image-preview: ` + - : The maximum size of an image preview for this page in a search results. + If omitted, search engines may show an image preview of the default size. + If you don't want search engines to use larger thumbnail images, specify a `max-image-preview` value of `standard` or `none`. Values include: - `none` - - : Equivalent to `noindex, nofollow`. - - `nosnippet` - - : Do not show a text snippet or video preview in the search results for this page. - A static image thumbnail (if available) may still be visible. - If you don't specify this rule, search engines may generate a text snippet and video preview based on information found on the page. - To exclude certain sections of your content from appearing in search result snippets, use the `data-nosnippet` HTML attribute. - - `indexifembedded` - - : A search engine is allowed to index the content of a page if it's embedded in another page through iframes or similar HTML tags, in spite of a `noindex` rule. - `indexifembedded` only has an effect if it's accompanied by `noindex`. - - `max-snippet: ` - - : Use a maximum of `` characters as a textual snippet for this search result. - Ignored if no valid `` is specified. - - `max-image-preview: ` - - - : The maximum size of an image preview for this page in a search results. 
- If omitted, search engines may show an image preview of the default size. - If you don't want search engines to use larger thumbnail images, specify a `max-image-preview` value of `standard` or `none`. Values include: - - - `none` - - : No image preview is to be shown. - - `standard` - - : A default image preview may be shown. - - `large` - - : A larger image preview, up to the width of the viewport, may be shown. - - - `max-video-preview: ` - - : Use a maximum of `` seconds as a video snippet for videos on this page in search results. - If you don't specify the `max-video-preview` rule, search engines may show a video snippet in search results, and a search engines decide how long a preview may be. - Ignored if no valid `` is specified. - Special values are as follows: - - `0` - - : At most, a static image may be used, in accordance to the `max-image-preview` setting. - - `-1` - - : No video length limit. - - `notranslate` - - : Don't offer translation of this page in search results. - If omitted, search engines may translate the search result title and snippet into the language of the search query. - - `noimageindex` - - : Do not index images on this page. - If omitted, images on the page may be indexed and shown in search results. - - `unavailable_after: ` - - - : Requests not to show this page in search results after the specified ``. - Ignored if no valid `` is specified. - A date must be specified in a format such as {{RFC("822")}}, {{RFC("850")}}, or ISO 8601. - - By default there is no expiration date for content. - If you don't specify this rule, this page may be shown in search results indefinitely. - Crawlers are expected to considerably decrease the crawl rate of the URL after the specified date and time. + - : No image preview is to be shown. + - `standard` + - : A default image preview may be shown. + - `large` + - : A larger image preview, up to the width of the viewport, may be shown. 
+- `max-video-preview: ` + - : Use a maximum of `` seconds as a video snippet for videos on this page in search results. + If omitted, search engines may show a video snippet in search results, and the search engine decides how long a preview may be. + Ignored if no valid `` is specified. + Special values are as follows: + - `0` + - : At most, a static image may be used, in accordance to the `max-image-preview` setting. + - `-1` + - : No video length limit. +- `notranslate` + - : Don't offer translation of this page in search results. + If omitted, search engines may translate the search result title and snippet into the language of the search query. +- `noimageindex` + - : Do not index images on this page. + If omitted, images on the page may be indexed and shown in search results. +- `unavailable_after: ` + + - : Requests not to show this page in search results after the specified ``. + Ignored if no valid `` is specified. + A date must be specified in a format such as {{RFC("822")}}, {{RFC("850")}}, or ISO 8601. + + By default there is no expiration date for content. + If omitted, this page may be shown in search results indefinitely. + Crawlers are expected to considerably decrease the crawl rate of the URL after the specified date and time. ## Description @@ -110,20 +117,17 @@ Indexing rules via `` and `X-Robots-Tag` are discovered when Most crawlers support rules in the `X-Robots-Tag` HTTP header that can be used in a `` tag. In the case of conflicting robot rules within the `X-Robots-Tag` or between the `X-Robots-Tag` HTTP header and the `` tag, the more restrictive rule applies. -Neither's rules will apply if [blocked from being read](#interaction_with_robotstxt) by a `robots.txt` file with a `noindex, nofollow` or `none`. For example, if a page has both `max-snippet:50` and `nosnippet` rules, the `nosnippet` rule will apply. +Indexing rules are not applied if paths are blocked from being crawled by a `robots.txt` file. 
-Some values are mutually exclusive, like `index` and `noindex`, or `follow` and `nofollow`. -In these cases the crawler's behavior is undefined and may vary. - -> [!NOTE] -> The `X-Robots-Tag` rules may not be treated the same by all search engines. +Some values are mutually exclusive, such as `index` and `noindex`, or `follow` and `nofollow`. +In these cases, the crawler's behavior is undefined and may vary. ### Interaction with robots.txt -If a page is disallowed from crawling through a `robots.txt` file, then any information about indexing or serving rules specified using `` or the `X-Robots-Tag` HTTP header will not be detected and will therefore be ignored. +If a resource is blocked from crawling through a `robots.txt` file, then any information about indexing or serving rules specified using `` or the `X-Robots-Tag` HTTP header will not be detected and will therefore be ignored. -A page that can't be crawled may still be indexed if it is referenced by another document. +A page that's blocked from crawling may still be indexed if it is referenced from another document (see the [`nofollow`](#nofollow) directive). If you want to remove a page from search indexes, `X-Robots-Tag: noindex` will typically work, but a robot must first revisit the page to detect the `X-Robots-Tag` rule. ## Examples @@ -161,12 +165,22 @@ X-Robots-Tag: googlebot: nofollow X-Robots-Tag: BadBot: noindex, nofollow ``` +In the response below, the same indexing rules are defined, but in a single header. +Each indexing rule applies to the user agent specified behind it: + +```http +HTTP/1.1 200 OK +Date: Tue, 03 Dec 2024 17:08:49 GMT +X-Robots-Tag: BadBot: noindex, nofollow, googlebot: nofollow +``` + ## Specifications Not part of any current specification. 
## See also +- {{Glossary("Robots.txt")}} - {{Glossary("Search engine")}} - {{RFC("9309", "Robots Exclusion Protocol")}} - [Using the X-Robots-Tag HTTP header](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag) on developers.google.com From 875b9b383ff9946325f73446b6625283a78ed960 Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Fri, 6 Dec 2024 11:22:57 +0100 Subject: [PATCH 11/18] Update files/en-us/web/http/headers/x-robots-tag/index.md --- files/en-us/web/http/headers/x-robots-tag/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index fd1ba6910ade797..06ec63c0b8e331e 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -118,7 +118,7 @@ Most crawlers support rules in the `X-Robots-Tag` HTTP header that can be used i In the case of conflicting robot rules within the `X-Robots-Tag` or between the `X-Robots-Tag` HTTP header and the `` tag, the more restrictive rule applies. For example, if a page has both `max-snippet:50` and `nosnippet` rules, the `nosnippet` rule will apply. -Indexing rules are not applied if paths are blocked from being crawled by a `robots.txt` file. +Indexing rules won't be discovered or applied if paths are blocked from being crawled by a `robots.txt` file. Some values are mutually exclusive, such as `index` and `noindex`, or `follow` and `nofollow`. In these cases, the crawler's behavior is undefined and may vary. 
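
The `robots.txt` interaction described in these patches (indexing rules are not discovered for paths that are blocked from crawling) can be illustrated with Python's standard `urllib.robotparser` module. This is only a sketch of how a cooperative crawler might decide whether it ever gets to read an `X-Robots-Tag` header. The `robots.txt` content, the `ExampleBot` name, the URLs, and the helper `can_discover_indexing_rules` are all made up for illustration and do not come from the patch series.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def can_discover_indexing_rules(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if a cooperative crawler may fetch `url`.

    Only a fetched response exposes its X-Robots-Tag header, so a URL that
    is blocked by robots.txt never has its `noindex` (or any other rule) read.
    """
    return robots.can_fetch(user_agent, url)


print(can_discover_indexing_rules("https://example.com/public/report.pdf"))
print(can_discover_indexing_rules("https://example.com/private/report.pdf"))
```

Under these assumptions, a `noindex` header served for anything under `/private/` would go unnoticed by the crawler, which is why the documentation advises making sure `robots.txt` does not prevent revisits when a page should be removed from an index.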
From 4874024146982d8d0e35fe02a2f9b1c4a5a74295 Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Mon, 9 Dec 2024 11:33:44 +0100 Subject: [PATCH 12/18] Update files/en-us/web/http/headers/x-robots-tag/index.md Co-authored-by: Estelle Weyl --- files/en-us/web/http/headers/x-robots-tag/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index 06ec63c0b8e331e..e003771f7254d4c 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -16,7 +16,7 @@ Indexing rules defined via `` tags and `X-Robots-Tag` header Specifying indexing rules in a HTTP header is useful for non-HTML documents like images, PDFs, or other media. > [!NOTE] -> Only cooperative robots follow these rules, and a crawler still needs to access the resource to read them (see [Interaction with robots.txt](#interaction_with_robots.txt)). +> Only cooperative robots follow these rules, and a crawler still needs to access the resource to read headers and meta tags (see [Interaction with robots.txt](#interaction_with_robots.txt)). > If you want to prevent bandwidth consumption by crawlers, a restrictive {{Glossary("robots.txt")}} file is more effective than indexing rules as it blocks resources from being crawled entirely.
From 2cd309a78d6818bf8c5773d8c5bc1421d12c32c0 Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Wed, 11 Dec 2024 12:36:37 +0100 Subject: [PATCH 13/18] Apply suggestions from code review Co-authored-by: Vadim Makeev --- files/en-us/glossary/robots.txt/index.md | 2 +- files/en-us/web/html/element/meta/name/index.md | 2 +- files/en-us/web/http/headers/x-robots-tag/index.md | 8 ++++---- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/files/en-us/glossary/robots.txt/index.md b/files/en-us/glossary/robots.txt/index.md index a67b7d419fdc713..0f8d22c657c6f32 100644 --- a/files/en-us/glossary/robots.txt/index.md +++ b/files/en-us/glossary/robots.txt/index.md @@ -13,7 +13,7 @@ A restrictive `robots.txt` file can prevent bandwidth consumption by crawlers. A site owner can forbid crawlers to detect a certain path (and all files in that path) or a specific file. This is often done to prevent these resources from being indexed or served by search engines. -If a crawler is allowed to access resources, you can define [indexing rules](/en-US/docs/Web/HTTP/Headers/X-Robots-Tag#directives) for those resources via `` tags and {{HTTPHeader("X-Robots-Tag")}} HTTP headers. +If a crawler is allowed to access resources, you can define [indexing rules](/en-US/docs/Web/HTTP/Headers/X-Robots-Tag#directives) for those resources via `` elements and {{HTTPHeader("X-Robots-Tag")}} HTTP headers. Search-related crawlers use these rules to determine how to index and serve resources in search results, or to adjust the crawl rate for specific resources over time. ## See also diff --git a/files/en-us/web/html/element/meta/name/index.md b/files/en-us/web/html/element/meta/name/index.md index f7d3fa3bc656871..f7ad03baa657923 100644 --- a/files/en-us/web/html/element/meta/name/index.md +++ b/files/en-us/web/html/element/meta/name/index.md @@ -241,7 +241,7 @@ The [WHATWG Wiki MetaExtensions page](https://wiki.whatwg.org/wiki/MetaExtension > > - Only cooperative robots follow these rules. 
Do not expect to prevent email harvesters with them. > - The robot still needs to access the page in order to read these rules. To prevent bandwidth consumption, consider if using a _{{Glossary("robots.txt")}}_ file is more appropriate. - > - The `robots` `` tag and `robots.txt` file serve different purposes: `robots.txt` controls the crawling of pages, and does not affect indexing or other behavior controlled by `robots` meta. A page that can't be crawled may still be indexed if it is referenced by another document. + > - The `` element and `robots.txt` file serve different purposes: `robots.txt` controls the crawling of pages, and does not affect indexing or other behavior controlled by `robots` meta. A page that can't be crawled may still be indexed if it is referenced by another document. > - If you want to remove a page, `noindex` will work, but only after the robot visits the page again. Ensure that the `robots.txt` file is not preventing revisits. > - Some values are mutually exclusive, like `index` and `noindex`, or `follow` and `nofollow`. In these cases the robot's behavior is undefined and may vary between them. > - Some crawler robots, like Google, Yahoo and Bing, support the same values for the HTTP header {{HTTPHeader("X-Robots-Tag")}}; this allows non-HTML documents like images to use these rules. diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index e003771f7254d4c..206dcd31c33e264 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -12,7 +12,7 @@ The **`X-Robots-Tag`** {{Glossary("response header")}} defines how {{glossary("C While not part of any specification, it is a de-facto standard method for communicating with search bots, web crawlers, and similar user agents. Search-related crawlers use the rules from the `X-Robots-Tag` header to adjust how to present web pages or other resources in search results. 
-Indexing rules defined via `` tags and `X-Robots-Tag` headers are discovered when a URL is crawled. +Indexing rules defined via `` elements and `X-Robots-Tag` headers are discovered when a URL is crawled. Specifying indexing rules in a HTTP header is useful for non-HTML documents like images, PDFs, or other media. > [!NOTE] @@ -71,7 +71,7 @@ Any of the following indexing rules may be used: If omitted, search engines may generate a text snippet and video preview based on information found on the page. To exclude certain sections of your content from appearing in search result snippets, use the [`data-nosnippet` HTML attribute](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#data-nosnippet-attr). - `indexifembedded` - - : A search engine is allowed to index the content of a page if it's embedded in another page through iframes or similar HTML tags, in spite of a `noindex` rule. + - : A search engine is allowed to index the content of a page if it's embedded in another page through iframes or similar HTML elements, in spite of a `noindex` rule. `indexifembedded` only has an effect if it's accompanied by `noindex`. - `max-snippet: ` - : Use a maximum of `` characters as a textual snippet for this search result. @@ -114,9 +114,9 @@ Any of the following indexing rules may be used: ## Description Indexing rules via `` and `X-Robots-Tag` are discovered when a URL is crawled. -Most crawlers support rules in the `X-Robots-Tag` HTTP header that can be used in a `` tag. +Most crawlers support rules in the `X-Robots-Tag` HTTP header that can be used in a `` element. -In the case of conflicting robot rules within the `X-Robots-Tag` or between the `X-Robots-Tag` HTTP header and the `` tag, the more restrictive rule applies. +In the case of conflicting robot rules within the `X-Robots-Tag` or between the `X-Robots-Tag` HTTP header and the `` element, the more restrictive rule applies. 
For example, if a page has both `max-snippet:50` and `nosnippet` rules, the `nosnippet` rule will apply. Indexing rules won't be discovered or applied if paths are blocked from being crawled by a `robots.txt` file. From 77a5501a3845632f9b202f54698f1b2c2beba33e Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Wed, 11 Dec 2024 13:47:52 +0100 Subject: [PATCH 14/18] Update files/en-us/web/http/headers/x-robots-tag/index.md Co-authored-by: Vadim Makeev --- files/en-us/web/http/headers/x-robots-tag/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index 206dcd31c33e264..51c1970e3070e19 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -16,7 +16,7 @@ Indexing rules defined via `` elements and `X-Robots-Tag` he Specifying indexing rules in a HTTP header is useful for non-HTML documents like images, PDFs, or other media. > [!NOTE] -> Only cooperative robots follow these rules, and a crawler still needs to access the resource to read headers and meta tags (see [Interaction with robots.txt](#interaction_with_robots.txt)). +> Only cooperative robots follow these rules, and a crawler still needs to access the resource to read headers and meta elements (see [Interaction with robots.txt](#interaction_with_robots.txt)). > If you want to prevent bandwidth consumption by crawlers, a restrictive {{Glossary("robots.txt")}} file is more effective than indexing rules as it blocks resources from being crawled entirely.
From 02ce30be35e06aed16bea0ce5172f7535ebdd704 Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Wed, 11 Dec 2024 13:48:26 +0100 Subject: [PATCH 15/18] Update files/en-us/web/http/headers/x-robots-tag/index.md --- files/en-us/web/http/headers/x-robots-tag/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index 51c1970e3070e19..4b5451ca7721ad3 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -38,7 +38,7 @@ One or more indexing rules as a comma-separated list: ```http X-Robots-Tag: -X-Robots-Tag: , , … +X-Robots-Tag: , …, ``` An optional `:` specifies the user agent that the subsequent rules should apply to: From 3ed0b48d23d042c968b9b89bb165633a4f17c270 Mon Sep 17 00:00:00 2001 From: Brian Smith Date: Wed, 11 Dec 2024 13:49:21 +0100 Subject: [PATCH 16/18] Update files/en-us/web/http/headers/x-robots-tag/index.md --- files/en-us/web/http/headers/x-robots-tag/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md index 4b5451ca7721ad3..e8d5d20006b566c 100644 --- a/files/en-us/web/http/headers/x-robots-tag/index.md +++ b/files/en-us/web/http/headers/x-robots-tag/index.md @@ -45,7 +45,7 @@ An optional `:` specifies the user agent that the subsequent rules sho ```http X-Robots-Tag: , : -X-Robots-Tag: : , , … +X-Robots-Tag: : , …, ``` See [Specifying user agents](#specifying_user_agents) for an example. 
From 5eb2d4ce67f60578534e61ac6f53e8aa658b2043 Mon Sep 17 00:00:00 2001
From: Brian Thomas Smith
Date: Wed, 11 Dec 2024 13:54:20 +0100
Subject: [PATCH 17/18] chore(http): improvements following reviewer feedback

---
 files/en-us/web/http/headers/x-robots-tag/index.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md
index e8d5d20006b566c..1e1cf2ef838a270 100644
--- a/files/en-us/web/http/headers/x-robots-tag/index.md
+++ b/files/en-us/web/http/headers/x-robots-tag/index.md
@@ -174,6 +174,16 @@ Date: Tue, 03 Dec 2024 17:08:49 GMT
 X-Robots-Tag: BadBot: noindex, nofollow, googlebot: nofollow
 ```
 
+For situations where multiple crawlers are specified along with different rules, the search engine will use the sum of the negative rules.
+For example:
+
+```http
+X-Robots-Tag: nofollow
+X-Robots-Tag: googlebot: noindex
+```
+
+The page containing these headers will be interpreted as having a `noindex, nofollow` rule when crawled by `googlebot`.
+
 ## Specifications
 
 Not part of any current specification.
From f2b548e015376c8ced8e19250b8a1721187d96e3 Mon Sep 17 00:00:00 2001
From: Brian Thomas Smith
Date: Wed, 11 Dec 2024 13:56:22 +0100
Subject: [PATCH 18/18] chore(http): improvements following reviewer feedback

---
 files/en-us/web/http/headers/x-robots-tag/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md
index 1e1cf2ef838a270..029138228b3329f 100644
--- a/files/en-us/web/http/headers/x-robots-tag/index.md
+++ b/files/en-us/web/http/headers/x-robots-tag/index.md
@@ -161,8 +161,8 @@ The following example contains two `X-Robots-Tag` headers which ask that `google
 ```http
 HTTP/1.1 200 OK
 Date: Tue, 03 Dec 2024 17:08:49 GMT
-X-Robots-Tag: googlebot: nofollow
 X-Robots-Tag: BadBot: noindex, nofollow
+X-Robots-Tag: googlebot: nofollow
 ```
 
 In the response below, the same indexing rules are defined, but in a single header.
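
The combining behavior documented in PATCH 17/18 (a crawler uses the sum of the negative rules that apply to it, whether they are unscoped or scoped to its own name) can be sketched in Python. This is a minimal illustration and not how any search engine actually parses the header. The function names `normalize` and `effective_rules` and the `VALUED_RULES` set are assumptions introduced here, and the sketch only takes the union of rules without resolving conflicts such as `index` versus `noindex`.

```python
# Rule names that carry a value after a colon. Any other "name:" prefix is
# treated as a user agent name (an assumption made for this sketch).
VALUED_RULES = {
    "max-snippet",
    "max-image-preview",
    "max-video-preview",
    "unavailable_after",
}


def normalize(rule: str) -> str:
    """Lowercase the rule name and drop whitespace around the colon."""
    name, sep, value = rule.partition(":")
    name = name.strip().lower()
    return f"{name}:{value.strip()}" if sep else name


def effective_rules(header_values: list[str], user_agent: str) -> set[str]:
    """Collect the rules from all X-Robots-Tag values that apply to `user_agent`.

    Limitation: values are split on commas, so an `unavailable_after` date
    that itself contains a comma is not handled by this sketch.
    """
    rules: set[str] = set()
    wanted = user_agent.lower()
    for header in header_values:
        scope = None  # None means the rule applies to every crawler
        for item in header.split(","):
            item = item.strip()
            name, sep, rest = item.partition(":")
            if sep and name.strip().lower() not in VALUED_RULES:
                # "<bot name>: <rule>" scopes this and the following rules
                scope = name.strip().lower()
                item = rest.strip()
            if item and scope in (None, wanted):
                rules.add(normalize(item))
    return rules


headers = ["nofollow", "googlebot: noindex"]
print(sorted(effective_rules(headers, "googlebot")))  # ['nofollow', 'noindex']
```

With the two headers from the patch, `googlebot` ends up with both `nofollow` and `noindex`, matching the `noindex, nofollow` interpretation described above, while a crawler with a different name only sees the unscoped `nofollow`.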