Proposal: Help search engines find latest version of a crate #1438

jsha opened this issue Jun 25, 2021 · 42 comments

@jsha
Contributor

jsha commented Jun 25, 2021

Motivation and Summary

When someone uses a web search engine to find a crate’s documentation, they are likely to wind up on the documentation for a random older version of that crate. This can be confusing and frustrating if they are working on a project that depends on a more recent version of that crate. As an example: in April 2021, a Google search for [rustls serversession] links to version 0.5.5 of that crate, released Feb 2017. A Google search for [rustls clientsession] links to version 0.11.0, released Jan 2019. The latest version is 0.19.0, released Nov 2020.

To fix this, I propose that docs.rs's URL structure should be more like crates.io: Each crate should have an unversioned URL (docs.rs/rustls/latest) that always shows the docs for the latest version of that crate. There would continue to be versioned URLs like today (https://docs.rs/rustls/0.19.0/rustls/), accessible as defined below. I believe this will, over time, lead search engines to more often find the unversioned URL.
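
Concretely, the two kinds of URLs would look like this (using rustls as the example):

docs.rs/rustls/latest                  - always shows the docs for the latest release
https://docs.rs/rustls/0.19.0/rustls/  - pinned to a specific version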

This is a popular request:

https://github.com/rust-lang/docs.rs/issues/1006
https://github.com/rust-lang/docs.rs/issues/854
https://github.com/rust-lang/docs.rs/issues/74
https://github.com/rust-lang/docs.rs/issues/1411

It's also a problem that disproportionately affects new users of the language who haven't gotten into the habit of looking for the "go to latest version" link. I know when I was first learning Rust, this problem was a particular headache for me.

Non-working solutions

<link rel=canonical> is a commonly proposed solution, but it’s not the right fit:

The target (canonical) IRI MUST identify content that is either
duplicative or a superset of the content at the context (referring)
IRI.

Since documentation of different versions is not duplicative, this won’t work. And in fact search engines verify the property, and will disregard canonical links on a site if it does not hold.

Here are some links about Google’s handling of canonical:

https://developers.google.com/search/docs/advanced/guidelines/duplicate-content
https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
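
For reference, a canonical link is a single element in the page's <head>; pointing old versions at the unversioned URL would mean emitting markup along these lines (illustrative only, and as argued above it is not the right fit here):

<link rel="canonical" href="https://docs.rs/rustls/latest">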

Proposed Solution

For any given crate, https://docs.rs/<crate>/latest should exist and not be a redirect. It should serve the latest version of that crate. Crates.io should be updated so that the unversioned URL for a crate (e.g. https://crates.io/crates/ureq) links to the unversioned URL on docs.rs.

Sometimes people will want to link to a specific version of the documentation rather than the generic “latest” URL. There will be two ways to do that:

  • You can navigate explicitly to the versioned URL, using the version selector tools already available in docs.rs.
  • Additionally, we should follow GitHub’s UI for when you click on a line number in a source file. When you click on an anchor within a doc page, a little “...” should appear with options to either copy the line or copy a permalink. We may even want to mimic the keystroke to go to a permalink for a page (y).

Caching issues

Currently, only static files are cached. The things that change between versions of a crate are its HTML and some JS (containing JSON data used in displaying the pages). The HTML is currently not cached at all, so invalidating its cache is not a current concern. The JS is also not cached, but it has a unique URL per crate version so easily could be cached.

In case we later decide to start caching the HTML: The CloudFront rates for cache invalidation are reasonable: $0.005 per invalidation request, and purging a whole subdirectory (like /crates/<crate>/*) is considered a single invalidation request.
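
As a rough worked example (the purge volume here is hypothetical): if 1,000 crate releases per day each triggered one wildcard invalidation, that would cost about 1,000 × $0.005 = $5 per day.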

Will it work?

I’m pretty sure it will work. Search engines these days take navigation events heavily into account, so if most navigation events go to an unversioned URL, that will help a lot. Also, once we make this change, the unversioned URLs will start accumulating “link juice,” which will also help a lot.

One good demonstration that it will work is that crates.io already follows a scheme like this, and does not have the "links go to old versions" problem at all.

@pietroalbini
Member

The only reservation I have is the URL scheme: I'd prefer to have /<crate>/latest instead of /crates/<crate>, as it's consistent with what we have today and doesn't break the workflow of just typing docs.rs/crate-name in the URL bar. Otherwise I'm all for it!

@jsha
Contributor Author

jsha commented Jun 25, 2021

Ah, yep, that URL scheme does make more sense for those reasons. Updated the proposal.

@jsha
Contributor Author

jsha commented Oct 7, 2021

@rust-lang/docs-rs Does this proposal sound good? From the above, @pietroalbini is for it but I'd like to get a little more buy-in from the team before I start work on it.

@syphar
Member

syphar commented Oct 7, 2021

Hi @jsha , thanks for doing and pushing this!

In general I only have superficial knowledge of SEO / search engines, so I really cannot judge the best approach to optimizing search results.

Still some thoughts:

  • perhaps I'm missing something, but why don't we just index the latest version, and set rel=nofollow for older versions? Of course we could do this later if link juice alone doesn't help. Also, where would link juice come from if the results don't appear high enough in the search results?
  • due to CSP we cannot cache the HTML anyway, so that's nothing we have to think about right now. But in general you're right, it's only about active invalidation at the right points. (I have some ideas to solve this, but that would involve some more changes)
  • where should our sitemap point to? Base-URL or /latest/?

@jsha
Contributor Author

jsha commented Oct 7, 2021

perhaps I'm missing something, but why don't we just index the latest version, and set rel=nofollow for older versions?

A lot of the most important links, in terms of ranking, come from outside docs.rs. For instance, crates.io, GitHub, and pages discussing a crate. We don't control those directly so we can't stick rel=nofollow on links that go to older versions. Also, even if we could control them, we'd have to update all of them whenever a new version is released, to attach rel=nofollow to any links to the old version's URL.

Also, search engines no longer rank solely or even primarily based on links - real user navigations (measured various ways) count for a lot too. If users consistently navigate to the "latest" URL, that helps indicate that URL is more important.

Think of it this way: imagine each inbound link counts as 1 point and each navigation counts as 1 point. We currently have something like:

/foo/0.1.0: 5 points
/foo/0.2.0: 10 points
/foo/0.2.1: 2 points

We'd really rather have:

/foo/latest: 17 points

Also, where would link juice come from if the results don't appear high enough in the search results?

I'm not totally sure I understand the question, but I think it's: How will search engines find older crates? They'll be linked from https://docs.rs/crate/ureq/2.2.0.

where should our sitemap point to? Base-URL or /latest/?

/latest/

@jsha
Contributor Author

jsha commented Jul 5, 2022

The change to /latest/ pages has been out for a while, and I still see Google returning old results for some queries that have good results available in /latest/. For instance, rustls streamowned returns links to 0.14.0 and 0.16.0, when 0.20.6 is the latest.

I now suspect there may be issues with crawling, where crates are often not crawled deeply enough because there are so many crates, and so many pages per crate. I suspect the Google Search Console would let me dig deeper and find that out. Would the docs.rs team be willing to authorize me to look at Search Console data for docs.rs? It would involve putting a TXT record with google-site-verification=XXXXXX in it or serving a specific HTML file at the root of the domain.
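
For reference, the DNS route is a single TXT record at the domain apex, roughly like this (TTL illustrative; the token stays the placeholder above):

docs.rs.    3600    IN    TXT    "google-site-verification=XXXXXX"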

@jyn514
Member

jyn514 commented Jul 14, 2022

from @pietroalbini:

I have no problem with giving them access to the webmaster console, but we should do it by having the infrastructure team set up the account and then granting jsha access to it, rather than having jsha set up their own account linked to docs.rs. I'll add that to my todo list.

@pietroalbini
Member

Done.

@pietroalbini pietroalbini removed their assignment Jul 14, 2022
@jsha
Contributor Author

jsha commented Jul 14, 2022

Thanks! Here's our coverage report: 1.26M valid URLs, 3.28M excluded. The top reasons are "Crawled - currently not indexed" and "Duplicate without user-selected canonical." Details on those reasons here: https://support.google.com/webmasters/answer/7440203#duplicate_page_without_canonical_tag

[screenshot: index coverage report]

Drilling down, here is a sample of "Crawled - currently not indexed" URLs:

[screenshot: sample of "Crawled - currently not indexed" URLs]

And "Duplicate without user-selected canonical":

[screenshot: sample of "Duplicate without user-selected canonical" URLs]

We can drill down into individual URLs:

[screenshots: URL inspection details]

What I conclude from this first look is that we probably do need to implement rel="canonical". In its absence, Google is selecting a canonical page for us, and often getting it wrong, thus excluding the page we would like to have indexed. I'll make a proposal on #74.

@alecmocatta

What I conclude from this first look is that we probably do need to implement rel="canonical". In its absence, Google is selecting a canonical page for us, and often getting it wrong, thus excluding the page we would like to have indexed. I'll make a proposal on #74.

A couple things that might be worth trying prior to the sledgehammers of rel=canonical or noindex:

  • You could include every https://docs.rs/:crate/latest/:crate/** url in the sitemap.xml, not just the crate root. I would expect that to guide Google to the right canonical URL more often, though it increases the size and complexity of the sitemap.

  • Setting the HTTP Last-Modified header on pages. This is currently passed for crate roots via the sitemap (from some brief googling, outdated crate roots seem to be less of a problem?). But for everything else Google might well be inferring it, potentially wrongly?

  • Marking all links to specific versions (e.g. the dependencies and versions in the crate overview dropdown, links in doc comments or due to re-exporting) as rel=nofollow. This is just a hint that might minimise "passing along ranking credit to another page", but perhaps worth a shot.
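
To illustrate the first suggestion, a per-page sitemap entry with a last-modified date would look roughly like this (URL and date are only examples):

<url>
  <loc>https://docs.rs/rand/latest/rand/fn.thread_rng.html</loc>
  <lastmod>2022-07-24</lastmod>
</url>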

@syphar
Member

syphar commented Jul 24, 2022

Thank you for the ideas!

I don't know enough about Google indexing to judge the ideas, but I have one remark about:

  • You could include every https://docs.rs/:crate/latest/:crate/** url in the sitemap.xml, not just the crate root. I would expect that to guide Google to the right canonical URL more often, though it increases the size and complexity of the sitemap.

This would definitely increase the sitemap's size quite a lot; we have crates with > 1 million files (which would mean quite a lot of pages added to the sitemap).

  • Marking all links to specific versions (e.g. the dependencies and versions in the crate overview dropdown, links in doc comments or due to re-exporting) as rel=nofollow. This is just a hint that might minimise "passing along ranking credit to another page", but perhaps worth a shot.

I remember talking to @jyn514 about this. The only concern I would see is someone searching for an item that only exists in an old version of the library.

@jsha
Contributor Author

jsha commented Jul 24, 2022 via email

@jyn514
Member

jyn514 commented Jul 24, 2022

I frankly don't remember anything about this problem 😅 happy to go with whatever you decide on.

@syphar
Member

syphar commented Jul 24, 2022

I like the idea of nofollow in the dependencies links. That could reduce unnecessary crawling too.

@jsha @alecmocatta was also talking about setting old versions to nofollow, not only the dependencies.

With a hint about which way would help, I'm happy to help too. Which links shouldn't be followed?

If we fully exclude old versions that's probably a different discussion since we would completely exclude them from the index.

This would definitely increase the sitemap's size quite a lot; we have crates with > 1 million files (which would mean quite a lot of pages added to the sitemap).
We could limit the links per crate to reduce this problem. Another problem is that sitemaps are limited to 50k URLs. We have a sitemap index linking to various sitemaps, but that can only be nested one deep. Still, I like this idea. A related one: link to the crate root and all.html from the sitemap.

Thinking longer about this, this is a tough nut to crack. We don't have the generated docs in the database but only on S3. So generating the pages for a crate would involve an additional request to S3 for each crate.

I have some ideas how to solve this, but it's only worth the effort as a last option, if I'm not missing something.

@jsha
Contributor Author

jsha commented Jul 24, 2022

@jsha @alecmocatta was also talking about setting old versions to nofollow, not only the dependencies.

Thanks for clarifying. I think it makes sense to nofollow links to the old versions too, since #1773 means a search engine would crawl those, only to wind up canonicalizing them. So those are just wasted fetches.

Thinking longer about this, this is a tough nut to crack. We don't have the generated docs in the database but only on S3. So generating the pages for a crate would involve an additional request to S3 for each crate.

I have some ideas how to solve this, but it's only worth the effort as a last option, if I'm not missing something.

Yep, I agree it's not that necessary. Particularly given we have all.html available if we want to help search engines consistently discover all items in each crate. Presumably for an all.html with 1M links, a search engine would disregard links beyond some cutoff.

@LeSnake04

Thinking longer about this, this is a tough nut to crack. We don't have the generated docs in the database but only on S3. So generating the pages for a crate would involve an additional request to S3 for each crate.

I have some ideas how to solve this, but it's only worth the effort as a last option, if I'm not missing something.

Maybe just add all and the modules, structs and re-exports in the crate root (like bevy::prelude, bevy::app, ...)

@syphar
Member

syphar commented Jul 24, 2022

Thinking longer about this, this is a tough nut to crack. We don't have the generated docs in the database but only on S3. So generating the pages for a crate would involve an additional request to S3 for each crate.
I have some ideas how to solve this, but it's only worth the effort as a last option, if I'm not missing something.

Maybe just add all and the modules, structs and re-exports in the crate root (like bevy::prelude, bevy::app, ...)

Good idea in general, but in that regard docs.rs is "only" serving static files from rustdoc, so it has no detailed knowledge about the documentation files apart from some exceptions (#1781). We have the list of source files for our source browser, with which we could generate the module list, but this would tightly bind docs.rs to rustdoc implementation details (its file structure).

So it's a similar amount of effort, so again only when it's worth it :)

@jsha
Contributor Author

jsha commented Aug 9, 2022

Update:

Checking the latest data from the Google Search Console, it is still finding many pages that are "duplicate without user-selected canonical", but spot-checking them, they are all crates that have a separate documentation URL and so are not getting the <link rel="canonical"> treatment.

The Search Console allows exporting a report of click-through data, and it turns out to be an interesting way to find examples of URLs with this problem: the pages that have the highest click-through rates tend to be ones that have the "versioned URL" problem. For instance https://docs.rs/rand/0.6.5/rand/fn.thread_rng.html is the page with the single highest click-through rate on docs.rs, presumably because people search for thread_rng or rust thread_rng and it is the top result. Unfortunately, 0.6.5 is out of date: 0.8.5 is the latest version. The same pattern holds for all of the top pages by click-through rate.

I followed the thread_rng example further, and in the Search Console "inspected" the URL. It turns out https://docs.rs/rand/0.6.5/rand/fn.thread_rng.html is considered canonical - it doesn't have a <link rel="canonical">. That surprised me because https://docs.rs/rand/latest/rand/fn.thread_rng.html does have <link rel="canonical"> (the crate's documentation URL is https://docs.rs/rand).

It turns out version 0.6.5 had a different documentation URL: https://rust-random.github.io/rand. Since we don't render <link rel="canonical"> on crates that have a documentation URL, version 0.6.5 has no canonical link, even though the latest version of the crate does have one.

I think we need to provide <link rel="canonical"> on old crates even when they have their own documentation URL. To make things simpler I think it makes sense to remove entirely the exclusion for crates with their own documentation URL. I think that will not make the situation any worse for crates that want their self-hosted documentation to be canonical, and there are possibilities to make the situation better. I'll post on #74 again.
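
Concretely, that would mean an old page like https://docs.rs/rand/0.6.5/rand/fn.thread_rng.html always carries a canonical link pointing at its /latest/ equivalent, regardless of the crate's documentation URL (illustrative markup):

<link rel="canonical" href="https://docs.rs/rand/latest/rand/fn.thread_rng.html">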


Another interesting result: https://docs.rs/futures/0.1.11/futures/future/type.BoxFuture.html also has a high click-through rate, and that URL does have a <link rel="canonical">. When I "inspect" it in the Search Console, I get:

[screenshot: URL inspection result showing the Google-selected canonical]

In other words, Google sees our canonical link, parses it, and chooses to ignore it in favor of considering 0.1.11 to be the canonical URL. It's not clear why that is; perhaps version 0.1.11 has more inbound links, or has a long history of being canonical. 0.3.1 is the latest version for that crate.

@jyn514
Member

jyn514 commented Aug 9, 2022

I think we need to provide <link rel="canonical"> on old crates even when they have their own documentation URL. To make things simpler I think it makes sense to remove entirely the exclusion for crates with their own documentation URL. I think that will not make the situation any worse for crates that want their self-hosted documentation to be canonical, and there are possibilities to make the situation better. I'll post on #74 again.

That makes sense to me; we can treat the self-hosted docs as canonical for the latest version only.

@syphar
Member

syphar commented Jan 19, 2023

@jsha coming from your last comments regarding the google check:

what do you think about closing this issue, and possibly #74 ?

@jsha
Contributor Author

jsha commented Jan 19, 2023

I have a more recent comment; we still haven't finished a full recrawl. I'm happy to close this issue since the basic proposal is done, but if we wanted to treat it as more of a tracking issue we would keep it open since there's still additional work to do.

@jsha
Contributor Author

jsha commented Mar 14, 2023

On Feb 3 we deployed a change adding noindex to versioned rustdoc URLs. As of today, only 7 of the top 1000 pages visited from Google Search have a version in the URL. Presumably those just haven't been recrawled yet to see the noindex directive. By contrast, as of July 2022, 305 of the top 1000 pages had a version in the URL.

Of those 305 pages, 147 of them have their /latest/ equivalent in today's top 1000. I've spot-checked some of the rest and they have various explanations. For some, it's a difference between / vs /index.html at the end of the URL. For one it was a change of module name, since the crate was revised in the meantime. For many of them I think it's either ranking changes - for instance Google now prefers the module page for futures::stream over the trait page for futures::stream::Stream - or popularity changes, such that a given item is no longer a popular enough search to be in the top 1000.

I did stumble across one anomaly: [parse_macro_input], for which we used to rank #1, now points to https://rcos.io/static/internal_docs/syn/macro.parse_macro_input.html, which I assume is not the official documentation because it was generated with an old rustdoc (and has /internal_docs/ in the name). The docs.rs page doesn't show up in the top 10. Looking at Search Console, Google is still (incorrectly) canonicalizing https://docs.rs/syn/latest/syn/macro.parse_macro_input.html to https://docs.rs/syn/1.0/syn/macro.parse_macro_input.html, which is then excluded due to the noindex directive. I assume this is a temporary inconsistency and will resolve itself in time.

I'm satisfied that this issue is solved. 🎉

@jaskij

jaskij commented Dec 17, 2024

Commenting here, because it seems to be relevant, and the current state seems to be worse than it should be.

Right now, Google seems to rank docs.rs quite low, and never the front page of the crate. As a quick example from my experience today:

  • searching for tokio-postgres:
    • crates.io, unversioned
    • next three results are not docs.rs at all
    • next is https://docs.rs/tokio-postgres-rustls
    • some other results
    • last on the first page is https://docs.rs/tokio-postgres/latest/tokio_postgres/config/struct.Config.html
  • and then: tokio_postgres
    • crates.io, unversioned
    • rust-postgres/tokio-postgres/src/lib.rs on GitHub
    • https://docs.rs/tokio-postgres/latest/tokio_postgres/types/index.html
    • a 2021 StackOverflow question about the crate
    • https://docs.rs/tokio-postgres/latest/tokio_postgres/struct.Client.html
    • more relevant results

This has been a consistent experience for me over the past month or two, and it's getting annoying. crates.io is pretty consistently the first result though, so I've started clicking through from there.

@jsha
Contributor Author

jsha commented Dec 18, 2024

Thanks for the info! I will look into this.

@jsha jsha reopened this Dec 18, 2024
@jaskij

jaskij commented Dec 18, 2024

Some more observations:

  • it's only for some crates, many work just fine
  • adding rust at the start for tokio-postgres (or _) searches didn't change a thing

Off the top of my head I can name bytes and tokio-postgres as crates with the issue. Considering I've been implementing a binary protocol for the last four weeks, that would explain why I got the impression it was more common.

Other crates, like cidr and ipnetwork, worked just fine when searched for with the rust prefix: docs.rs, crates.io, other results.

@jsha
Contributor Author

jsha commented Dec 18, 2024

Thanks for pointing this out, we may have an issue. Google Search Console says https://docs.rs/tokio-postgres/latest/tokio_postgres/ is not indexed because it's a "duplicate without user-selected canonical." And the Google-selected canonical URL is https://docs.rs/tokio-postgres/%5E0.7.7. But also https://docs.rs/tokio-postgres/%5E0.7.7 is not in Google's index because:

'noindex' detected in 'X-Robots-Tag' http header

In general we try to convince Google that it should only index the "latest" version of a given package. We do that by setting that X-Robots-Tag header. In theory that prevents Google from seeing lots of near-duplicate pages (different versions of the same package) and incorrectly choosing one of them as canonical.
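
Roughly, the intended split looks like this (a sketch, not an exact capture of our responses):

GET /tokio-postgres/0.7.7/tokio_postgres/    ->  X-Robots-Tag: noindex
GET /tokio-postgres/latest/tokio_postgres/   ->  no X-Robots-Tag header, indexable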

In this case, that defense didn't work. I'm not sure yet exactly why or what the fix might be.

@syphar
Member

syphar commented Dec 19, 2024

@jsha thanks for staying on top of this topic!

Just a thought, and I'm not sure if we never had it, or if I accidentally dropped it at some point:

Shouldn't we also set <link rel="canonical"> in rustdoc pages to make this work? Currently I don't see it in the generated HTML.

@syphar
Member

syphar commented Dec 19, 2024

digging deeper:

  • from what I read in the code, on pages like https://docs.rs/crate/rand/latest we have the canonical URL in the HTML; we should also have it in the HTTP headers, which I don't see in the response. I will test locally whether it's perhaps filtered by CloudFront?
  • that being said, the rustdoc pages don't have it in either place; digging deeper there too

@syphar
Member

syphar commented Dec 19, 2024

Ah, I see, the canonical link added in 61bca32 was replaced with noindex on the versioned pages in 27af44b.

I feared one of the many refactors accidentally dropped the header 😅

@syphar
Member

syphar commented Dec 19, 2024

Now I'm back to: I don't know what the solution to this is :)

@jsha
Contributor Author

jsha commented Dec 19, 2024

Looking at Google Search console's "Duplicate without user-selected canonical" report, this affects about 109k URLs right now, very slightly up from September of this year. That's out of 1M pages indexed and 7M pages not indexed (mainly the versioned URLs, rejected by X-Robots-Tag).

The first page's worth is all URLs like the ones below, the /latest/ page for a given crate. They're in no particular order, but here's a sample, along with their Google-selected canonical (for some of them):

https://docs.rs/googleapis-tonic-google-cloud-bigquery-datapolicies-v1/latest/googleapis_tonic_google_cloud_bigquery_datapolicies_v1/
https://docs.rs/googleapis-tonic-google-cloud-bigquery-datapolicies-v1

https://docs.rs/llist/latest/llist/
https://docs.rs/llist

https://docs.rs/aws-sdk-kinesisvideo/latest/aws_sdk_kinesisvideo/
https://docs.rs/aws-sdk-kinesisvideo

https://docs.rs/snitch-protos/latest/protos/
https://docs.rs/snitch-protos

https://docs.rs/mtag-cli/latest/mtag_cli/
https://docs.rs/mtag-cli

https://docs.rs/texture-synthesis/latest/texture_synthesis/
https://docs.rs/texture-synthesis

https://docs.rs/em7180/latest/em7180/
https://docs.rs/em7180

https://docs.rs/midenc-hir-analysis/latest/midenc_hir_analysis/
https://docs.rs/midenc-hir-analysis

https://docs.rs/fat32/latest/fat32/
https://docs.rs/fat32

https://docs.rs/vls-protocol-client/latest/vls_protocol_client/
https://docs.rs/vls-protocol-client

The pattern for these is fairly straightforward: Google seems to prefer the shorter docs.rs/<cratename> URL, which I'm guessing is linked from various places (crates.io?). And in those cases the shorter name URL is actually indexed and everything is fine.

The tokio-postgres example is different because https://docs.rs/tokio-postgres/%5E0.7.7 (aka /^0.7.7) is not indexed (and shouldn't be, because it's blocked by X-Robots-Tag). But how does it get chosen as the canonical?

I don't know if this would fix anything, but I'd love to use robots.txt better. Right now Google has to fetch each URL before determining most of them are noindex. Not the most effective use of whatever crawl time they allocate to us.

It's a little challenging to write an Allow or Disallow rule that does what we want, because we want to basically Disallow: * / Allow: */latest/*, but I can't find any documentation confirming that two wildcards work.

We could do something hacky like Disallow: *^, which I think would disallow all URLs that have a ^ in them. It seems like our special version-selector URL syntax does tend to be implicated in some of our weird cases.
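
For concreteness, the two shapes under discussion would be roughly (untested; whether the double wildcard in the Allow line behaves as intended is exactly the open question above):

# idea 1: block everything except /latest/ pages
User-Agent: *
Disallow: *
Allow: */latest/*

# idea 2: block only the version-selector URLs
User-Agent: *
Disallow: *^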

It might be nice to have a whole separate URL space that starts with /latest/, e.g. https://docs.rs/latest/tokio-postgres. We could allow that, and disallow the rest. But I think that would be pretty disruptive. Lots of stuff links to the current /latest/ URL scheme.

@syphar
Member

syphar commented Dec 19, 2024

Google seems to prefer the shorter docs.rs/<cratename> URL, which I'm guessing is linked from various places (crates.io?). And in those cases the shorter name URL is actually indexed and everything is fine.

perhaps because we return a 302 on that endpoint, and not 301? But if everything is fine in this case, good :)

I don't know if this would fix anything, but I'd love to use robots.txt better. Right now Google has to fetch each URL before determining most of them are noindex. Not the most effective use of whatever crawl time they allocate to us.

It's a little challenging to write an Allow or Disallow rule that does what we want, because we want to basically Disallow: * / Allow: */latest/*, but I can't find any documentation confirming that two wildcards work.

It would probably have to include some more pages that are static, but I get the picture.

Is there any way to test this that won't disrupt discoverability if we're wrong?

It might be nice to have a whole separate URL space that starts with /latest/, e.g. https://docs.rs/latest/tokio-postgres. We could allow that, and disallow the rest. But I think that would be pretty disruptive. Lots of stuff links to the current /latest/ URL scheme.

true

@jsha
Contributor Author

jsha commented Dec 19, 2024

Looking again at the Search Console page for https://docs.rs/tokio-postgres/^0.7.7, I see that it has a number of referring pages:

https://docs.rs/datafusion/33.0.0/datafusion/
https://docs.rs/datafusion/36.0.0/datafusion/common/arrow/array/type.Float32BufferBuilder.html
https://docs.rs/postgres/latest/postgres/types/trait.ToSql.html
https://docs.rs/postgres/latest/postgres/transaction/struct.Transaction.html

That probably helps boost its rank, and thus its claim to be the canonical page in Google's estimation.

Looking at some of these, e.g. struct.Transaction.html, the link comes in via the menu item that links to dependencies (and it links to specific versions of those dependencies).

Since we never want search engines to index a non-latest version of docs, we also don't want them to follow these links. I think we should add rel="nofollow" to the menu item links for dependencies. What do you think @syphar?
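
Concretely, the dependency entry in the menu would go from a plain anchor to something like this (exact template and link text are illustrative):

<a href="https://docs.rs/tokio-postgres/^0.7.7" rel="nofollow">tokio-postgres ^0.7.7</a>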

I'd also like to try:

We could do something hacky like Disallow: *^, which I think would disallow all URLs that have a ^ in them. It seems like our special version-selector URL syntax does tend to be implicated in some of our weird cases.

Among other things, it could generate results faster (since Google only has to fetch robots.txt once, vs recrawling millions of pages to see that those links are nofollow now).

Edit: actually, if we do the robots.txt thing, we don't need to do the rel="nofollow" thing, because robots.txt would have the same effect. All internal links to dependencies use the ^ URL.

@jsha
Contributor Author

jsha commented Dec 19, 2024

Posted #2695 for the robots.txt change.

@syphar
Member

syphar commented Dec 20, 2024

Since we never want search engines to index a non-latest version of docs, we also don't want them to follow these links. I think we should add rel="nofollow" to the menu item links for dependencies. What do you think @syphar?
Edit: actually, if we do the robots.txt thing, we don't need to do the rel="nofollow" thing, because robots.txt would have the same effect. All internal links to dependencies use the ^ URL.

Would it be good still to do both?

@syphar
Member

syphar commented Dec 20, 2024

It's a little challenging to write an Allow or Disallow rule that does what we want, because we want to basically Disallow: * / Allow: */latest/*, but I can't find any documentation confirming that two wildcards work.

perhaps the mentioned library / CLI can help with validating? And with writing tests?

@jsha
Contributor Author

jsha commented Dec 20, 2024

Would it be good still to do both? [robots.txt and nofollow on hrefs]

The nofollow on hrefs adds some maintenance burden and some byte size, and blocking those URLs in robots.txt accomplishes exactly the same thing.

That said, one thing this has me thinking about is having the "Dependencies" menu link to the /latest/ versions of packages. That might help give search engines a bit of an "importance" signal by seeing which packages get lots of inbound links (because they are depended on by lots of other packages). That might help get more URLs crawled in the most important packages.

@syphar
Member

syphar commented Dec 21, 2024

Would it be good still to do both? [robots.txt and nofollow on hrefs]

The nofollow on hrefs adds some maintenance burden and some byte size, and blocking those URLs in robots.txt accomplishes exactly the same thing.

👍

That said, one thing this has me thinking about is having the "Dependencies" menu link to the /latest/ versions of packages. That might help give search engines a bit of an "importance" signal by seeing which packages get lots of inbound links (because they are depended on by lots of other packages). That might help get more URLs crawled in the most important packages.

Wouldn't that kind of skew the meaning? I mean, when I see incompatible major releases in a dependency, I probably would want to be linked to the major version that the crate depends on. But since we're blocking the prefixes anyway, it won't matter right now.

@jsha
Contributor Author

jsha commented Jan 7, 2025

Update: Google still considers https://docs.rs/tokio-postgres/^0.7.7 to be the canonical version and https://docs.rs/tokio-postgres/latest/tokio_postgres/ to be the "Duplicate without user-selected canonical."

It also says https://docs.rs/tokio-postgres/^0.7.7 was last crawled Jan 6, 2025, 9:20:41 AM, which suggests our robots.txt changes are failing to block that URL. I did check and https://docs.rs/robots.txt is a redirect to https://docs.rs/-/static/robots.txt, which contains what we expect based on the recent PR:

Sitemap: https://docs.rs/sitemap.xml
# Semver-based URL are always redirects, and sometimes
# confuse Google's duplicate detection, so we block crawling them.
# https://docs.rs/about/redirections
User-Agent: *
Disallow: */^
Disallow: */~

Google does recognize the robots.txt, because we now have 97 "Blocked by robots.txt" URLs. However, that's low! Checking the "examples", I see that only the URLs with the tilde (~) are blocked, not URLs with the caret (^). So probably something is odd about our handling / Google's handling of the caret in URLs. Perhaps it depends on whether it's escaped?

[screenshot: "Blocked by robots.txt" examples]

@jaskij

jaskij commented Jan 8, 2025

I did a little digging. First, from RFC 3986.

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

Looking in developer tools, both Chromium and Firefox do in fact URL-encode the caret. Your guess that Google's robot does not perform normalization is likely correct.

Some more looking around led me to google/robotstxt#64 which links RFC 9309. To quote the relevant paragraph:

If a percent-encoded ASCII octet is encountered in the URI, it MUST be unencoded prior to comparison, unless it is a reserved character in the URI as defined by [RFC3986] or the character is outside the unreserved character range. The match evaluates positively if and only if the end of the path from the rule is reached before a difference in octets is encountered.

To my reading, Google is not following the RFCs correctly here. Given the wording of RFC3986 Section 2.3, it's an easy mistake to make.
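
To spell out the mismatch with the URLs from earlier in the thread (assuming Google compares the rule against the percent-encoded path without unencoding %5E):

robots.txt rule:   Disallow: */^
crawled URL path:  /tokio-postgres/%5E0.7.7    (the caret arrives percent-encoded as %5E)
result:            no byte-for-byte match, so the rule never blocks it; the */~ rule still matches because '~' is unreserved and is not percent-encoded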

@jsha
Contributor Author

jsha commented Feb 4, 2025

Update: this didn't affect behavior much. We have an increase in pages that are "Indexed, though blocked by robots.txt". Which makes me realize our approach was somewhat wrong, because I got confused between robots.txt exclusion and noindex. Google will happily index something it can't crawl (or hasn't yet crawled), which is part of our problem here.

I think we should now go forward with adding rel="nofollow" on the internal links we generate to caret-style URLs.

Another change: On the redirects for caret-style URLs, we can serve X-Robots-Tag: noindex in the HTTP headers. This should just be on the redirects, not on the landing pages. We'd have to accompany this change with a revert to the robots.txt change we recently landed, so Google would actually be able to crawl those URLs and see the noindex. But combined with the nofollow change described above, we should hopefully see less crawling of those URLs.
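
A sketch of what such a redirect response might look like (status code and resolved version are placeholders, not the exact implementation):

HTTP/1.1 302 Found
Location: https://docs.rs/tokio-postgres/<resolved-version>/tokio_postgres/
X-Robots-Tag: noindex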
