Add rel=canonical support #74

Closed
briansmith opened this issue Oct 20, 2016 · 29 comments
Labels
A-frontend Area: Web frontend P-medium Medium priority

Comments

@briansmith

briansmith commented Oct 20, 2016

Please add the correct <link rel=canonical> link to each generated documentation page, when the crate provides a canonical documentation URL. This will help search engines disambiguate the docs.rs copies of the documentation and avoid docs.rs looking like a content farm or spam blog to search engines.

@onur
Member

onur commented Oct 20, 2016

Thanks for this awesome idea. I didn't know about rel=canonical.

I'll definitely add this, but I need to save the original documentation links first. That is the only thing I forgot to save when building crates.

@jyn514
Member

jyn514 commented Nov 27, 2019

@GuillaumeGomez is this something that should be done by docs.rs? It seems pretty easy to implement but I think it would make more sense to have a rel=canonical link in every page generated with rustdoc, not just the ones on docs.rs.

@GuillaumeGomez
Member

If I understand correctly, it's supposed to be used in the <head> part. However, I'm not sure if this'll be really useful here considering that all URLs are unique in our case... Also, the URL needs to include the domain name, so if we want to add it, I think it should be done on docs.rs side.

@jyn514
Member

jyn514 commented Nov 27, 2019

This is (correct me if I'm wrong) for documentation URLs that are explicitly given in Cargo.toml, not for different versions of the docs. See for example https://github.com/serde-rs/serde/blob/master/serde/Cargo.toml#L9. The domain name is given here, it's not necessarily the same site that's currently hosting the docs.

@GuillaumeGomez
Member

Oh in this case it makes sense to have canonical then I guess. But still on docs.rs side. :)

@jyn514 jyn514 self-assigned this Nov 27, 2019
@jyn514 jyn514 added A-frontend Area: Web frontend P-medium Medium priority and removed important labels Jun 27, 2020
@workingjubilee
Member

workingjubilee commented Aug 27, 2020

Maybe P-higher than medium, but not existentially critical, so not high (I guess that's the important label?). https://docs.rs/tracing/0.1.19/tracing existing punishes e.g. https://tracing.rs/tracing. This needs to be resolved sooner rather than later.

@workingjubilee
Member

For an explicitly specified documentation field, I'm not certain that the <link rel="canonical"> shouldn't be set on the rustdoc side: the most natural place to do this is at build time, since the value is extracted from Cargo.toml and resolves to a URL. Because of that, linking to that URL is actually part of crates.io's public API.

@jyn514
Member

jyn514 commented Aug 27, 2020

@workingjubilee this is part of Cargo.toml, not the source code. It's possible to add an unstable rustdoc flag, but since we'd be the only ones using it I'd rather just inject it with the rest of the styles.

@workingjubilee
Member

Whoops, yeah, that would have to be as part of cargo doc and no deeper, ideally.

@pietroalbini
Member

Hmm, I'm not sure relying on the package.documentation key of Cargo.toml is the best idea for this: that key could point to (for example) an mdBook, and will probably break when adding rel="canonical" to subpages.

For people that explicitly want this I'd add something like this:

[package.metadata.docs.rs]
canonical-url = "https://api.example.com/{version}/{path}"

@pietroalbini
Member

pietroalbini commented Aug 28, 2020

An explicit metadata key with placeholders could also be used to display a banner in the UI:

[Image: mockup of the banner (2020-08-28--11-02-33)]

@jyn514
Member

jyn514 commented Aug 28, 2020

> that key could point to (for example) an mdBook, and will probably break when adding rel="canonical" to subpages.

I was imagining it would point to the landing page even from sub pages. That way you wouldn't get spurious 404s.

I like the idea of a banner with a link that opens in a new tab though, that seems much more discoverable.

@briansmith do you have a preference of what this should look like?

@pietroalbini
Member

> I was imagining it would point to the landing page even from sub pages. That way you wouldn't get spurious 404s.

That's not how rel="canonical" works: a page with that tag should have the same contents as the canonical one. We can assume https://docs.rs/rustwide/struct.Workspace.html and https://example.com/rustwide/struct.Workspace.html have roughly the same contents, but https://docs.rs/rustwide/struct.Workspace.html and https://example.com/rustwide/ definitely do not have the same content in them.


I thought about this a bit more, and here is how I would implement this.

Goals

  • We should respect the crate author's wishes to have their documentation hosted outside of docs.rs.
  • We should ensure people can always view the documentation on docs.rs for every version if they explicitly wish to.

rel="canonical"

If the crate doesn't specify a package.metadata.docs.rs.canonical-url, the canonical URL will always be https://docs.rs/{crate}/latest/{path}. When a user visits that page, docs.rs will not redirect, but instead serve the latest version of that page.

If a package.metadata.docs.rs.canonical-url is specified, the canonical URL will be that path, with {path} replaced with the file path. If {path} is not present the path will be appended at the end of the URL.
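The resolution rules above can be sketched in a few lines. This is only an illustration of the proposal, not docs.rs code: the function name `canonical_url` and its parameters are made up here, and `template` stands in for the proposed package.metadata.docs.rs.canonical-url value. The `{version}` placeholder from the earlier snippet is left out because the rules above only define `{path}`.

```python
from typing import Optional

DOCS_RS = "https://docs.rs"


def canonical_url(crate: str, path: str, template: Optional[str] = None) -> str:
    """Resolve the canonical URL for one documentation page.

    `template` is the (proposed, hypothetical) canonical-url metadata value,
    or None when the crate does not set one.
    """
    path = path.lstrip("/")
    if template is None:
        # No override: point at the latest version hosted on docs.rs.
        return f"{DOCS_RS}/{crate}/latest/{path}"
    if "{path}" in template:
        # Substitute the page's file path into the author's template.
        return template.replace("{path}", path)
    # No placeholder present: append the path at the end of the URL.
    return template.rstrip("/") + "/" + path


print(canonical_url("rustwide", "rustwide/struct.Workspace.html"))
print(canonical_url("rustwide", "rustwide/struct.Workspace.html",
                    "https://example.com/{path}"))
```

Note that the mapping is page-to-page: a sub page never resolves to the landing page, which matches the requirement that a page and its canonical URL have roughly the same contents.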

Banner

If a package.metadata.docs.rs.canonical-url is specified, on the latest version we will display a banner similar to this:

[Image: mockup of the banner (2020-08-28--11-02-33)]

When the user clicks on it they will get redirected to the canonical URL. The banner will be collapsible, and that preference will be stored in the user's browser (each crate will have its own separate cookie to collapse it). Since the metadata won't allow specifying a version, the banner will only be shown on the latest version.

@jyn514
Member

jyn514 commented Aug 28, 2020

> Since the metadata won't allow specifying a version, the banner will only be shown on the latest version.

This doesn't seem right - metadata is per-release, not per crate. So any release that specifies a canonical URL should have the banner.

> Hmm, I'm not sure relying on the package.documentation key of Cargo.toml is the best idea for this: that key could point to (for example) an mdBook, and will probably break when adding rel="canonical" to subpages.

> [The way] rel="canonical" works [is] a page with that tag should have the same contents as the canonical one.

Ok, that makes sense, but I don't want to introduce yet another docs.rs specific toggle. There are a lot of existing crates using package.documentation and it's standardized by Cargo, I don't want to make a new button if we can avoid it. Maybe we could only add rel=canonical for canonical-url, but still add the banner for package.documentation? That way we wouldn't have incorrect rel=canonical but it would still be clear that the author has a preferred documentation site without having to jump through hoops.

All the other suggestions about the banner seem fine to me, although I'd be interested in Brian's opinion.

@pietroalbini
Member

> This doesn't seem right - metadata is per-release, not per crate. So any release that specifies a canonical URL should have the banner.

That only works if the crate owner publishes each version to a different URL and updates the URL in the Cargo.toml every time. Neither ring nor tracing, for example, does so. Assuming that someone who views the documentation for an old version is explicitly interested in that version, the banner will point to a completely different thing. As a user, I wouldn't want to get a banner pointing to the tokio 0.2 docs if I'm visiting the 0.1 docs.

> Maybe we could only add rel=canonical for canonical-url, but still add the banner for package.documentation? That way we wouldn't have incorrect rel=canonical but it would still be clear that the author has a preferred documentation site without having to jump through hoops.

Then we need to tweak the wording of the banner to something like "the author also provides additional documentation on [domain name]". That makes sense and it would be very useful to link (for example) to the serde book, but it has a completely different meaning than the banner in my mockup.

@workingjubilee
Member

workingjubilee commented Aug 29, 2020

"The authors of this crate prefer to host their documentation on (website)" sounds weird and a little passive-aggressive, frankly.

And a crate author preferring to host a new version is not a big deal. Users actively complain when they find old versions. That's kind of the entire point of wanting to canonicalize anything at all: It makes Rust look horribly dated and incapable of organizing its own documentation when the first result in searches is something from 2015~2016 even though several major versions have come and gone since then. So while other issues apply, sure, there is not a problem with pointing to an author's preferred up-to-date version. It's quite relevant to point to it even on older versions because that provides information on what the author actually intends to support.

Many times people have asked me how to fix a problem with a crate and I have suggested, several times, "try using a newer version" and lo and behold...! It would be slightly extreme to make the tokio 0.1 docs cease to exist so I stop finding them, yet that would much better match the typical desire.

@pietroalbini
Member

> "The authors of this crate prefer to host their documentation on (website)" sounds weird and a little passive-aggressive, frankly.

That's totally fair, I didn't spend much time thinking about the message in a mockup :)

> And a crate author preferring to host a new version is not a big deal. Users actively complain when they find old versions. That's kind of the entire point of wanting to canonicalize anything at all: It makes Rust look horribly dated and incapable of organizing its own documentation when the first result in searches is something from 2015~2016 even though several major versions have come and gone since then. So while other issues apply, sure, there is not a problem with pointing to an author's preferred up-to-date version. It's quite relevant to point to it even on older versions because that provides information on what the author actually intends to support.

> Many times people have asked me how to fix a problem with a crate and I have suggested, several times, "try using a newer version" and lo and behold...! It would be slightly extreme to make the tokio 0.1 docs cease to exist so I stop finding them, yet that would much better match the typical desire.

If we just want to set rel="canonical" to point at docs.rs's latest version in the initial iteration that's also ok! That should solve the immediate problem.

@jyn514
Member

jyn514 commented Aug 31, 2020

I am strongly against setting rel=canonical to docs.rs without looking at package.documentation.

> https://docs.rs/tracing/0.1.19/tracing existing punishes e.g. https://tracing.rs/tracing.

We should fix that before we try and improve our own SEO.

@pietroalbini
Member

We can't use the value of package.documentation as the rel="canonical" though. I extracted the value of the field across the latest published versions of every crate, and while some of them indeed point to the API documentation, there are also a bunch that link to non-rustdoc documentation (like a user guide) or to the GitHub repository/README.

@jsha
Contributor

jsha commented Jul 14, 2022

As noted in #1438, lack of rel="canonical" seems to be hurting search results within docs.rs too. Given a plethora of identical content across various releases of a crate, Google chooses one effectively at random, which can cause a page from a newer version of a crate to be excluded in favor of that same page from an older version of a crate. I have a two-part proposal:

Whichever tag we emit, we would do it in docs.rs at page load time, not in rustdoc and not at doc build time. This allows us to include the tags even for old releases, and to remove them easily if they turn out not to have the effect we want.
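As a rough illustration of emitting the tag at page load time rather than at doc build time, the sketch below adds a tag to an already-built page just before its closing `</head>`. The helper names are hypothetical and this is not how docs.rs is known to implement it; it assumes the stored page contains a lowercase `</head>`.

```python
import html


def canonical_link(url: str) -> str:
    """Render a <link rel="canonical"> element with the URL escaped."""
    return f'<link rel="canonical" href="{html.escape(url, quote=True)}">'


def inject_head_tag(page_html: str, tag: str) -> str:
    """Insert `tag` just before the first </head> of a stored page."""
    marker = "</head>"
    index = page_html.find(marker)
    if index == -1:
        # No closing </head> found: serve the page unchanged.
        return page_html
    return page_html[:index] + tag + page_html[index:]


stored = "<html><head><title>tracing</title></head><body></body></html>"
served = inject_head_tag(
    stored, canonical_link("https://docs.rs/tracing/latest/tracing/")
)
print(served)
```

Because the stored HTML is never rewritten, the same function applies to old releases, and dropping the call removes the tag again.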

@jsha
Contributor

jsha commented Jul 20, 2022

  • For crates with a package.documentation that does not start with https://docs.rs, set no rel="canonical", and set <meta name="robots" content="noindex"> (https://developers.google.com/search/docs/advanced/crawling/block-indexing). This should prevent the docs.rs documentation from competing with self-hosted documentation for canonical status, while leaving the docs.rs pages available for users who navigate there directly.

An update on this: https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls says:

> Don't use noindex as a means to prevent selection of a canonical page. This directive is intended to exclude the page from the index, not to manage the choice of a canonical page.

Also, checking one of the examples above, https://tracing.rs has <meta name="robots" content="noindex"> in its source. Presumably the intent is that people should find the docs.rs page on search rather than the prerelease docs on https://tracing.rs. If the tracing crate set its documentation URL to https://tracing.rs (it doesn't), we would wind up in the tricky situation where neither https://tracing.rs nor https://docs.rs/tracing showed up in search indexes.

Maybe this isn't a big deal - I haven't surveyed the list of crates with documentation URLs that @pietroalbini thoughtfully provided to see how common it might be. But it makes me realize that adding noindex on docs.rs any time there is a documentation URL may be too aggressive.

@syphar
Member

syphar commented Jul 20, 2022

> Also, checking one of the examples above, https://tracing.rs has <meta name="robots" content="noindex"> in its source. Presumably the intent is that people should find the docs.rs page on search rather than the prerelease docs on https://tracing.rs. If the tracing crate set its documentation URL to https://tracing.rs (it doesn't), we would wind up in the tricky situation where neither https://tracing.rs nor https://docs.rs/tracing showed up in search indexes.

> Maybe this isn't a big deal - I haven't surveyed the list of crates with documentation URLs that @pietroalbini thoughtfully provided to see how common it might be. But it makes me realize that adding noindex on docs.rs any time there is a documentation URL may be too aggressive.

If we don't want to exclude the docs.rs pages from Google even when we have a documentation URL, then we should also return a normal canonical URL pointing to our latest version, right?

This would only be problematic if the documentation URL also points to rustdoc content, wouldn't it?

@syphar
Member

syphar commented Jul 20, 2022

so google can choose between the versioned pages on docs.rs?

@jsha
Contributor

jsha commented Aug 9, 2022

An update from #1438 (comment):

There are a fairly large number of pages that are not getting the /latest/ treatment in Google's index because they have a documentation URL that points somewhere other than docs.rs, which means they don't have <link rel="canonical"> (as I proposed above, and implemented). One particularly notable effect is that an older version of a crate can have a non-docs.rs URL, which will mean the older version doesn't get <link rel="canonical">, and may itself get incorrectly selected as canonical.

I think the simplest solution is to apply <link rel="canonical"> to all versions of all crates, and not make an exception for crates that have their own doc URL. Here's my reasoning:

For crates that have a self-hosted doc URL, we can't just point rel="canonical" at that doc URL. The doc URL could contain versioned URLs (like docs.rs); it could contain unversioned URLs (if only one version is hosted at a time); or it could be generic high-level documentation that doesn't match URLs one-for-one. Without knowing which, we'll get it wrong a good chunk of the time.

Given that, there's nothing we can do unilaterally to boost the ranking of that doc URL. Instead, we should make a mechanism available for crate authors to say "I would prefer my self-hosted documentation to show up on Google instead of docs.rs' documentation." For instance, we could provide a package.metadata.docs.rs field, something like noindex = true. When that field is present on the latest version of a crate, docs.rs would render all versions of that crate with <meta name="robots" content="noindex">. Or we could use the package.metadata.docs.rs.canonical-url field that @pietroalbini proposed in 2020. Either way I think we need to treat the latest version of this metadata field as affecting all versions of the crate. And I think we should skip the banner.

Assuming folks here agree with that conclusion, we can uncouple the two issues: fixing canonicalization within docs.rs, and offering a noindex option so crates can choose to boost their off-docs.rs documentation.
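To make the decoupled behaviour concrete, here is a minimal sketch of which tag a page might receive. It assumes the hypothetical `noindex = true` field mentioned above, read from the latest release and applied to every version. Emitting only one of the two tags is a choice made for this sketch, not something settled in the discussion, and the function name is invented.

```python
import html
from typing import List


def head_tags(crate: str, inner_path: str, latest_noindex: bool) -> List[str]:
    """Tags to add to a page of any version of `crate`.

    `latest_noindex` is the (hypothetical) noindex value taken from the
    latest release's metadata; it applies to every version of the crate.
    """
    if latest_noindex:
        # The author opted out of indexing on docs.rs for all versions.
        return ['<meta name="robots" content="noindex">']
    # Otherwise every version points at the same page under /latest/,
    # whether or not the crate has its own documentation URL.
    url = f"https://docs.rs/{crate}/latest/{inner_path.lstrip('/')}"
    return [f'<link rel="canonical" href="{html.escape(url, quote=True)}">']


print(head_tags("tracing", "tracing/index.html", latest_noindex=False))
print(head_tags("tracing", "tracing/index.html", latest_noindex=True))
```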

@jsha
Contributor

jsha commented Aug 31, 2022

With #1792 released, instances of "Duplicate without user-selected canonical" have decreased, and nearly all of them are of the form https://docs.rs/crate/cargo-bump/1.1.0. In other words crate pages. We should give the same <link rel="canonical"> treatment to crate pages, though it's not as crucial as for doc pages, since almost no-one goes to crate pages from search (4 out of 1000 top pages visited from search, according to Google Search Console).

@syphar
Member

syphar commented Sep 1, 2022

This is awesome! thank you for driving this forward.

My feeling would be that we should add the canonical URL to these crate pages too, for the sake of completeness, and then close this issue.

@jsha
Contributor

jsha commented Sep 16, 2022

> nearly all of them are of the form https://docs.rs/crate/cargo-bump/1.1.0. In other words crate pages.

In #1829, with some help, I realized the common factor for these remaining duplicates is not just that they are crate pages. They are crate pages for binary crates, which have no docs. And Google is considering them duplicates of, e.g. https://docs.rs/cargo-bump, since there's a 302 (temporary) redirect from https://docs.rs/cargo-bump/ to https://docs.rs/crate/cargo-bump/1.1.0. We don't want that to change, since that redirect could change in the future if the binary crate adds docs. So this subset of URLs will just trigger duplicate detection forever, which is fine.

Spot-checking, about 50% of recently crawled duplicate URLs fall in that category, while about 45% fall in the category that will be fixed by #1829.

Meanwhile, for the overall problem, this graph is encouraging:

[Image: Search Console graph of "Duplicate without user-selected canonical" URLs over time]

It shows "Duplicate without user-selected canonical" going from 1,158,432 to 862,124 over the course of about 85 days, for a drop of 296,308 URLs, or about 3,485 per day. At this rate it will take about 247 days to go to zero, although in reality it will flatten out at some non-zero level eventually.
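The arithmetic behind those figures can be reproduced directly from the numbers quoted above. The variable names are just for this sketch, and the projection assumes the linear rate continues. The per-day rate works out to roughly 3,485.98, which is 3,485 when truncated and 3,486 when rounded.

```python
start = 1_158_432  # "Duplicate without user-selected canonical" at the start
end = 862_124      # the same count about 85 days later
days = 85

drop = start - end            # URLs no longer reported as duplicates
per_day = drop / days         # average daily decrease
days_to_zero = end / per_day  # linear extrapolation of the remaining count

print(drop)                 # 296308
print(round(per_day))       # 3486
print(round(days_to_zero))  # 247
```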

According to another page on the Search Console we get about 85k crawls per day, of which 75% is refresh and 25% is discovery.

There's another report on the Search Console for "Indexed pages" - those that made it through duplicate detection and will show up in search results. Here's a sample of recently crawled indexed pages:

https://docs.rs/harfbuzz-sys/0.1.15/harfbuzz_sys/fn.hb_buffer_create.html
https://docs.rs/ux_serde/0.2.0/ux_serde/struct.i103.html
https://docs.rs/nom/5.1.2/nom/macro.flat_map.html
https://docs.rs/opentelemetry/0.16.0/src/opentelemetry/metrics/value_recorder.rs.html
https://docs.rs/druid/0.6.0/druid/struct.FileDialogOptions.html
https://docs.rs/cookie/0.13.1/src/cookie/draft.rs.html
https://docs.rs/medea/latest/medea/
https://docs.rs/crate/gtk/0.4.0
https://docs.rs/seahorse/0.7.1/seahorse/struct.Command.html
https://docs.rs/winapi/0.3.7/winapi/um/wincrypt/constant.CERT_NOT_BEFORE_FILETIME_PROP_ID.html
https://docs.rs/ibm_db/0.1.6/?search=_IMAGE_THUNK_DATA64
https://docs.rs/axum/latest/axum/body/index.html

As you can see, a surprising number of versioned URLs are still getting indexed instead of dup'ed out by the canonical tag (659 out of 1000). Inspecting these URLs in the search console shows that Google is aware of the canonical tag but disregarded it and considers the versioned URL canonical:

[Image: Search Console URL inspection result]

The common factor in these is that they had a referring link from a versioned page. I suspect Google is weighting the existing links more heavily than the canonical tag when making the decision. Probably this effect will lessen with time as more of the existing pages are recrawled and canonicalized.

We might be able to speed up the process by bumping the <lastmod> tag in our sitemaps. Right now that date reflects the most recent build for the crate, and a lot of crates are very infrequently built, which leads Google not to crawl their docs. We could bring all the <lastmod> tags up to the date we added canonical tags.
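One way to picture the `<lastmod>` bump: each sitemap entry reports the later of the crate's most recent build date and the date the canonical tags went live. The sketch below assumes that reading. The function name, the example URL, and both dates are placeholders, not values taken from docs.rs.

```python
from datetime import date
from xml.sax.saxutils import escape


def sitemap_entry(loc: str, last_build: date, tags_added: date) -> str:
    """One sitemap <url> entry whose <lastmod> is never older than `tags_added`."""
    lastmod = max(last_build, tags_added)
    return (
        "<url>"
        f"<loc>{escape(loc)}</loc>"
        f"<lastmod>{lastmod.isoformat()}</lastmod>"
        "</url>"
    )


# Placeholder dates: a crate last built years before the tags were added.
print(sitemap_entry(
    "https://docs.rs/example-crate/latest/example_crate/",
    last_build=date(2018, 5, 1),
    tags_added=date(2022, 8, 1),
))
```

Crates built after the rollout keep their real build date, so only stale entries are bumped.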

@jsha
Contributor

jsha commented Mar 14, 2023

I've closed out #1438. For duplicates within docs.rs, we solved the problem by setting noindex on outdated versions.

I'm also closing out this issue. Based on the discussion above, automatically setting a <link rel="canonical"> based on a crate's documentation field won't work, because we don't know what format the documentation is in and how to map individual pages to it. If we want to solve the original problem (docs.rs causes self-hosted docs to not appear in search), we'll do it by adding an explicit mechanism for crates to indicate noindex for their pages on docs.rs.

@jsha jsha closed this as completed Mar 14, 2023
@andrewtj

andrewtj commented Mar 17, 2023

> If we want to solve the original problem (docs.rs causes self-hosted docs to not appear in search), we'll do it by adding an explicit mechanism for crates to indicate noindex for their pages on docs.rs.

Would including a snippet via rustdoc's --html-in-header feature work?

Cargo.toml:

[package.metadata.docs.rs]
rustdoc-args = ["--html-in-header", "noindex.html"]

noindex.html:

<meta name="robots" content="noindex">

9 participants