Add rel=canonical support #74
Comments
Thanks for this awesome idea. I didn't know about I'll definitely add this, but I need to save original documentation links first. That is the only thing I forgot to save when building crates. |
@GuillaumeGomez is this something that should be done by docs.rs? It seems pretty easy to implement but I think it would make more sense to have a |
If I understand correctly, it's supposed to be used in the |
This is (correct me if I'm wrong) for documentation URLs that are explicitly given in Cargo.toml, not for different versions of the docs. See for example https://github.com/serde-rs/serde/blob/master/serde/Cargo.toml#L9. The domain name is given here, it's not necessarily the same site that's currently hosting the docs. |
Oh in this case it makes sense to have canonical then I guess. But still on docs.rs side. :) |
P-higher than medium maybe but not existentially critical so not high (I mean I guess that's the important label?), |
I don't know for certain that for an explicitly specified documentation field the |
@workingjubilee this is part of cargo.toml, not the source code. It's possible to add an unstable rustdoc flag but since we'd be the only ones using it I'd rather just inject it with the rest of the styles. |
Whoops, yeah, that would have to be as part of cargo doc and no deeper, ideally. |
Hmm, I'm not sure relying on the For people that explicitly want this I'd add something like this:
[package.metadata.docs.rs]
canonical-url = "https://api.example.com/{version}/{path}" |
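To make the proposed template concrete, here is a minimal sketch of how the `{version}` and `{path}` placeholders might be expanded. The `canonical-url` key is only a suggestion in this thread, and the function name `expand_canonical_url` and the sample version and path are hypothetical, not part of docs.rs.

```python
def expand_canonical_url(template: str, version: str, path: str) -> str:
    """Fill the {version} and {path} placeholders of a canonical-url template.

    Hypothetical helper: the thread only proposes the template syntax,
    not how docs.rs would expand it.
    """
    # Drop a leading slash so the template alone decides where separators go.
    relative_path = path.lstrip("/")
    return template.replace("{version}", version).replace("{path}", relative_path)


# Illustrative values only.
print(expand_canonical_url(
    "https://api.example.com/{version}/{path}",
    version="1.2.3",
    path="/example/index.html",
))
```

Because the path is substituted per page, a sub page would point at its counterpart on the author's site rather than at the landing page.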
I was imagining it would point to the landing page even from sub pages. That way you wouldn't get spurious 404s. I like the idea of a banner with a link that opens in a new tab though, that seems much more discoverable. @briansmith do you have a preference of what this should look like? |
That's not how I thought about this a bit more, and here's how I would implement this. Goals
|
This doesn't seem right - metadata is per-release, not per crate. So any release that specifies a canonical URL should have the banner.
Ok, that makes sense, but I don't want to introduce yet another docs.rs specific toggle. There are a lot of existing crates using All the other suggestions about the banner seem fine to me, although I'd be interested in Brian's opinion. |
That only works if the crate owner publishes each version to a different URL and updates the URL in the
Then we need to tweak the wording of the banner to something like "the author also provides additional documentation on [domain name]". That makes sense and it would be very useful to link (for example) to the serde book, but it has a completely different meaning than the banner in my mockup. |
"The authors of this crate prefer to host their documentation on (website)" sounds weird and a little passive-aggressive, frankly. And a crate author preferring to host a new version is not a big deal. Users actively complain when they find old versions. That's kind of the entire point of wanting to canonicalize anything at all: It makes Rust look horribly dated and incapable of organizing its own documentation when the first result in searches is something from 2015~2016 even though several major versions have come and gone since then. So while other issues apply, sure, there is not a problem with pointing to an author's preferred up-to-date version. It's quite relevant to point to it even on older versions because that provides information on what the author actually intends to support. Many times people have asked me how to fix a problem with a crate and I have suggested, several times, of "try using a newer version" and lo and behold...! It would be slightly extreme to make the tokio 0.1 docs cease to exist so I stop finding them, yet that would much better match the typical desire. |
That's totally fair, I didn't spend much time thinking about the message in a mockup :)
If we just want to set |
I am strongly against setting rel=canonical to docs.rs without looking at package.documentation.
We should fix that before we try and improve our own SEO. |
We can't use the value of |
As noted in #1438, lack of rel="canonical" seems to be hurting search results within docs.rs too. Given a plethora of identical content across various releases of a crate, Google chooses one effectively at random, which can cause a page from a newer version of a crate to be excluded in favor of that same page from an older version of a crate. I have a two-part proposal:
Whichever tag we emit, we would do it in docs.rs at page load time, not in rustdoc and not at doc build time. This allows us to include the tags even for old releases, and to remove them easily if they turn out not to have the effect we want. |
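As a rough illustration of emitting a tag at page load time rather than at doc build time, the sketch below splices a tag into already-built HTML just before `</head>`. The function name `inject_head_tag` and the sample markup are assumptions for illustration; this is not how docs.rs is actually implemented.

```python
import re

# Matches the closing head tag regardless of case or stray whitespace.
_HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def inject_head_tag(html: str, tag: str) -> str:
    """Return `html` with `tag` inserted just before the first </head>.

    The stored page is never modified, so the tag can be changed or
    removed later without rebuilding any documentation.
    """
    match = _HEAD_CLOSE.search(html)
    if match is None:
        # No head section to extend; serve the page unchanged.
        return html
    return html[:match.start()] + tag + html[match.start():]


page = "<html><head><title>docs</title></head><body></body></html>"
print(inject_head_tag(page, '<link rel="canonical" href="https://example.com/docs/">'))
```

Doing this at serve time is what lets old releases receive the tag too, since their stored HTML stays as it was built.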
An update on this: https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls says:
Also, checking one of the examples above, https://tracing.rs has Maybe this isn't a big deal - I haven't surveyed the list of crates with documentation URLs that @pietroalbini thoughtfully provided to see how common it might be. But it makes me realize that adding |
If we don't want to exclude the docs.rs pages from Google even when we have a documentation-url, then we should also return a normal canonical URL to our latest version, right? This would only be problematic if documentation-url also points to rustdoc content, wouldn't it? |
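One way to picture "a canonical URL to our latest version" is a function that rewrites the version segment of a versioned page URL. The sketch assumes a `/{crate}/{version}/{rest}` URL layout and a `latest` version alias; the name `latest_canonical_url` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def latest_canonical_url(url: str) -> str:
    """Map a versioned rustdoc page URL to its `latest` counterpart.

    Assumes paths of the form /{crate}/{version}/{rest}; anything else,
    including /crate/... detail pages, is returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")  # ["", crate, version, ...rest]
    if len(segments) < 3 or not segments[1] or not segments[2]:
        return url
    if segments[1] == "crate":
        return url
    segments[2] = "latest"
    # Query string and fragment are dropped from the canonical form.
    return urlunsplit((parts.scheme, parts.netloc, "/".join(segments), "", ""))


print(latest_canonical_url(
    "https://docs.rs/harfbuzz-sys/0.1.15/harfbuzz_sys/fn.hb_buffer_create.html"
))
```

Every versioned copy of a page then names the same target, which gives a crawler one URL to consolidate duplicates onto.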
so google can choose between the versioned pages on docs.rs? |
An update from #1438 (comment): There are a fairly large number of pages that are not getting the I think the simplest solution is to apply For crates that have a self-hosted doc URL, we can't just point Given that, there's nothing we can do unilaterally to boost the ranking of that doc URL. Instead, we should make a mechanism available for crate authors to say "I would prefer my self-hosted documentation to show up on Google instead of docs.rs' documentation." For instance, we could provide a package.metadata.docs.rs field, something like Assuming folks here agree with that conclusion, we can uncouple the two issues: fixing canonicalization within docs.rs, and offering a noindex option so crates can choose to boost their off-docs.rs documentation. |
With #1792 released, instances of "Duplicate without user-selected canonical" have decreased, and nearly all of them are of the form |
This is awesome! thank you for driving this forward. My feeling would be that we should add the canonical url to these crate-pages too, for the sake of completeness, and then close this issue. |
In #1829, with some help, I realized the common factor for these remaining duplicates is not just that they are crate pages. They are crate pages for binary crates, which have no docs. And Google is considering them duplicates of, e.g. https://docs.rs/cargo-bump, since there's a 302 (temporary) redirect from https://docs.rs/cargo-bump/ to https://docs.rs/crate/cargo-bump/1.1.0. We don't want that to change, since that redirect could change in the future if the binary crate adds docs. So this subset of URLs will just trigger duplicate detection forever, which is fine. Spot-checking, about 50% of recently crawled duplicate URLs fall in that category, while about 45% fall in the category that will be fixed by #1829. Meanwhile, for the overall problem, this graph is encouraging: It shows "Duplicate without user-selected canonical" going from 1,158,432 to 862,124 over the course of about 85 days, for a drop of 296,308 URLs or 3,485 per day. At this rate it will take about 247 days to go to zero, although in reality it will flatten out at some non-zero level eventually. According to another page on the Search Console we get about 85k crawls per day, of which 75% is refresh and 25% is discovery. There's another report on the Search Console for "Indexed pages" - those that made it through duplicate detection and will show up in search results. Here's a sample of recently crawled indexed pages: https://docs.rs/harfbuzz-sys/0.1.15/harfbuzz_sys/fn.hb_buffer_create.html As you can see, a surprising number of versioned URLs are still getting indexed instead of dup'ed out by the canonical tag (659 out of 1000). Inspecting these URLs in the search console shows that Google is aware of the canonical tag but disregarded it and considers the versioned URL canonical: The common factor in these is that they had a referring link from a versioned page. I suspect Google is weighting the existing links more heavily than the canonical tag when making the decision.
Probably this effect will lessen with time as more of the existing pages are recrawled and canonicalized. We might be able to speed up the process by bumping the |
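The projection in the comment above can be reproduced with a few lines of arithmetic. The three input figures are the ones quoted from the Search Console; the variable names are illustrative, and the linear extrapolation is the comment's own simplification.

```python
# Figures quoted in the comment above.
duplicates_start = 1_158_432
duplicates_end = 862_124
elapsed_days = 85

drop = duplicates_start - duplicates_end
per_day = drop / elapsed_days
# Straight-line extrapolation; the real curve is expected to flatten out.
days_to_zero = duplicates_end / per_day

print(f"drop: {drop:,} URLs")
print(f"rate: {int(per_day):,} per day (truncated)")
print(f"days to zero at that rate: {days_to_zero:.0f}")
```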
I've closed out #1438. For duplicates within docs.rs, we solved the problem by setting noindex on outdated versions. I'm also closing out this issue. Based on the discussion above, automatically setting a |
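The "noindex on outdated versions" rule can be sketched as a small decision function. This is an illustration under assumptions: the function name `robots_meta` is hypothetical, and "outdated" is taken here to mean simply "not the latest release", which may be cruder than what docs.rs does.

```python
def robots_meta(version: str, latest_version: str) -> str:
    """Return the robots meta tag to emit for a page of `version`.

    Assumption: any release other than the latest counts as outdated.
    """
    if version == latest_version:
        # Latest release: emit nothing so the page stays indexable.
        return ""
    return '<meta name="robots" content="noindex">'


# Illustrative version numbers only.
print(repr(robots_meta("0.1.15", latest_version="0.3.0")))
print(repr(robots_meta("0.3.0", latest_version="0.3.0")))
```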
Would including a snippet via rustdoc's
Cargo.toml:
[package.metadata.docs.rs]
rustdoc-args = ["--html-in-header", "noindex.html"]
noindex.html:
<meta name="robots" content="noindex"> |
Please add the correct <link rel=canonical> link to each generated documentation page, when the crate provides a canonical documentation URL. This will help search engines disambiguate the docs.rs copies of the documentation and avoid docs.rs looking like a content farm or spam blog to search engines.