Seemingly duplicate results in search engines like Google #66

Closed
Flimm opened this issue Apr 26, 2022 · 14 comments
Labels
seo Related to search engine indexing and discoverability

Comments

@Flimm
Contributor

Flimm commented Apr 26, 2022

Let's say I'm trying to find some documentation about postings. I open Google and search for "hledger posting". These are the results that I see:

image

As you can see, the first results look similar. The first three have the same title "journal manual - hledger", and they have the same breadcrumb "https://hledger.org › journal". The links of the first three results are:

(At the time of writing, the latest version of hledger is 1.25)

As you can see, these results are basically the same page, but for different versions of hledger. As you can imagine, this can make it harder to find what I was looking for.

Here are some suggestions:

  • Include the version number of hledger in the <title> tag of these pages. That way, the titles on Google will look different, like "journal manual - hledger 1.0", "journal manual - hledger 1.13", and so on.
  • Consider deindexing old versions of hledger from search engines by adding <meta name="robots" content="noindex"> to the pages for older versions only, or consider removing older versions altogether.
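
A minimal sketch of both suggestions in Python, assuming the site build can call a helper while rendering each manual page. The names `head_tags` and `CURRENT_VERSION` are hypothetical and not part of hledger's actual site tooling:

```python
from html import escape

# Assumption: the latest release at the time of writing.
CURRENT_VERSION = "1.25"


def head_tags(page_title, version, current=CURRENT_VERSION):
    """Return the <head> tags for one manual page.

    The version always goes into the <title>; every version other than
    the current one also gets a robots noindex tag.
    """
    tags = [f"<title>{escape(page_title)} - hledger {escape(version)}</title>"]
    if version != current:
        tags.append('<meta name="robots" content="noindex">')
    return "\n".join(tags)


for v in ["1.0", "1.13", "1.25"]:
    print(head_tags("journal manual", v))
```

With this, only the page for the current release stays indexable, and each search result title shows which version it belongs to.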
@simonmichael
Owner

Great suggestions, thanks. I'll work on this if no-one beats me to it.

@simonmichael
Owner

simonmichael commented May 6, 2022

I believe this is done. Changes:

  • manuals now always include the version in their HTML title
  • and in their site TOC links
  • unversioned manual urls like /hledger.html now redirect to the versioned url for current release
  • rendering should be more robust, with less chance of empty dev manuals
  • all manuals except the current release version have the noindex meta tag
  • unpackaged old versions have been dropped to save space/time

A test search: https://www.google.com/search?q=hledger+posting

Old: New:
Screen Shot 2022-05-06 at 03 04 11

@simonmichael
Owner

simonmichael commented May 6, 2022

  • and a bunch more tuning of redirects for old manual urls.

I think search results (even just Google's) may take a long time to clean up, and probably it would be wise to create a sitemap.xml to help that along. Generating that is not supported in released mdbook just yet; I'd welcome suggestions on what to put in it.
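
One possible starting point, sketched in Python with only the standard library. It assumes the sitemap should list just the pages meant to be indexed (for example the current release's manuals), each as a full URL exactly as the server serves it. `build_sitemap` and the example paths are hypothetical, not taken from the site's build:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, paths, lastmod=None):
    """Build a sitemap.xml string listing base_url + path for each path."""
    # Serialize the sitemap namespace as the default namespace (no prefix).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for path in paths:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        loc = ET.SubElement(url, f"{{{SITEMAP_NS}}}loc")
        loc.text = base_url.rstrip("/") + "/" + path.lstrip("/")
        if lastmod:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical page list: only pages that should appear in search results.
print(build_sitemap("https://hledger.org", ["1.25/hledger.html", "1.25/hledger-ui.html"]))
```

Old manual versions are simply left out of `paths`, so the sitemap only advertises pages that do not carry the noindex tag.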

@simonmichael added the seo label (Related to search engine indexing and discoverability) May 6, 2022
@simonmichael
Owner

Basic sitemap created, Google reindexing pending.

@Flimm
Contributor Author

Flimm commented May 9, 2022

That looks great! Thank you.

It occurred to me that it would be better for SEO purposes to have stable URLs that are included in the Google index. What if https://hledger.org/stable/hledger.html worked (and didn't redirect anywhere)? That way, the /stable/ URLs could collect Google juice and improve their ranking.

A URL like https://hledger.org/1.25/hledger.html collects Google juice, but at some point it gets wasted, when a new version comes out. It gets wasted when the old version gets the noindex tag. Even without that tag, it gets wasted, since all the links on the web pointing to it do not get updated to point to the new version, and the Google juice gets divided up between multiple URLs. Sorry that I didn't think of suggesting this before.

@simonmichael
Owner

simonmichael commented May 9, 2022

Good thoughts. My intent was always to have the easy https://hledger.org/hledger.html (hledger-ui.html, hledger-web.html) be the stable URLs for the manuals of the current release. IIRC previously this was done with symlinks or copies, and both URLs existed on the web. With the latest changes, /hledger.html is a redirect to /CURRENTVERSION/hledger.html, i.e. still a sort of "stable URL", and I think I saw today in Google Search Console that they are correctly guessing /hledger.html as the canonical URL.

/hledger.html is missing from the new sitemap, though, so I should maybe add it there.

Though with reindexing still pending, it's a little hard to be sure what's what.
I'm assuming and hoping the old manuals will disappear from Google search results fairly soon, because the sitemap I've submitted does not include them, and/or because they now contain noindex tags.

@simonmichael
Owner

I'm slightly baffled. Taking https://hledger.org/1.0/hledger.html as an example old manual page, it now has the noindex tag, and a sitemap not including it was successfully submitted. After several days its Coverage status remains "Indexed, not submitted in sitemap / URL is on Google / It can appear in Google Search results (if not subject to a manual action or removal request)". Its last crawl date is... May 1, 2022. When I request reindexing, Google says it can't be indexed because of the noindex tag. The docs say not to request removal, and to rely on the sitemap or noindex tag + reindexing instead. So... keep waiting and it will happen?

@Flimm
Contributor Author

Flimm commented May 19, 2022

I had a look at https://hledger.org/1.0/hledger.html. It seems that Googlebot last crawled this page on 4 May 2022. That was before the noindex changes were rolled out. So we need to wait for Googlebot to crawl this page again, or somehow prompt Googlebot to do that.

It's worth distinguishing between the concepts of crawling and indexing. We want Googlebot to crawl these pages, but we don't want it to index them. You said the tool informed you that the page "can't be indexed because of the noindex tag". I think that's the message we want and expect. I'm not sure why the page is still in the Google search results if it can't be indexed.
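
To illustrate the distinction: a crawler has to fetch the page before it can see the tag at all, so the tag only takes effect on the next crawl. Here is a rough Python sketch of the check a crawler would perform on the fetched HTML. `RobotsMetaParser` and `has_noindex` are hypothetical names, and this only looks at the meta tag, not at any `X-Robots-Tag` HTTP header:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html_text):
    """Return True if the HTML asks search engines not to index the page."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.noindex


print(has_noindex('<head><meta name="robots" content="noindex"></head>'))
print(has_noindex("<head><title>journal manual - hledger 1.25</title></head>"))
```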

I also noticed that most of the URLs in the sitemap https://hledger.org/sitemap.xml are broken. Here is the second item in the site map:

<url>
  <loc>https://hledger.org/ACHIEVEMENTS</loc>
  <lastmod>2022-05-08T18:26:14.569Z</lastmod>
</url>

If I visit https://hledger.org/ACHIEVEMENTS, I get a 404 error. A lot of the other URLs are broken, too.
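
For anyone who wants to check the whole sitemap rather than clicking through by hand, here is a rough Python sketch. It assumes the sitemap uses the standard sitemaps.org namespace and that the server answers HEAD requests. The function names are hypothetical:

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def sitemap_locs(sitemap_xml):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [el.text.strip() for el in root.iter(LOC_TAG) if el.text]


def http_status(url):
    """Return the HTTP status code for a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def broken_urls(sitemap_xml, status_of=http_status):
    """Return (url, status) pairs for sitemap entries answering with 4xx/5xx."""
    broken = []
    for url in sitemap_locs(sitemap_xml):
        status = status_of(url)
        if status >= 400:
            broken.append((url, status))
    return broken


# Example use (performs network requests, so it is left commented out):
# with urllib.request.urlopen("https://hledger.org/sitemap.xml") as resp:
#     for url, status in broken_urls(resp.read()):
#         print(status, url)
```

The `status_of` parameter exists so the check can be tried offline with a stub instead of real requests.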

@simonmichael
Owner

simonmichael commented May 19, 2022 via email

@Flimm
Contributor Author

Flimm commented May 19, 2022

I'm pretty sure that the URLs in a sitemap have to be complete URLs. You can't omit the .html suffix if the actual URLs contain it.

@Flimm
Contributor Author

Flimm commented May 27, 2022

It looks like this particular URL has been removed from Google's index now, but some of the other URLs haven't been recrawled yet.

@Flimm
Contributor Author

Flimm commented Jun 24, 2022

It seems like Google has recrawled most of the URLs by now. The sitemap still contains invalid URLs.

@simonmichael
Owner

One year later... @Flimm do you still see problems that should be fixed? We want the best possible indexing, but without spending a ton of time on it.

@Flimm
Contributor Author

Flimm commented Jun 9, 2023

Looks good to me! Thank you for fixing this. It definitely makes using hledger easier, as it's now easier to look up relevant documentation and discussion on Google.

@Flimm closed this as completed Jun 9, 2023