Seemingly duplicate results in search engines like Google #66
Comments
Great suggestions, thanks. I'll work on this if no-one beats me to it. |
I believe this is done. Changes:
A test search: https://www.google.com/search?q=hledger+posting
|
I think search results (even just Google's) may take a long time to clean up, and probably it would be wise to create a sitemap.xml to help that along. Generating that is not supported in released mdbook just yet; I'd welcome suggestions on what to put in it. |
Basic sitemap created, google reindexing pending. |
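As a rough illustration of what a basic sitemap could contain, here is a minimal sketch in Python. It is not how the hledger site generates its sitemap, and it is not an mdbook feature; `build_sitemap` is a hypothetical helper, and the three URLs in the usage lines are just the stable manual URLs mentioned in this thread.

```python
from datetime import datetime, timezone
from xml.etree import ElementTree as ET

# Namespace required by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, lastmod=None):
    """Return a sitemap.xml document (as a string) listing absolute URLs.

    Hypothetical helper: every URL gets the same <lastmod> timestamp.
    """
    if lastmod is None:
        lastmod = datetime.now(timezone.utc)
    stamp = lastmod.strftime("%Y-%m-%dT%H:%M:%SZ")

    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = stamp

    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Only pages that should be indexed go in; old versioned manuals stay out.
    print(build_sitemap([
        "https://hledger.org/hledger.html",
        "https://hledger.org/hledger-ui.html",
        "https://hledger.org/hledger-web.html",
    ]))
```

The idea is to list only the URLs that should appear in search results, each as a complete URL.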
That looks great! Thank you. It occurred to me that it would be better for SEO purposes to have stable URLs that are included in the Google index. What if https://hledger.org/stable/hledger.html worked (and didn't redirect anywhere)? A URL like https://hledger.org/1.25/hledger.html collects Google juice, but at some point it gets wasted, when a new version comes out. It gets wasted when the old version gets the |
Good thoughts. My intent was always to have the easy https://hledger.org/hledger.html (hledger-ui.html, hledger-web.html) be the stable URLs for the manuals of the current release. IIRC previously this was done with symlinks or copies, and both URLs existed on the web. With the latest changes, /hledger.html is a redirect to /CURRENTVERSION/hledger.html, i.e. still a sort of "stable URL", and I think I saw today in Google Search Console that they are correctly guessing /hledger.html as the canonical URL. /hledger.html is missing from the new sitemap, though, so I should maybe add it there. Though with reindexing still pending, it's a little hard to be sure what's what. |
I'm slightly baffled. Taking https://hledger.org/1.0/hledger.html as an example old manual page, it now has the noindex tag, and a sitemap not including it was successfully submitted. After several days its Coverage status remains "Indexed, not submitted in sitemap / URL is on Google / It can appear in Google Search results (if not subject to a manual action or removal request)". Its last crawl date is... May 1, 2022. When I request reindexing, Google says it can't be indexed because of the noindex tag. The docs say not to request removal, and to rely on the sitemap or noindex tag + reindexing instead. So... keep waiting and it will happen? |
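One way to confirm independently that a page really carries the noindex tag is to parse its HTML. The following is a minimal sketch using only the Python standard library; `RobotsMetaParser` and `has_noindex` are hypothetical names, and it only looks at `<meta name="robots">` tags (it does not check an `X-Robots-Tag` HTTP header or crawler-specific meta names).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html_text):
    """Return True if the HTML contains a robots meta tag with 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noindex" in parser.directives


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex"></head></html>'
    print(has_noindex(sample))
```

Run against the downloaded HTML of an old manual page, a check like this only says whether the tag is present; it says nothing about when Google will next crawl the page and act on it.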
I had a look at https://hledger.org/1.0/hledger.html . It seems that Googlebot last crawled this page on 4 May 2022. That was before the

It's worth distinguishing between the concepts of crawling and indexing. We want Googlebot to crawl these pages, but we don't want it to index them. You said the tool informed you that the page "can't be indexed because of the noindex tag". I think that's the message we want and expect. I'm not sure why the page is still in the Google search results if it can't be indexed.

I also noticed that most of the URLs in the sitemap https://hledger.org/sitemap.xml are broken. Here is the second item in the sitemap:

<url>
  <loc>https://hledger.org/ACHIEVEMENTS</loc>
  <lastmod>2022-05-08T18:26:14.569Z</lastmod>
</url>

If I visit https://hledger.org/ACHIEVEMENTS, I get a 404 error. A lot of the other URLs are broken, too. |
It seemed to me that it's normal for sitemap.xml to omit the .html suffix. Is that wrong?
|
I'm pretty sure that each URL in a sitemap has to be a complete URL. You can't omit the |
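A check along these lines could catch such entries before the sitemap is submitted. This is only a sketch under assumptions: `broken_locs` is a hypothetical helper, and `published_paths` stands for the set of URL paths the site actually serves (for example, collected from the build output directory). The path `/ACHIEVEMENTS.html` in the usage lines is an assumed example of a correct path, not something confirmed in this thread.

```python
from urllib.parse import urlparse
from xml.etree import ElementTree as ET

# Namespace used by sitemaps.org sitemap files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def broken_locs(sitemap_xml, published_paths):
    """Return the <loc> URLs that are not absolute or not actually published."""
    root = ET.fromstring(sitemap_xml)
    broken = []
    for loc in root.findall("sm:url/sm:loc", NS):
        url = (loc.text or "").strip()
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            broken.append(url)  # not a complete URL
        elif parts.path not in published_paths:
            broken.append(url)  # complete, but no such page (would 404)
    return broken


if __name__ == "__main__":
    sitemap = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://hledger.org/ACHIEVEMENTS</loc></url>"
        "<url><loc>https://hledger.org/hledger.html</loc></url>"
        "</urlset>"
    )
    # Assumed set of paths the site really serves.
    published = {"/ACHIEVEMENTS.html", "/hledger.html"}
    print(broken_locs(sitemap, published))
    # prints ['https://hledger.org/ACHIEVEMENTS']
```

Comparing against the published paths avoids network requests; fetching each URL and checking for a 404 would be the more direct but slower alternative.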
It looks like this particular URL has been removed from Google's index now, but some of the other URLs haven't been recrawled yet. |
It seems like Google has recrawled most of the URLs by now. The sitemap still contains invalid URLs. |
One year later... @Flimm do you still see problems that should be fixed? We want the best possible indexing, but on the other hand without spending a ton of time. |
Looks good to me! Thank you for fixing this. It definitely makes using hledger easier, as it's now easier to look up relevant documentation and discussion on Google. |
Let's say I'm trying to find some documentation about postings. I open Google and search for "hledger posting". These are the results that I see:
As you can see, the first results look similar. The first three have the same title "journal manual - hledger", and they have the same breadcrumb "https://hledger.org › journal". The links of the first three results are:
(At the time of writing, the latest version of hledger is 1.25)
As you can see, these results are basically the same page, but for different versions of hledger. As you can imagine, this can make it harder to find what I was looking for.
Here are some suggestions:
- Include the version number in the <title> tag of these pages. That way, the titles on Google will look different, like "journal manual - hledger 1.0", "journal manual - hledger 1.13", and so on.
- Add <meta name="robots" content="noindex"> for older versions of hledger only, or consider removing older versions altogether.
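The two suggestions above (a version number in the title, and a noindex tag on older versions only) could be applied as a post-processing step on the generated HTML. Here is a minimal sketch, assuming each rendered page is available as a string; `tag_versioned_page` is a hypothetical helper, and the regex-based editing is a simplification that assumes a plain `<title>...</title>` element and a `<head>` tag, rather than a full HTML rewrite.

```python
import re

NOINDEX_TAG = '<meta name="robots" content="noindex">'


def tag_versioned_page(html_text, version, current_version):
    """Append the version to <title>; add noindex unless it is the current version."""

    def add_version(match):
        title = match.group(2).rstrip()
        if title.endswith(" " + version):
            return match.group(0)  # already tagged, leave untouched
        return f"{match.group(1)}{title} {version}{match.group(3)}"

    # Suggestion 1: make titles distinct per version.
    result = re.sub(
        r"(<title>)(.*?)(</title>)",
        add_version,
        html_text,
        count=1,
        flags=re.IGNORECASE | re.DOTALL,
    )

    # Suggestion 2: keep older versions out of the search index.
    if version != current_version and NOINDEX_TAG not in result:
        result = re.sub(
            r"(<head(?:\s[^>]*)?>)",
            lambda m: m.group(1) + "\n" + NOINDEX_TAG,
            result,
            count=1,
            flags=re.IGNORECASE,
        )
    return result


if __name__ == "__main__":
    page = "<html><head><title>journal manual - hledger</title></head><body></body></html>"
    print(tag_versioned_page(page, "1.0", "1.25"))
```

Applying the function twice to the same page leaves it unchanged, so it can be rerun safely on every site build.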