Extremely poor docs.python.org SEO performance. #1691

Open
nelhage opened this issue Nov 27, 2020 · 31 comments

@nelhage

nelhage commented Nov 27, 2020

Describe the bug
docs.python.org has atrocious search performance on Google. It's so bad that I suspect Google is actively downranking it for some reason.

To Reproduce
Search Google for virtually any Python documentation topic. The ones that drove me here were searches for [python set] and [python shuffle list].

Expected behavior
docs.python.org is the authoritative source for Python documentation on the web. I expect to find relevant results on docs.python.org somewhere on the first page of Google search results.

Instead, I find the opposite. For [python set] I expect to find https://docs.python.org/3/library/stdtypes.html#set somewhere on the first page of results, but instead, the only python.org result I see is for the long-deprecated Python 2 sets module -- https://docs.python.org/2/library/sets.html

For [python shuffle list], neither https://docs.python.org/3/library/random.html nor any other python.org result shows up anywhere on the first page.

Screenshots
image
image

Additional context
These results are egregious enough to make me suspect you're being actively downranked for some reason. This isn't a request for general SEO optimization -- although that'd be a great project if someone has the interest -- but for a domain admin to try to use Google's search console (https://developers.google.com/search) to investigate if there's something egregiously wrong with an easy fix.

@ewdurbin
Member

ewdurbin commented Dec 7, 2020

@JulienPalard Any ideas here? Do you have/need access to the google search console for docs?

@JulienPalard
Member

I do have access to the search console, but I don't think I'm of any help from an SEO point of view.

The search console is telling us our "mobile ergonomy is bad", it's probably just that? I heard Google is ranking mobile pages first.

We have had a PR open since May to make the docs responsive: python/python-docs-theme#46. I don't know if it could help.

@JulienPalard
Member

Today, Google, for the python set search, displays a big box linking to w3schools first ☹ and https://docs.python.org/fr/3/tutorial/datastructures.html 2nd, and no links to our doc for python shuffle list.

Unlike many spammy websites, docs.python.org does not have a page dedicated to those topics, so they probably win because the topic is in the title of their page, and the more specific we go, the more they'll win.

For example, for python control flow we land 1st because our tutorial has a page with control flow in the title and URL (https://docs.python.org/3/tutorial/controlflow.html).

On the other hand, for python remove all duplicate items from a list, stackoverflow is obviously first, and spammy w3schools comes 2nd with an exact title match too, and bad examples, followed by many spammers (dumps of StackOverflow DB I bet for most of them).

An obvious way would be to write a page for all those topics, but user-generated content is equally good at this job (stack overflow typically), with less contributor bottleneck. I'd leave this question to the to-be-created new doc sig.

Sadly, if we don't do it, spammers will always come first, with bad-quality or outdated content, just so they can display their ads.

@di
Member

di commented Apr 21, 2022

Another issue is that the search engines often seem to prefer docs for older releases than for newer releases, e.g.:

image

The missing description is probably also hurting us. The 'learn why' link goes to https://support.google.com/webmasters/answer/7489871?hl=en

@jacobian
Contributor

That other issue you mentioned, @di, is something I've noticed about the Django docs as well -- Google often seems to rank older versions much higher. I wonder if that problem could be solved with a rel=canonical?

@alex
Member

alex commented Apr 21, 2022

FWIW the Rust packages docs also had this problem, and seemed to have solved it, but I can't remember how (and ironically, googling it is useless) -- I don't see rel=canonical on docs.rs pages so there may be another tactic in addition.

@ericholscher

ericholscher commented Apr 21, 2022

It looks like Google is indexing old versions of the Python docs even though they are disallowed in the robots.txt: https://docs.python.org/robots.txt -- at least the link that @di is pointing at. That is probably the first thing I'd try to fix, but agreed that canonicalization can sometimes help. Google is quite fickle though, and it's hard to understand how to fix this. We've tried a number of different things.

We have lots more tips here: https://docs.readthedocs.io/en/stable/guides/technical-docs-seo-guide.html.

I'm guessing the Python docs history of non-mobile friendly design has probably hurt it a lot over time. I believe that's fixed now though.

The first step I'd do is probably add canonical links to /3/, since I believe that is the "canonical" version. (It looks like y'all are already doing that though, so yea, the unknowable Google indexing conundrum lives on).

The next step is definitely diving into Google Search Console for what it says there.

@davidfischer

davidfischer commented Apr 21, 2022

The next step is definitely diving into Google Search Console for what it says there.

Just to echo Eric, this should definitely be the next step. Whoever has access to y'alls search console will get a lot of details about what Google is doing. For example, there may be something Google sees as spam or duplicated and they've taken some downranking action against the domain.

I looked at the str.strip example from @di's screenshot. It is hard for the Python documentation to compete with a site that has a whole page about the str.strip method with examples especially when you consider that the Python docs have a single mention of the method in the middle of a 2,000 word page. However, there's a few things that would help. I noticed that while some new versions do set the canonical URL, the 3.4 docs do not have rel=canonical on them (look here). This is probably why they're continuing to show up in results. You might also need to let the robots through temporarily so they can index that change once you change it.

A larger (and harder to fix) issue is that a lot of the Python documentation isn't written with search engines in mind. I would tackle issues with robots, sitemaps, and search console first, but this might be worth a look afterwards. Just to give a couple concrete examples:

  • The term regex barely appears on the regular expressions module page (not in great context and not in the first paragraph) but does on the howto page. This is probably why the latter ranks better when searching for something like "python regex". It's a bunch of work, but going through your core docs pages and making sure the title and first 2-3 sentences of the main content describe the page pretty well is probably worth it. Y'all don't use meta descriptions, so those 2-3 sentences are what will show up on search engine results page.
  • Some pages like the built-in types page (where str.strip appears) are going to really struggle with ranking because they cover a lot of ground on a lot of different topics. Instead, I'd consider having a page on iterators, a page on sets, a page on boolean types, etc.

@davidfischer

FWIW the Rust packages docs also had this problem, and seemed to have solved it, but I can't remember how (and ironically, googling it is useless) -- I don't see rel=canonical on docs.rs pages so there may be another tactic in addition.

Probably this ticket: rust-lang/rust#12466

@JulienPalard
Member

Looks like we have had canonical links to /3/ since 3.5:

$ for v in $(seq 10); do echo $v $(curl  https://docs.python.org/3.$v/library/stdtypes.html | grep canonical); done
1
2
3
4
5 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
6 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
7 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
8 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
9 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
10 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
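The same check can be made less dependent on how the tag is formatted by parsing the HTML instead of grepping it. A minimal sketch using only the standard library; the names `CanonicalFinder` and `find_canonical` are illustrative and not part of any existing tooling:

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonicals.append(attr["href"])


def find_canonical(html):
    """Return the first canonical URL found in `html`, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals[0] if finder.canonicals else None
```

To reproduce the loop above, the page text for each version could be fetched with `urllib.request.urlopen` and passed to `find_canonical`.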

It could probably be fixed with some proper sed-fu, but is it worth it, as previous versions are denied by robots.txt:

$ curl https://docs.python.org/robots.txt
Sitemap: https://docs.python.org/sitemap.xml

# Prevent development and old documentation from showing up in search results.
User-agent: *
Disallow: /dev
Disallow: /release

# Disallow EOL versions
Disallow: /2/
Disallow: /2.0/
Disallow: /2.1/
Disallow: /2.2/
Disallow: /2.3/
Disallow: /2.4/
Disallow: /2.5/
Disallow: /2.6/
Disallow: /2.7/
Disallow: /3.0/
Disallow: /3.1/
Disallow: /3.2/
Disallow: /3.3/
Disallow: /3.4/
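Why this matters for the canonical question: a crawler that honours a Disallow rule never fetches the page, so it cannot see a canonical tag added to it. Below is a deliberately simplified sketch of the prefix check, written by hand rather than with a library parser. It only handles `User-agent: *` and plain `Disallow` prefixes (no `Allow`, no wildcards), and the `ROBOTS` text is an abbreviated copy of the file above:

```python
def disallowed_prefixes(robots_txt):
    """Return the Disallow prefixes that apply to `User-agent: *`.

    Simplified on purpose: comments and blank lines are skipped, and
    Allow rules, wildcards and per-crawler groups are not handled.
    """
    prefixes = []
    applies = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            applies = value == "*"
        elif key == "disallow" and applies and value:
            prefixes.append(value)
    return prefixes


def is_blocked(path, robots_txt):
    """True if `path` starts with any Disallow prefix for all crawlers."""
    return any(path.startswith(p) for p in disallowed_prefixes(robots_txt))


# Abbreviated copy of the robots.txt shown above.
ROBOTS = """\
Sitemap: https://docs.python.org/sitemap.xml

User-agent: *
Disallow: /dev
Disallow: /release

# Disallow EOL versions
Disallow: /3.4/
"""

print(is_blocked("/3.4/library/stdtypes.html", ROBOTS))  # True
print(is_blocked("/3/library/stdtypes.html", ROBOTS))    # False
```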

@JulienPalard
Member

It is hard for the Python documentation to compete with a site that has a whole page about the str.strip

I totally agree, I don't think we can beat those, to the point I wonder if we should do the same: for the most searched functions, to build a dedicated "howto" or "tutorial", with up-to-date good practices, examples, and so on.

But I don't feel my English is good enough to start this kind of project ☹

@davidfischer

It could probably be fixed with some proper sed-fu, but is it worth it, as previous versions are denied by robots.txt:

I would definitely fix it. This will stop version 3.4 from showing up in Google's results. You may have to open up the robots.txt for a while after making the change but I'm not sure there.

As to whether to take on a huge docs reformatting/rework project, it's a terrifying, never-ending project of incremental improvement. I'd fix all the concrete easy things (like 3.4 docs showing up in search engines) first.

@gpshead
Member

gpshead commented Feb 8, 2023

Any update on stuffing the older docs that lack rel=canonical information with canonical tags? 3.3 and such are still showing up on top in many searches such as Googling for zip site:docs.python.org

@hugovk
Member

hugovk commented Feb 9, 2023

@JulienPalard Please could you use your sed-fu?

@JulienPalard
Member

JulienPalard commented Feb 9, 2023

@JulienPalard Please could you use your sed-fu?

I can propose:

docsbuild@docs:/srv/docs.python.org/release/3.4.10$ find -name '*.html' | while read -r file; do sed -i '/link rel="shortcut icon/{s|$|\n    <link rel="canonical" href="https://docs.python.org/3/'"$file"'" />|;s|/\./|/|g}' "$file"; done

Followed by:

curl -XPURGE https://docs.python.org/3.4/{$(find -name '*.html' | sed 's|^./||g' | tr '\n' ,)}
curl -XPURGE https://docs.python.org/3.4/{$(find -name '*.html' | sed 's|^./||g' | grep index.html | sed 's/index.html//g' | tr '\n' ,)}

to clean the cache.
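For readers who find the sed one-liner hard to follow, here is roughly the same transformation as a Python sketch. It is not the command that was run on the server. The function name `add_canonical` and the skip-if-present guard are additions for illustration; the anchor line (`link rel="shortcut icon`) and the `/3/` base URL come from the command above:

```python
def add_canonical(html, rel_path, base="https://docs.python.org/3/"):
    """Insert a canonical <link> after the shortcut-icon line, at most once."""
    if 'rel="canonical"' in html:
        return html  # already present; keep the operation idempotent
    # `find` yields paths like "./library/stdtypes.html"; drop the "./"
    if rel_path.startswith("./"):
        rel_path = rel_path[2:]
    tag = f'    <link rel="canonical" href="{base}{rel_path}" />'
    out = []
    inserted = False
    for line in html.split("\n"):
        out.append(line)
        if not inserted and 'link rel="shortcut icon' in line:
            out.append(tag)
            inserted = True
    return "\n".join(out)
```

Like the sed version, this assumes that every old page has a page of the same name under /3/, which is not always true.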

I just ran it for 3.4; tell me if I should go ahead with 3.0, 3.1, 3.2, and 3.3, or if you see an issue.

$ curl  https://docs.python.org/3.4/library/ | grep canonical
    <link rel="canonical" href="https://docs.python.org/3/library/index.html" />
$ curl  https://docs.python.org/3.4/library/stdtypes.html | grep canonical
    <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
$ curl  https://docs.python.org/3.4/library/functions.html | grep canonical
    <link rel="canonical" href="https://docs.python.org/3/library/functions.html" />

@hugovk
Member

hugovk commented Feb 11, 2023

Canonical looks good at https://docs.python.org/3.4/library/, as does https://developers.facebook.com/tools/debug/?q=https%3A%2F%2Fdocs.python.org%2F3.4%2Flibrary%2F

What do others say? Good to do 3.0 - 3.3?

@gpshead
Member

gpshead commented Feb 11, 2023

makes sense, go ahead for the earlier 3s as well.

@JulienPalard
Member

$ for v in $(seq 10); do echo $v $(curl  https://docs.python.org/3.$v/library/stdtypes.html | grep canonical); done
1 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
2 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
3 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
4 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
5 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
6 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
7 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
8 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
9 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />
10 <link rel="canonical" href="https://docs.python.org/3/library/stdtypes.html" />

@davidfischer

This is definitely a step in the right direction, but Google hasn't indexed it yet. I'm not sure whether you need to open up the robots.txt temporarily or you might see if you could submit the docs through Google Search Console for reindexing first.

I did verify that the canonical tag is on that page so it should be picked up eventually.

image

@di
Member

di commented Mar 7, 2023

I agree they should be temporarily removed from robots.txt -- perhaps even permanently? I think that blocking them was likely a misguided attempt to remove these from the search results. The crawler should deprioritize them based on the canonical link.

That said, I'm not seeing how the current robots.txt is being generated, does anyone know?

@di
Member

di commented Mar 7, 2023

@JulienPalard
Member

@di I just removed them: python/docsbuild-scripts@c49181f

@di
Member

di commented Mar 7, 2023

@JulienPalard Thanks! Let me know when that's deployed and I can submit them for reindexing!

@di
Member

di commented Mar 8, 2023

I see that the robots.txt is updated and submitted these for indexing. However, when looking at the search console for the sitemap, I found that Google is not indexing any of the URLs included:

image

This is because the sitemap provides a URL like https://docs.python.org/3/ but this has a canonical link to https://docs.python.org/3/index.html which isn't in the sitemap:

image

I think we probably need to a) make sure the canonical URLs are in the sitemap and b) put many more URLs into the sitemap (possibly, every URL we have). Right now, the sitemap only includes:

https://docs.python.org/3.12/
https://docs.python.org/3.11/
https://docs.python.org/3.10/
https://docs.python.org/3.9/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3/

And doesn't include any older Python versions or any sub-pages.
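As a rough illustration of point (a), a sitemap that lists the canonical URLs themselves could be generated along these lines. This is a sketch, not how docsbuild-scripts produces the file. The function name `build_sitemap` is hypothetical, and the namespace is the standard sitemaps.org one:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a str) with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# List the URL each page declares as canonical, e.g. .../3/index.html
# rather than .../3/, so sitemap and canonical tag agree.
print(build_sitemap([
    "https://docs.python.org/3/index.html",
    "https://docs.python.org/3/library/stdtypes.html",
]))
```

The alternative would be to change the canonical tags to match the URLs already in the sitemap; either way the two should name the same URL.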

@di
Member

di commented Mar 8, 2023

Additionally, here's a breakdown with the top reasons why pages aren't being indexed:

image

@di
Member

di commented Mar 8, 2023

Also, to resolve the issue in OP, where https://docs.python.org/2/library/sets.html is the 2nd result for https://www.google.com/search?q=python+set, I think we probably need to update canonical tags and remove robots.txt blocking for all EOL versions.

@di
Member

di commented Mar 19, 2023

Another thing: it seems like for translations, our canonical URLs should be pointing to the translated versions of these pages with:

<link rel="alternate" hreflang="a-different-language" ...>

https://developers.google.com/search/blog/2010/09/unifying-content-under-multilingual
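A sketch of how such tags might be generated for one page. It assumes English lives at the unprefixed path and each translation under `/<lang>/`, as in the `/fr/3/` URL quoted earlier in this thread. The function name and the language list are illustrative only:

```python
from html import escape


def hreflang_links(rel_path, languages, version="3",
                   base="https://docs.python.org"):
    """Build one <link rel="alternate" hreflang=...> tag per language."""
    links = []
    for lang in languages:
        # Assumption: English has no language prefix, translations do.
        prefix = "" if lang == "en" else f"/{lang}"
        href = f"{base}{prefix}/{version}/{rel_path}"
        links.append(
            f'<link rel="alternate" hreflang="{escape(lang)}" '
            f'href="{escape(href)}" />'
        )
    return links


for tag in hreflang_links("library/stdtypes.html", ["en", "fr"]):
    print(tag)
```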

@di
Member

di commented Mar 19, 2023

It seems like many 3.x pages are still missing canonical tags as well:

$ curl -s https://docs.python.org/3/library/email.examples.html | grep canonical
    <link rel="canonical" href="https://docs.python.org/3/library/email.examples.html" />
    
$ curl -s https://docs.python.org/3.4/library/email-examples.html | grep canonical

$

@JulienPalard
Member

It seems like many 3.x pages are still missing canonical tags as well

Ohhh interesting!

Those lost their canonical tag automatically because it pointed to a 404: it was generated by dumb sed instead of by a human who knew that email-examples.html had been renamed to email.examples.html.

So we have to find all pages like this...

$ cd 3.4/
$ grep -L 'rel="canonical"' **/*.html
howto/webservers.html
library/_dummy_thread.html
library/asyncio-eventloops.html
library/binhex.html
library/dummy_threading.html
library/email-examples.html
library/email.util.html
library/formatter.html
library/fpectl.html
library/macpath.html
library/misc.html
library/othergui.html
library/parser.html
library/symbol.html
library/undoc.html
using/scripts.html

and fix them manually... at the end some will still not have a 'canonical', or at least not one to /3/ if they don't appear on /3/.
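One way to make the manual step explicit is to keep a small rename table and check every proposed canonical target against the pages that actually exist under /3/. This is a sketch only: the function name is hypothetical, and the sets below are tiny illustrative samples rather than real directory listings:

```python
def canonical_targets(old_pages, current_pages, renames):
    """Map each old page to its page under /3/, or None if there is none.

    `renames` is a hand-maintained dict for pages whose file name changed.
    """
    targets = {}
    for page in sorted(old_pages):
        candidate = renames.get(page, page)
        targets[page] = candidate if candidate in current_pages else None
    return targets


# Illustrative samples, not real listings of the 3.4 or /3/ trees.
old = {
    "library/stdtypes.html",
    "library/email-examples.html",
    "library/fpectl.html",
}
current = {"library/stdtypes.html", "library/email.examples.html"}
renames = {"library/email-examples.html": "library/email.examples.html"}

for page, target in canonical_targets(old, current, renames).items():
    print(page, "->", target)
```

Pages that map to `None` are the ones that would be left without a canonical link to /3/.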

@jxu

jxu commented Sep 25, 2024

I don't believe in Google conspiracy - they're lower ranked simply because they're BAD and thus people don't want to use them. I felt this way when I first started learning python, and I still do even though I know most of the language basics.

Take the .strip() string method for example.
image

It's 2024 and it still says the very unfriendly "No Information Available", while linking to 3.4 docs. So I go to https://docs.python.org/3.4/library/stdtypes.html and what do I get? A gigantic page titled "4. Built-in Types". Wait, I thought I was looking for strip!
image

So how do I get there? Well first I need to know that the function will be under string methods and is actually called str.strip(). If I go to 4.7. Text Sequence Type - str (why not just call it String Type str??) and to String Methods, I have to page down 8 screens to get to the function (next to nearly useless functions str.swapcase() and str.title()).

If you want to be user-friendly, you should have one page per TYPE, per CLASS, and in the best case per METHOD. If I google numpy sin, I get a page https://numpy.org/doc/stable/reference/generated/numpy.sin.html that is exactly what I'm looking for! Nothing more, nothing less. It tells me what the parameters and return values are, and there's no guesswork with expected types. If I look up "javascript array push", I get a whole MDN page https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/push. If I look up "python list extend", I get ONE LINE of information. What's the time complexity? Can it fail? Does it modify in-place or return a new list? There has to be more info.

I see this term maybe overused on Stack Overflow but this issue is an "XY Problem". The issue asks why the docs SEO is atrocious, when it should be asking why the docs themselves are atrocious. (There's speculation from large youtube creators that its infamous "algorithm" is the same way - if you want to be consistently recommended, you have to be worthwhile and quality first.) It's mind-boggling to me that a language as massive and successful as python would never improve its languishing documentation.

@wpdevelopment11

wpdevelopment11 commented Oct 3, 2024

Instead of disallowing docs for old versions in robots.txt, you should add a noindex tag to them. The noindex tag should be added to all versions of the docs except the latest one.

For example, how Go does it:

docs.rs uses essentially the same approach, but instead of a noindex tag they add an X-Robots-Tag: noindex header to the response.
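A sketch of the header variant, assuming the web server (or a small piece of middleware) can add a header based on the request path. The function name, the regular expression, and the choice of `/3/` as the only indexable version are assumptions made for illustration, and this is not taken from the Go or docs.rs setups:

```python
import re

# Optional language prefix (e.g. /fr/ or /pt-br/), then a version segment.
VERSIONED = re.compile(r"^/(?:[a-z]{2}(?:-[a-z]{2})?/)?(\d+(?:\.\d+)?)/")


def robots_headers(path, indexable_version="3"):
    """Return extra response headers for `path`.

    Versioned pages other than `indexable_version` get
    `X-Robots-Tag: noindex`; everything else gets no extra header.
    """
    match = VERSIONED.match(path)
    if match and match.group(1) != indexable_version:
        return {"X-Robots-Tag": "noindex"}
    return {}


print(robots_headers("/3.4/library/stdtypes.html"))
print(robots_headers("/3/library/stdtypes.html"))
```

A noindex signal only takes effect if the crawler is allowed to fetch the page, so the same paths should not also be blocked in robots.txt.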
