cronjobs: inject canonical URLs into older manual pages (SEO) #1241

Merged (4 commits) on Dec 2, 2024

neteler
Member

@neteler neteler commented Nov 12, 2024

For a long time, the GRASS GIS manual pages of the different versions have been published under a hard-to-understand scheme in which pages are hidden, redirected or shown, which also strongly affects search engine ranking.

SEO: Without an indication of "canonical" URLs, the different versions wipe each other out in search engines. Canonical tags help consolidate duplicate or similar content by specifying the preferred version of a page, ensuring search engines index and rank the desired URL while avoiding duplicate content issues.

This PR changes the cronjob scripts to

  • inject "grass-stable" as the "canonical" into older manual pages under versioned URLs
  • inject "grass-devel" as the "canonical" into the development manual pages under versioned URLs

This way, no "duplicate content" should occur from an SEO perspective.
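For illustration, the injection step could look roughly like the following. This is a minimal Python sketch, not the code of the cronjob scripts themselves; the base URL layout and the function name `inject_canonical` are assumptions made for the example.

```python
import re

# Assumed URL layout of the stable manual; adjust to the real deployment.
CANONICAL_BASE = "https://grass.osgeo.org/grass-stable/manuals/"


def inject_canonical(html: str, page: str, base: str = CANONICAL_BASE) -> str:
    """Insert a canonical link before </head> unless one is already present."""
    # Idempotent: repeated cron runs must not add a second canonical tag.
    if re.search(r'<link[^>]+rel=["\']canonical["\']', html, flags=re.IGNORECASE):
        return html
    tag = f'<link rel="canonical" href="{base}{page}">\n'
    # Only the first closing head tag is touched; pages without one stay unchanged.
    return re.sub(
        r"</head>",
        lambda m: tag + m.group(0),
        html,
        count=1,
        flags=re.IGNORECASE,
    )
```

Calling `inject_canonical(page_html, "r.info.html")` on a versioned page would then mark the stable copy of `r.info.html` as the preferred URL.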

Also robots.txt is updated to reactivate the manual pages of old GRASS GIS versions (which now contain "grass-stable" as the canonical).
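A quick way to check whether a given manual URL is crawlable under a `robots.txt` is Python's standard `urllib.robotparser`. The rules below are hypothetical and only serve the example; they are not the contents of the real file on grass.osgeo.org.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one old version still blocked, everything else allowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /grass64/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the given user agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With these example rules, a URL under `/grass78/manuals/` is reported as crawlable, while one under `/grass64/` is not.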

Additionally, the red box injection is rewritten to avoid the globbing error "argument list too long" in old versions of the libpython manual.

Fixes OSGeo/grass#4579

@neteler neteler added manual Documentation related issues CI Continuous integration labels Nov 12, 2024
@neteler neteler self-assigned this Nov 12, 2024
@neteler
Member Author

neteler commented Nov 12, 2024

Note: these files are now deployed on grass.osgeo.org for testing.

@echoix
Member

echoix commented Nov 12, 2024

Wouldn't the grass-devel and grass-stable primary content be rather similar, thus being potentially penalized as duplicates?

How does this method handle pages that don't exist in later versions, or are renamed/moved?
It doesn't happen often, but we had some every now and then.

… to point to "stable" manual (rather than "devel")
@neteler
Member Author

neteler commented Nov 13, 2024

> Wouldn't the grass-devel and grass-stable primary content be rather similar, thus being potentially penalized as duplicates?

Very good point.
In 9824486, I changed the "canonical" injected into both the versioned 8.5 manual pages and the "grass-devel" manual pages so that it points to "stable" rather than to "devel".

Deployed update on grass.osgeo.org, triggered cronjob and told Google Search about it.

> How does this method handle pages that don't exist in later versions, or are renamed/moved? It doesn't happen often, but we had some every now and then.

A few of them are handled with redirects in Apache. I don't know of any other method.
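To find the pages that would need such a redirect, one could compare the file names of an old manual directory with those of the stable one. This is only a sketch; the directory arguments and the function name are hypothetical and not part of the cronjob scripts.

```python
from pathlib import Path


def pages_missing_in_stable(old_dir: str, stable_dir: str) -> list[str]:
    """List HTML pages present in old_dir that have no same-named page in stable_dir."""
    old_pages = {p.name for p in Path(old_dir).glob("*.html")}
    stable_pages = {p.name for p in Path(stable_dir).glob("*.html")}
    # These pages would get a canonical URL that does not resolve,
    # so they are candidates for a redirect or for being skipped.
    return sorted(old_pages - stable_pages)
```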

@neteler
Member Author

neteler commented Nov 19, 2024

Too bad, the build with cron_grass_preview_build_binaries.sh on the Debian grass.osgeo.org server is now broken after the MD merge:

...
Parsing <v.what.strds.timestamp>... SUCCESS
Parsing <wx.metadata>... FAILED
Parsing <wx.mwprecip>... FAILED
Parsing <wx.stream>... FAILED
Parsing <wx.wms>... FAILED
+ cp /home/neteler/.grass8/addons/modules.xml /var/www/code_and_data/addons/grass8/modules.xml
+ export ARCH
+ export ARCH_DISTDIR=/home/neteler/src//main/dist.x86_64-pc-linux-gnu
+ export GISBASE=/home/neteler/src//main/dist.x86_64-pc-linux-gnu
+ export VERSION_NUMBER=8.5
+ python3 /home/neteler/src//main/man/build_keywords.py /var/www/code_and_data/grass85/manuals/ /var/www/code_and_data/grass85/manuals/addons/
Traceback (most recent call last):
  File "/home/neteler/src//main/man/build_keywords.py", line 202, in <module>
    build_keywords("md")
  File "/home/neteler/src//main/man/build_keywords.py", line 68, in build_keywords
    from build_md import (
  File "/home/neteler/src/main/man/build_md.py", line 264, in <module>
    man_dir = os.path.join(os.environ["MDDIR"], "source")
  File "/usr/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'MDDIR'

Can @landam help?
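For context, the traceback shows that `os.environ["MDDIR"]` is read at import time, so the script fails whenever the calling environment does not export `MDDIR`. A defensive lookup could look like the sketch below; this is an illustration under that reading of the traceback, not the fix applied in the GRASS source, and the helper name `source_dir` is hypothetical.

```python
import os


def source_dir(env=None, var="MDDIR"):
    """Return <MDDIR>/source, failing with a readable message if MDDIR is unset."""
    env = os.environ if env is None else env
    base = env.get(var)
    if base is None:
        # A plain KeyError('MDDIR') gives no hint about what the caller forgot.
        raise RuntimeError(f"{var} is not set; export it before running the build script")
    return os.path.join(base, "source")
```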

@neteler
Member Author

neteler commented Nov 22, 2024

> Too bad, the build with cron_grass_preview_build_binaries.sh on the Debian grass.osgeo.org server is now broken after the MD merge:

Bugfix PR: OSGeo/grass#4739

@echoix
Member

echoix commented Nov 25, 2024

Is everything clear for this one now?

@neteler neteler requested a review from wenzeslaus November 27, 2024 15:58
neteler added a commit to neteler/grass-website that referenced this pull request Nov 27, 2024
So far, https://grass.osgeo.org/sitemap.xml listed the versioned manual pages, which is unhelpful for consolidating search engine results for manuals.
In the past months we were penalized for "duplicate content".

For an overview, see OSGeo/grass#4579

For efforts to address this situation, see

- OSGeo/grass-addons#1168
- OSGeo/grass-addons#1241

This PR changes the URLs in `sitemap.xml` from versioned manual URLs to grass-stable/grass-devel in order to complete the other PRs.
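The URL change amounts to a simple rewrite, sketched here in Python. The `/grassNN/manuals/` pattern is inferred from paths such as `grass85/manuals` and is an assumption about the sitemap entries, as is the function name.

```python
import re

# Assumed shape of versioned manual URLs, e.g. /grass85/manuals/
VERSIONED_MANUAL = re.compile(r"/grass\d+/manuals/")


def to_stable(url: str) -> str:
    """Rewrite a versioned manual URL to its grass-stable counterpart."""
    return VERSIONED_MANUAL.sub("/grass-stable/manuals/", url, count=1)
```

URLs that do not match the versioned pattern are returned unchanged.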
neteler added a commit to OSGeo/grass-website that referenced this pull request Nov 27, 2024
@neteler
Member Author

neteler commented Nov 30, 2024

> Is everything clear for this one now?

Almost. At the moment I am fixing the red box injection into the old libpython manual pages, which fails with the globbing-related error "argument list too long" due to recursive find operations.
I am currently testing a different approach on the server.
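One general way to sidestep the limit, not necessarily the approach tested on the server, is to visit files one at a time instead of expanding a glob into a single command line. The Python sketch below assumes a hypothetical notice markup and a plain `<body>` tag as the insertion point.

```python
import os
from pathlib import Path

# Hypothetical markup for the red notice box.
NOTICE = b'<div class="outdated-notice">Note: this page documents an old version.</div>\n'


def inject_notice(root: str, marker: bytes = b"<body>") -> int:
    """Add NOTICE after the first marker in every .html file below root.

    Files are processed one by one, so no long argument list is ever built.
    Returns the number of files that were changed.
    """
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".html"):
                continue
            path = Path(dirpath) / name
            data = path.read_bytes()  # bytes avoid guessing the page encoding
            if NOTICE in data or marker not in data:
                continue  # already injected, or no insertion point
            path.write_bytes(data.replace(marker, marker + b"\n" + NOTICE, 1))
            changed += 1
    return changed
```

Because pages that already contain the notice are skipped, the function can be run repeatedly from a cronjob without stacking boxes.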

@neteler
Member Author

neteler commented Dec 1, 2024

@wenzeslaus from my side this PR is now complete.

@neteler neteler merged commit ad36f57 into OSGeo:grass8 Dec 2, 2024
7 checks passed
@neteler neteler deleted the cronjobs_manual_canonical branch December 2, 2024 20:41
@neteler
Member Author

neteler commented Dec 2, 2024

cronjob files including cron_job_list_grass updated and deployed on server.

Labels
CI Continuous integration manual Documentation related issues
Development

Successfully merging this pull request may close these issues.

[Bug] docs: Fix search engine ranking of manual pages (SEO)
2 participants