
Mark most UI links & buttons as rel="nofollow" to avoid constant bot traffic #17341

Closed
noerw opened this issue Oct 17, 2021 · 9 comments

@noerw
Member

noerw commented Oct 17, 2021

Feature Description

Gitea is a magnet for search engines: once they find an instance, they happily follow every link on the site, of which there are many, resulting in never-ending indexer bot traffic. Among the links followed are UI buttons (star a repo, sort by XYZ, select a UI language, ...), as well as pages that are expensive to render but provide little value once indexed (blame, compare, commit, ...).
Ideally, bots would not even attempt to index these.
I tried to accomplish this on my site via a robots.txt along the following lines, but was not exactly successful, probably because many bots don't understand the wildcard syntax (path wildcards in Disallow rules are a nonstandard extension that only some major crawlers support):

User-agent: *
Disallow: /
Allow: /whitelisted-user
Disallow: /*/raw
Disallow: /*/commit
Disallow: /*/blame
Disallow: /*/src
Disallow: /*?lang=*

A better approach would be to render most links with the rel="nofollow" attribute. I'd argue this could be applied to all links, except for links to the following (see the markup sketch after the list):

  • landing page
  • user / org
  • repo
  • issue(s) / pr(s) / release(s) / wiki / you get the idea...
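
For illustration, a minimal sketch of what the markup difference could look like (the URLs and link texts are made up for this example, not taken from Gitea's actual templates):

<!-- UI control: marked nofollow so crawlers skip it -->
<a rel="nofollow" href="?sort=recentupdate">Sort by recently updated</a>

<!-- content link (repo, issue, ...): left crawlable -->
<a href="/some-user/some-repo/issues/42">#42 Fix the frobnicator</a>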

Screenshots

No response

@noerw changed the title from "Mark most UI links & buttons as rel="nofollow" to avoid search engine" to "Mark most UI links & buttons as rel="nofollow" to avoid constant bot traffic" Oct 17, 2021
@raygervais

I'd like to take a crack at adding this in; I may look for advice beyond the obvious approaches in case there is a better one.
🥂

@techknowlogick
Member

@raygervais that'd be awesome! If you have any Qs please feel free to ask, or hop in chat :)

@Brijesh-09

Is this issue still open to work on after the last commit?

@wxiaoguang
Contributor

In my mind it's still open, but not a good first issue.

The set of pages that should be marked as nofollow should be chosen carefully, instead of adding nofollow to every page and every link, unless it's certain that all users want nofollow everywhere.

For users who want to block the bots, there could be a solution that adds nofollow in the common head template, instead of changing all the links.
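
A minimal sketch of that idea, assuming a shared head template rendered on every page (the placement is illustrative; whether this should sit behind a config option is exactly the open question):

<!-- e.g. in the site-wide <head> template -->
<meta name="robots" content="noindex, nofollow">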

@zeripath
Contributor

zeripath commented Jun 4, 2022

hmm... all it would take in that instance would be one errant link to a non-"walled" page and the search bots would be back in.

I guess the question is what would we like search engines to crawl?

I think it would be illuminating to take a look at GH's robots.txt, and to consider sticking nofollow on related links.

@wxiaoguang
Contributor

Agreed on fine-tuning the robots.txt for bot-blocking purposes. GH's robots.txt seems pretty simple and clear.
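
For illustration, a denylist-style robots.txt in that spirit might block only the expensive, low-value routes named in this thread, instead of disallowing everything and whitelisting users (a sketch, not GitHub's actual file; wildcard support still varies by crawler):

User-agent: *
Disallow: /*/raw
Disallow: /*/commit
Disallow: /*/blame
Disallow: /*/compare
Disallow: /*?lang=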

@wxiaoguang removed the "good first issue" (Likely to be an easy fix) label Jul 24, 2023
@FWDekker
Contributor

I think it's important that this issue gets more attention, because without it, Gitea will (seemingly) be unusable on low-performance servers.

Since last year, there's been a plethora of new bots scraping the entire Internet to train LLMs. This infamously includes ClaudeBot, which is happy to send multiple requests per second, 24/7, each triggering complicated database queries, effectively creating a persistent DDoS attack. An example request is pulls?assignee=1&labels=150%2C149%2C147%2C152%2C144%2C146&milestone=5&poster=0&project=-1&state=open&type=all. Completely nonsensical, and the output is no different from that of any of the other queries it sends.

My server gets on average five of those requests per second. I intentionally rent a cheap server, which really isn't all that busy otherwise, but bots really like to spam it to death. The image below shows the CPU usage of my server over the last 24 hours. There's a small gap, which is when I configured my server to drop all connections with user agent "ClaudeBot". When ClaudeBot is allowed, usage sits at roughly 90%. When ClaudeBot is denied, usage sits at around 20%.
[Figure: CPU usage on my server over the last 24 hours.]
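
For anyone wanting to replicate that user-agent block, a minimal sketch for an nginx reverse proxy sitting in front of Gitea (assuming nginx is the proxy; matching on User-Agent is trivially evaded by agents that spoof it):

# inside the server {} block that proxies to Gitea
if ($http_user_agent ~* "claudebot") {
    return 444;  # nginx-specific: close the connection without a response
}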

I contacted the operators of ClaudeBot, informing them of the useless queries. They replied quite fast, and courteously. Here is an excerpt I think is relevant:

Anthropic aims to limit the impact of our crawling on website operators. We respect industry standard robots.txt instructions, including any disallows for the CCBot User-Agent (we use ClaudeBot as our UAT. Documentation is available at https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler). Our crawler also respects anti-circumvention technologies and does not attempt to bypass CAPTCHAs or logins. Re: query shaping, I wanted to note that we also respect nofollow when crawling so while we do have heuristics to try to avoid crawling repeated pages, nofollow would be helpful here for signaling that to us (#17341 appears to be unloved).

Therefore, I think it is important that rel="nofollow" is added wherever it is needed. Since this issue was opened three years ago, bot traffic has grown by an order of magnitude, potentially DoS'ing Gitea instances. Though robots.txt provides a solution, I think Gitea should be updated to handle this era of LLM-scrapers by default.

@WeirdConstructor

I had to close public access to my privately hosted Gitea repositories too, after bot traffic exceeded my server's capabilities around mid-2024. I am not sure that rel="nofollow" will have any effect. It would be nice to have something like a hidden link, or a link with text like "clicking will make the site inaccessible", that blocks access when clicked (a rough sketch of such a honeypot follows below).
Or something more sophisticated, like Anubis' sha256 proof-of-work challenge. Or anything that resembles a CAPTCHA-like challenge.
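
A rough sketch of that honeypot idea as standalone Go middleware (everything here is hypothetical, not Gitea code; the /trap path, in-memory ban list, and inline HTML are purely illustrative):

package main

import (
	"net"
	"net/http"
	"sync"
)

// bannedIPs records clients that followed the hidden trap link.
var (
	mu        sync.Mutex
	bannedIPs = map[string]bool{}
)

// clientIP extracts the remote IP, ignoring the port.
// Note: behind a reverse proxy this sees the proxy's address.
func clientIP(r *http.Request) string {
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return r.RemoteAddr
	}
	return host
}

// honeypot bans any client that ever requests the hidden /trap URL
// and serves 403 to banned clients on all subsequent requests.
func honeypot(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip := clientIP(r)
		mu.Lock()
		if r.URL.Path == "/trap" {
			bannedIPs[ip] = true
		}
		banned := bannedIPs[ip]
		mu.Unlock()
		if banned {
			http.Error(w, "Forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// The trap link is invisible to humans and marked nofollow,
		// so only crawlers that ignore nofollow will follow it.
		w.Write([]byte(`<a href="/trap" rel="nofollow" style="display:none">do not follow</a><p>content</p>`))
	})
	http.ListenAndServe(":8080", honeypot(mux))
}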

@wxiaoguang
Contributor

These days nofollow doesn't really help, because many AI crawlers don't respect such signals.

See:

Setting to disable expensive endpoints for anonymous users #33966

Add a config option to block "expensive" pages for anonymous users #34024
