
Mark most UI links & buttons as rel="nofollow" to avoid constant bot traffic #17341

Closed
noerw opened this issue Oct 17, 2021 · 9 comments

@noerw
Member

noerw commented Oct 17, 2021

Feature Description

Gitea is a magnet for search engines: once they find an instance, they happily follow every link on the site, of which there are many, resulting in never-ending indexer bot traffic. Among the links followed are UI buttons (star a repo, sort by XYZ, select a UI language, ...), as well as pages that are expensive to render but provide little value once indexed (blame, compare, commit, ...).
Ideally, bots would not even attempt to index these.
I tried to accomplish this on my site via a robots.txt along the following lines, but was not exactly successful, probably because many bots don't understand the wildcard syntax (path wildcards in Disallow rules are a nonstandard extension that only some major crawlers support):

User-agent: *
Disallow: /
Allow: /whitelisted-user
Disallow: /*/raw
Disallow: /*/commit
Disallow: /*/blame
Disallow: /*/src
Disallow: /*?lang=*

A better approach would be to render most links with the rel="nofollow" attribute. I'd argue this could be applied to all links, except for links to the following (see the markup sketch after the list):

  • landing page
  • user / org
  • repo
  • issue(s) / pr(s) / release(s) / wiki / you get the idea...
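
For illustration, a minimal sketch of what the markup difference could look like (the URLs and link texts are made up for this example, not taken from Gitea's actual templates):

<!-- UI control: marked nofollow so crawlers skip it -->
<a rel="nofollow" href="?sort=recentupdate">Sort by recently updated</a>

<!-- content link (repo, issue, ...): left crawlable -->
<a href="/some-user/some-repo/issues/42">#42 Fix the frobnicator</a>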

Screenshots

No response

@noerw changed the title from "Mark most UI links & buttons as rel="nofollow" to avoid search engine" to "Mark most UI links & buttons as rel="nofollow" to avoid constant bot traffic" Oct 17, 2021
@raygervais

I'd like to take a crack at adding this in; I may look for advice beyond the obvious approaches in case there is a better one.
🥂

@techknowlogick
Member

@raygervais that'd be awesome! If you have any Qs please feel free to ask, or hop in chat :)

@Brijesh-09

Is this issue still open to work on after the last commit?

@wxiaoguang
Contributor

In my mind it's still open, but not a good first issue.

The set of pages that should be marked as nofollow should be chosen carefully, instead of adding nofollow to every page and every link, unless it's certain that all users want nofollow everywhere.

For users who want to block the bots, there could be a solution that adds nofollow in the common head template, instead of changing all the links.
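
A minimal sketch of that idea, assuming a shared head template rendered on every page (the placement is illustrative; whether this should sit behind a config option is exactly the open question):

<!-- e.g. in the site-wide <head> template -->
<meta name="robots" content="noindex, nofollow">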

@zeripath
Contributor

zeripath commented Jun 4, 2022

hmm... all it would take in that instance would be one errant link to a non-"walled" page and the search bots would be back in.

I guess the question is what would we like search engines to crawl?

I think it would be illuminating to take a look at GH's robots.txt, and to consider sticking nofollow on related links.

@wxiaoguang
Contributor

Agreed on fine-tuning the robots.txt for bot-blocking purposes. GH's robots.txt seems pretty simple and clear.
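
For illustration, a denylist-style robots.txt in that spirit might block only the expensive, low-value routes named in this thread, instead of disallowing everything and whitelisting users (a sketch, not GitHub's actual file; wildcard support still varies by crawler):

User-agent: *
Disallow: /*/raw
Disallow: /*/commit
Disallow: /*/blame
Disallow: /*/compare
Disallow: /*?lang=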

@wxiaoguang removed the "good first issue" (Likely to be an easy fix) label Jul 24, 2023
@FWDekker
Contributor

I think it's important that this issue gets more attention, because without it, Gitea will (seemingly) be unusable on low-performance servers.

Since last year, there's been a plethora of new bots scraping the entire Internet to train LLMs. This infamously includes ClaudeBot, which is happy to send multiple requests per second, 24/7, each triggering complicated database queries, effectively creating a persistent DDoS attack. An example request is pulls?assignee=1&labels=150%2C149%2C147%2C152%2C144%2C146&milestone=5&poster=0&project=-1&state=open&type=all. Completely nonsensical, and the output is no different from that of any of the other queries it sends.

My server gets on average five of those requests per second. I intentionally rent a cheap server, which really isn't all that busy otherwise, but bots really like to spam it to death. The image below shows the CPU usage of my server over the last 24 hours. There's a small gap, which is when I configured my server to drop all connections with user agent "ClaudeBot". When ClaudeBot is allowed, usage sits at roughly 90%. When ClaudeBot is denied, usage sits at around 20%.
[Figure: CPU usage on my server over the last 24 hours.]
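
For anyone wanting to replicate that user-agent block, a minimal sketch for an nginx reverse proxy sitting in front of Gitea (assuming nginx is the proxy; matching on User-Agent is trivially evaded by agents that spoof it):

# inside the server {} block that proxies to Gitea
if ($http_user_agent ~* "claudebot") {
    return 444;  # nginx-specific: close the connection without a response
}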

I contacted the operators of ClaudeBot, informing them of the useless queries. They replied quite fast, and courteously. Here is an excerpt I think is relevant:

Anthropic aims to limit the impact of our crawling on website operators. We respect industry standard robots.txt instructions, including any disallows for the CCBot User-Agent (we use ClaudeBot as our UAT. Documentation is available at https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler). Our crawler also respects anti-circumvention technologies and does not attempt to bypass CAPTCHAs or logins. Re: query shaping, I wanted to note that we also respect nofollow when crawling so while we do have heuristics to try to avoid crawling repeated pages, nofollow would be helpful here for signaling that to us (#17341 appears to be unloved).

Therefore, I think it is important that rel="nofollow" is added wherever it is needed. Since this issue was opened three years ago, bot traffic has grown by an order of magnitude, potentially DoS'ing Gitea instances. Though robots.txt provides a solution, I think Gitea should be updated to handle this era of LLM-scrapers by default.

@WeirdConstructor

I had to close public access to my privately hosted Gitea repositories too, after bot traffic exceeded my server's capabilities around mid-2024. I am not sure that rel="nofollow" will have any effect. It would be nice to have something like a hidden link, or a link with text like "clicking will make the site inaccessible", that blocks access when clicked (a rough sketch of such a honeypot follows below).
Or something more sophisticated, like Anubis' sha256 proof-of-work challenge. Or anything that resembles a CAPTCHA-like challenge.
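
A rough sketch of that honeypot idea as standalone Go middleware (everything here is hypothetical, not Gitea code; the /trap path, in-memory ban list, and inline HTML are purely illustrative):

package main

import (
	"net"
	"net/http"
	"sync"
)

// bannedIPs records clients that followed the hidden trap link.
var (
	mu        sync.Mutex
	bannedIPs = map[string]bool{}
)

// clientIP extracts the remote IP, ignoring the port.
// Note: behind a reverse proxy this sees the proxy's address.
func clientIP(r *http.Request) string {
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return r.RemoteAddr
	}
	return host
}

// honeypot bans any client that ever requests the hidden /trap URL
// and serves 403 to banned clients on all subsequent requests.
func honeypot(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip := clientIP(r)
		mu.Lock()
		if r.URL.Path == "/trap" {
			bannedIPs[ip] = true
		}
		banned := bannedIPs[ip]
		mu.Unlock()
		if banned {
			http.Error(w, "Forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// The trap link is invisible to humans and marked nofollow,
		// so only crawlers that ignore nofollow will follow it.
		w.Write([]byte(`<a href="/trap" rel="nofollow" style="display:none">do not follow</a><p>content</p>`))
	})
	http.ListenAndServe(":8080", honeypot(mux))
}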

@wxiaoguang
Contributor

These days nofollow doesn't really help, because many AI crawlers don't respect such signals.

See:

Setting to disable expensive endpoints for anonymous users #33966

Add a config option to block "expensive" pages for anonymous users #34024
