Mark most UI links & buttons as rel="nofollow" to avoid constant bot traffic #17341
Comments
I'd like to take a crack at adding this. I may look for advice beyond the obvious approaches, in case there is a better one.
@raygervais that would be awesome! If you have any questions, please feel free to ask, or hop in chat :)
Is the issue still open to work on after the last commit?
In my mind it's still open, but not a ... The pages which should be marked as ... For users who want to block the bots, there could be a solution that adds the ...
Hmm... all it would take in that instance would be an errant link to a non-"walled" page and the search bots would be back in. I guess the question is: what would we like search engines to crawl? I think it would be illuminating to take a look at GH's robots.txt, and to consider sticking nofollow on links related ...
Agreed on fine-tuning the robots.txt for bot-blocking purposes. The robots.txt of GH seems pretty simple and clear.
I think it's important that this issue gets more attention, because without it, Gitea will (seemingly) be unusable on low-performance servers. Since last year, there's been a plethora of new bots that scrape the entire Internet to train their LLMs. This infamously includes ClaudeBot, which is happy to send multiple complicated database queries each second, 24/7, effectively creating a persistent DDoS attack. An example query is ...

My server gets on average five of those per second. I intentionally rent a cheap server, which really isn't all that busy, but bots really like to spam it to death.

The image below shows the CPU usage of my server over the last 24 hours. There's a small gap, which is when I configured my server to drop all connections that had the user agent "ClaudeBot". When ClaudeBot is allowed, usage sits at roughly 90%. When ClaudeBot is denied, usage sits at around 20%.

I contacted the operators of ClaudeBot, informing them of the useless queries. They replied quite fast, and courteously. Here is an excerpt I think is relevant: ...
Therefore, I think it is important that ...
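The user-agent blocking described above (dropping all ClaudeBot connections) is typically done at the reverse proxy. A minimal nginx sketch, assuming nginx fronts Gitea on 127.0.0.1:3000 (the hostname, port, and bot list here are assumptions, not what the commenter actually deployed):

```nginx
# http-level map: flag requests from known LLM scrapers by user agent.
map $http_user_agent $is_llm_bot {
    default        0;
    ~*ClaudeBot    1;
    ~*GPTBot       1;
}

server {
    listen 80;
    server_name git.example.com;           # assumed hostname

    location / {
        # Reject flagged bots before the request ever reaches Gitea,
        # sparing the database the expensive queries described above.
        if ($is_llm_bot) {
            return 403;
        }
        proxy_pass http://127.0.0.1:3000;  # assumed Gitea backend
    }
}
```

Note this only helps against bots that send an honest User-Agent header; it does nothing for crawlers that spoof a browser UA.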
I had to close public access to my privately hosted Gitea repositories too, due to bot traffic exceeding my server's capabilities around mid-2024. I am not sure that ...
Feature Description
Gitea is a magnet for search engines, which, once they find an instance, are very happy to follow all the links on the site, of which there are many, resulting in never-ending indexer bot traffic. Among the links followed are UI buttons (star a page, sort by XYZ, select a UI language, ...), as well as pages that are expensive to render but don't provide much value once indexed (blame, compare, commit, ...).
Ideally, these pages would not be crawled or indexed at all.
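To illustrate what the proposal amounts to in markup, here is a hypothetical Go helper (not Gitea's actual template code) that renders a UI link carrying the rel="nofollow" attribute, which tells well-behaved crawlers not to follow it:

```go
package main

import (
	"fmt"
	"html"
)

// nofollowLink is a hypothetical helper, sketched for this issue: it
// emits an anchor tag with rel="nofollow" so crawlers skip UI links
// such as sort toggles or language selectors.
func nofollowLink(href, text string) string {
	return fmt.Sprintf(`<a href="%s" rel="nofollow">%s</a>`,
		html.EscapeString(href), html.EscapeString(text))
}

func main() {
	// prints <a href="/user/repo?sort=stars" rel="nofollow">Sort by stars</a>
	fmt.Println(nofollowLink("/user/repo?sort=stars", "Sort by stars"))
}
```

In Gitea itself this would presumably live in the HTML templates rather than a helper function; the sketch only shows the attribute the issue asks for.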
I tried to accomplish this on my site via a robots.txt along the following lines, but was not exactly successful, probably because many bots don't understand the wildcard syntax: ...

A better approach would be to render most links with the rel="nofollow" attribute. I'd argue this could be applied to all links, except for links to ...

Screenshots
No response
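The wildcard-style robots.txt mentioned in the description might have looked something like the following; the original rules were not preserved in this thread, so these paths are illustrative assumptions based on the pages named above (blame, compare, commit, sort/language toggles):

```
User-agent: *
Disallow: /*/*/blame/
Disallow: /*/*/commit/
Disallow: /*/*/compare/
Disallow: /*?sort=
Disallow: /*?lang=
```

As the description notes, `*` wildcards inside Disallow paths are an extension (honored by Googlebot and some others) rather than part of the original robots.txt convention, so many crawlers ignore them, which is exactly why the rel="nofollow" approach was proposed instead.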