Ruby gem to make sure that an IP really belongs to a bot, typically a search engine.
Suppose you have a Web request and you would like to check that it is not disguised:
```ruby
bot = Legitbot.bot(userAgent, ip)
```
`bot` will be `nil` if no bot signature was found in the User-Agent.
Otherwise, it will be an object with methods:

```ruby
bot.detected_as # => :google
bot.valid?      # => true
bot.fake?       # => false
```
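As a quick sketch, both cases can be handled together; the sample User-Agent and IP below are only illustrative:

```ruby
require 'legitbot'

ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
ip = '66.249.66.1' # illustrative address inside Google's published crawler ranges

bot = Legitbot.bot(ua, ip)
if bot.nil?
  # the User-Agent does not match any known bot signature
elsif bot.valid?
  puts bot.detected_as # the IP really belongs to the claimed bot, e.g. :google
else
  # the User-Agent is spoofed: the IP is outside the bot's published ranges
end
```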
Sometimes you already know which search engine to expect. For example, you might be using rack-attack:
Rack::Attack.blocklist("fake Googlebot") do |req|
req.user_agent =~ %r(Googlebot) && Legitbot::Google.fake?(req.ip)
end
Or if you do not like all those ghoulish crawlers stealing your content, evaluating it and getting ready to invade your site with spammers, then block them all:
```ruby
Rack::Attack.blocklist 'fake search engines' do |request|
  Legitbot.bot(request.user_agent, request.ip)&.fake?
end
```
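The same check also works outside rack-attack, for example as a Rails controller filter. This is only a sketch: the filter name and the `head :forbidden` response are illustrative choices, not part of the gem.

```ruby
class ApplicationController < ActionController::Base
  before_action :block_fake_bots

  private

  # Reject requests whose User-Agent claims to be a known bot
  # but whose IP is outside that bot's published ranges.
  def block_fake_bots
    bot = Legitbot.bot(request.user_agent, request.remote_ip)
    head :forbidden if bot&.fake?
  end
end
```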
This gem follows semantic versioning, with the following clarifications:
- MINOR version is incremented when support for new bots is added.
- PATCH version is incremented when validation logic for a bot changes (IP list updated, for example).
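Since PATCH releases refresh IP lists and MINOR releases add new bots, it usually makes sense to track both; a pessimistic Gemfile constraint like the one below does that (the version number is only an example):

```ruby
# Gemfile: follow new MINOR and PATCH releases to keep bot IP data current
gem 'legitbot', '~> 1.10'
```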
Supported bots:

- Ahrefs
- Amazonbot
- Amazon AdBot
- Applebot
- Baidu spider
- Bingbot
- BLEXBot (WebMeUp)
- DataForSEO
- DuckDuckGo bot
- Google crawlers
- IAS
- OpenAI GPTBot
- Oracle Data Cloud Crawler
- Marginalia
- Meta / Facebook Web crawlers
- Petal search engine
- Twitterbot (the list of IPs is on the Troubleshooting page)
- Yandex robots
- You.com
Licensed under the Apache License 2.0.
See also:

- Play Framework variant in Scala: play-legitbot
- Article When (Fake) Googlebots Attack Your Rails App
- Voight-Kampff is a Ruby gem that detects bots by User-Agent
- crawler_detect is a Ruby gem and Rack middleware to detect crawlers by a few different request headers, including User-Agent
- Project Honeypot's http:BL can not only classify an IP as a search engine, but also label it as suspicious and report the number of days since its last activity. My implementation of the protocol in Scala is here.
- CIDRAM is a PHP routing manager with built-in support to validate bots.