
feat(plugin): Add new plugin ua-restriction for bot spider restriction #4587

Merged (14 commits) on Jul 21, 2021
177 changes: 177 additions & 0 deletions apisix/plugins/bot-restriction.lua
@@ -0,0 +1,177 @@
--
-- Licensed to the Apache Software Foundation (ASF) under one or more
-- contributor license agreements. See the NOTICE file distributed with
-- this work for additional information regarding copyright ownership.
-- The ASF licenses this file to You under the Apache License, Version 2.0
-- (the "License"); you may not use this file except in compliance with
-- the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.
--
local ipairs = ipairs
local core = require("apisix.core")
local stringx = require("pl.stringx")
local type = type
local str_strip = stringx.strip
local re_find = ngx.re.find

local MATCH_NONE = 0
local MATCH_ALLOW = 1
local MATCH_DENY = 2
local MATCH_BOT = 3

local lrucache_useragent = core.lrucache.new({ ttl = 300, count = 1024 })
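-- lrucache_useragent memoizes match_user_agent results per User-Agent
-- (up to 1024 entries, 300-second TTL, per worker process), so the regex
-- list below is evaluated at most once per distinct UA.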

local schema = {
    type = "object",
    properties = {
        message = {
            type = "string",
            minLength = 1,
            maxLength = 1024,
            default = "Not allowed"
        },
        whitelist = {
            type = "array",
            minItems = 1
        },
        blacklist = {
Contributor: What about allowlist and blocklist? We should avoid using these sensitive words.

Contributor Author: OK.
type = "array",
minItems = 1
},
},
additionalProperties = false,
}
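-- Illustrative conf accepted by this schema:
--   { message = "Not allowed", whitelist = { "my-bot1" }, blacklist = { "my-bot2" } }
-- additionalProperties = false means any other field fails check_schema.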

local plugin_name = "bot-restriction"
Member: The name bot-restriction is confusing. It just checks the UA. What about renaming it to ua-restriction?

Contributor Author (@arthur-zhang, Jul 12, 2021): I do not think so; this plugin is for spider detection, and it includes the most common spider UAs.

Member: Real bot detection in the industry is not just spider detection and a UA check.

Contributor Author (@arthur-zhang, Jul 12, 2021): The plugin targets common spiders such as BaiduSpider and 360Spider, plus detection of some dev tools. We would have to use a product from a professional security company for the feature you mentioned.

Contributor Author: I found a similar feature in KrakenD: https://www.krakend.io/docs/throttling/botdetector/

Member: Don't believe that nonsense. This approach can only kick out script kiddies; for a real hacker, checking the UA is definitely not enough.

> We have to use the product of professional security company to do the feature you mentioned.

That's it. A real bot detection system should be as professional as those products, instead of just doing UA checks and declaring the problem solved. People will laugh at APISIX. We should provide a mechanism that a professional security company can use to build a gateway, not declare that we are a security gateway.

Contributor Author: Would it be more suitable to rename the plugin to ua-restriction and remove the hard-coded UA list?

Member: Of course, yes.

Contributor Author: Got it.


local _M = {
    version = 0.1,
    priority = 2999,
    name = plugin_name,
    schema = schema,
}
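-- priority 2999 places this plugin between ip-restriction (3000) and
-- referer-restriction (2990); see conf/config-default.yaml below.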

-- List taken from https://github.com/ua-parser/uap-core/blob/master/regexes.yaml
local well_known_bots = {
Member: I think we should not hard-code the UA list, as it cannot be updated in time. It would be better to provide the mechanism, not the tool, for checking the UA.

Contributor Author: Most spider bot UAs contain "bot", "spider", or "crawler", or identify a dev HTTP client; the regex list covers the most common cases and will not need to be updated very frequently.

Member: But it will be updated, won't it? Better to require users to choose their own list instead of shipping a stale one.

Contributor Author: Users can extend the list through the whitelist or blacklist configuration to cover UAs not included in our package.

Member: So why ship a stale one and ask the user to update it?

Contributor Author: We can support other plugins for client restriction by UA, IP, or other information. This plugin is just for users who do not want to add a bunch of UA regex rules.

Contributor Author: This plugin exists only to simplify usage. If users do not want it, they can use another restriction plugin and modify the rules as they want.

[[(Pingdom\.com_bot_version_)(\d+)\.(\d+)]],
[[(facebookexternalhit)/(\d+)\.(\d+)]],
[[Google.{0,50}/\+/web/snippet]],
[[(NewRelicPinger)/(\d+)\.(\d+)]],
[[\b(Boto3?|JetS3t|aws-(?:cli|sdk-(?:cpp|go|java|nodejs|ruby2?|dotnet-(?:\d{1,2}|c]]
.. [[ore)))|s3fs)/(\d+)\.(\d+)(?:\.(\d+)|)]],
[[ PTST/\d+(?:\.)?\d+$]],
[[/((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|T]]
.. [[ailsweep)[ \-](\d+)(?:\.(\d+)(?:\.(\d+))?)?]],
[[\b(008|Altresium|Argus|BaiduMobaider|BoardReader|DNSGroup|DataparkSearch|EDI|Goo]]
.. [[dzer|Grub|INGRID|Infohelfer|LinkedInBot|LOOQ|Nutch|OgScrper|PathDefender|Peew|Po]]
.. [[stPost|Steeler|Twitterbot|VSE|WebCrunch|WebZIP|Y!J-BR[A-Z]|YahooSeeker|envolk|sp]]
.. [[roose|wminer)/(\d+)(?:\.(\d+)|)(?:\.(\d+)|)]],
[[(MSIE) (\d+)\.(\d+)([a-z]\d|[a-z]|);.{0,200} MSIECrawler]],
[[(Google-HTTP-Java-Client|Apache-HttpClient|Go-http-client|scalaj-http|http%20cli]]
.. [[ent|Python-urllib|HttpMonitor|TLSProber|WinHTTP|JNLP|okhttp|aihttp|reqwest|axios]]
.. [[|unirest-(?:java|python|ruby|nodejs|php|net))(?:[ /](\d+)(?:\.(\d+)|)(?:\.(\d+)|]]
.. [[)|)]],
[[(CSimpleSpider|Cityreview Robot|CrawlDaddy|CrawlFire|Finderbots|Index crawler|Jo]]
.. [[b Roboter|KiwiStatus Spider|Lijit Crawler|QuerySeekerSpider|ScollSpider|Trends C]]
.. [[rawler|USyd-NLP-Spider|SiteCat Webbot|BotName\/\$BotVersion|123metaspider-Bot|14]]
.. [[70\.net crawler|50\.nu|8bo Crawler Bot|Aboundex|Accoona-[A-z]{1,30}-Agent|AdsBot]]
.. [[-Google(?:-[a-z]{1,30}|)|altavista|AppEngine-Google|archive.{0,30}\.org_bot|arch]]
.. [[iver|Ask Jeeves|[Bb]ai[Dd]u[Ss]pider(?:-[A-Za-z]{1,30})(?:-[A-Za-z]{1,30}|)|bing]]
.. [[bot|BingPreview|blitzbot|BlogBridge|Bloglovin|BoardReader Blog Indexer|BoardRead]]
.. [[er Favicon Fetcher|boitho.com-dc|BotSeer|BUbiNG|\b\w{0,30}favicon\w{0,30}\b|\bYe]]
.. [[ti(?:-[a-z]{1,30}|)|Catchpoint(?: bot|)|[Cc]harlotte|Checklinks|clumboot|Comodo ]]
.. [[HTTP\(S\) Crawler|Comodo-Webinspector-Crawler|ConveraCrawler|CRAWL-E|CrawlConver]]
.. [[a|Daumoa(?:-feedfetcher|)|Feed Seeker Bot|Feedbin|findlinks|Flamingo_SearchEngin]]
.. [[e|FollowSite Bot|furlbot|Genieo|gigabot|GomezAgent|gonzo1|(?:[a-zA-Z]{1,30}-|)Go]]
.. [[oglebot(?:-[a-zA-Z]{1,30}|)|Google SketchUp|grub-client|gsa-crawler|heritrix|Hid]]
.. [[denMarket|holmes|HooWWWer|htdig|ia_archiver|ICC-Crawler|Icarus6j|ichiro(?:/mobil]]
.. [[e|)|IconSurf|IlTrovatore(?:-Setaccio|)|InfuzApp|Innovazion Crawler|InternetArchi]]
.. [[ve|IP2[a-z]{1,30}Bot|jbot\b|KaloogaBot|Kraken|Kurzor|larbin|LEIA|LesnikBot|Lingu]]
.. [[ee Bot|LinkAider|LinkedInBot|Lite Bot|Llaut|lycos|Mail\.RU_Bot|masscan|masidani_]]
.. [[bot|Mediapartners-Google|Microsoft .{0,30} Bot|mogimogi|mozDex|MJ12bot|msnbot(?:]]
.. [[-media {0,2}|)|msrbot|Mtps Feed Aggregation System|netresearch|Netvibes|NewsGato]]
.. [[r[^/]{0,30}|^NING|Nutch[^/]{0,30}|Nymesis|ObjectsSearch|OgScrper|Orbiter|OOZBOT|]]
.. [[PagePeeker|PagesInventory|PaxleFramework|Peeplo Screenshot Bot|PlantyNet_WebRobo]]
.. [[t|Pompos|Qwantify|Read%20Later|Reaper|RedCarpet|Retreiver|Riddler|Rival IQ|scoot]]
.. [[er|Scrapy|Scrubby|searchsight|seekbot|semanticdiscovery|SemrushBot|Simpy|SimpleP]]
.. [[ie|SEOstats|SimpleRSS|SiteCon|Slackbot-LinkExpanding|Slack-ImgProxy|Slurp|snappy]]
.. [[|Speedy Spider|Squrl Java|Stringer|TheUsefulbot|ThumbShotsBot|Thumbshots\.ru|Tin]]
.. [[y Tiny RSS|Twitterbot|WhatsApp|URL2PNG|Vagabondo|VoilaBot|^vortex|Votay bot|^voy]]
.. [[ager|WASALive.Bot|Web-sniffer|WebThumb|WeSEE:[A-z]{1,30}|WhatWeb|WIRE|WordPress|]]
.. [[Wotbox|www\.almaden\.ibm\.com|Xenu(?:.s|) Link Sleuth|Xerka [A-z]{1,30}Bot|yacy(]]
.. [[?:bot|)|YahooSeeker|Yahoo! Slurp|Yandex\w{1,30}|YodaoBot(?:-[A-z]{1,30}|)|Yottaa]]
.. [[Monitor|Yowedo|^Zao|^Zao-Crawler|ZeBot_www\.ze\.bz|ZooShot|ZyBorg)(?:[ /]v?(\d+)]]
.. [[(?:\.(\d+)(?:\.(\d+)|)|)|)]],
[[(?:\/[A-Za-z0-9\.]+|) {0,5}([A-Za-z0-9 \-_\!\[\]:]{0,50}(?:[Aa]rchiver|[Ii]ndexe]]
.. [[r|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]{0,50}))[/ ](\d+)(?:\.(\d+)(?:\.(\d+)]]
.. [[|)|)]],
[[(?:\/[A-Za-z0-9\.]+|) {0,5}([A-Za-z0-9 \-_\!\[\]:]{0,50}(?:[Aa]rchiver|[Ii]ndexe]]
.. [[r|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]{0,50})) (\d+)(?:\.(\d+)(?:\.(\d+)|)|]]
.. [[)]],
[[((?:[A-z0-9]{1,50}|[A-z\-]{1,50} ?|)(?: the |)(?:[Ss][Pp][Ii][Dd][Ee][Rr]|[Ss]cr]]
.. [[ape|[Cc][Rr][Aa][Ww][Ll])[A-z0-9]{0,50})(?:(?:[ /]| v)(\d+)(?:\.(\d+)|)(?:\.(\d+]]
.. [[)|)|)]],
}
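-- The entries above are PCRE patterns; match_user_agent evaluates them with
-- ngx.re.find using the "jo" options (PCRE JIT plus the compiled-regex cache).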

local function match_user_agent(user_agent, conf)
    user_agent = str_strip(user_agent)
    if conf.whitelist then
        for _, rule in ipairs(conf.whitelist) do
            if re_find(user_agent, rule, "jo") then
                return MATCH_ALLOW
            end
        end
    end

    if conf.blacklist then
        for _, rule in ipairs(conf.blacklist) do
            if re_find(user_agent, rule, "jo") then
                return MATCH_DENY
            end
        end
    end

    for _, rule in ipairs(well_known_bots) do
        if re_find(user_agent, rule, "jo") then
            return MATCH_BOT
        end
    end

    return MATCH_NONE
end
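-- Example (illustrative): with conf = { whitelist = {"goodbot"}, blacklist = {"badbot"} },
-- "goodbot/1.0" returns MATCH_ALLOW, "badbot/1.0" returns MATCH_DENY,
-- "Twitterbot/1.0" hits the well-known list and returns MATCH_BOT, and a
-- typical browser UA falls through to MATCH_NONE.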

function _M.check_schema(conf)
    local ok, err = core.schema.check(schema, conf)

    if not ok then
        return false, err
    end

    return true
end

function _M.access(conf, ctx)
    local user_agent = core.request.header(ctx, "User-Agent")

    if not user_agent then
        return
    end
    -- ignore multiple instances of request headers
    if type(user_agent) == "table" then
        return
Member: Why ignore the UA?

Contributor Author: This corner case arises when the client sends multiple User-Agent headers, which makes the value a table. Almost no bot or HTTP client sends requests like this, so I think ignoring it is the better choice.

Member: So if they send another UA, the check can be bypassed? That is not a good idea, especially in an open source project...

Contributor Author: OK, I will check the table.

    end
    local match = lrucache_useragent(user_agent, conf, match_user_agent, user_agent, conf)
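    -- MATCH_DENY (2) and MATCH_BOT (3) both exceed MATCH_ALLOW (1), so
    -- blacklisted and well-known bot UAs are rejected below; MATCH_ALLOW
    -- and MATCH_NONE pass through.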

    if match > MATCH_ALLOW then
        return 403, { message = conf.message }
    end
end

return _M
1 change: 1 addition & 0 deletions conf/config-default.yaml
@@ -252,6 +252,7 @@ plugins:                          # plugin list (sorted by priority)
- batch-requests # priority: 4010
- cors # priority: 4000
- ip-restriction # priority: 3000
- bot-restriction # priority: 2999
- referer-restriction # priority: 2990
- uri-blocker # priority: 2900
- request-validation # priority: 2800
1 change: 1 addition & 0 deletions docs/en/latest/config.json
@@ -73,6 +73,7 @@
"plugins/cors",
"plugins/uri-blocker",
"plugins/ip-restriction",
"plugins/bot-restriction",
"plugins/referer-restriction",
"plugins/consumer-restriction"
]
130 changes: 130 additions & 0 deletions docs/en/latest/plugins/bot-restriction.md
@@ -0,0 +1,130 @@
---
title: bot-restriction
---

<!--
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
-->

## Summary

- [**Name**](#name)
- [**Attributes**](#attributes)
- [**How To Enable**](#how-to-enable)
- [**Test Plugin**](#test-plugin)
- [**Disable Plugin**](#disable-plugin)

## Name

The `bot-restriction` plugin restricts access to a Service or a Route by matching the
`User-Agent` header against a user-defined `whitelist` and `blacklist`, plus a built-in
list of well-known bots.

## Attributes

| Name      | Type          | Requirement | Default     | Valid     | Description                                |
| --------- | ------------- | ----------- | ----------- | --------- | ------------------------------------------ |
| whitelist | array[string] | optional    |             |           | List of whitelisted User-Agent patterns.   |
| blacklist | array[string] | optional    |             |           | List of blacklisted User-Agent patterns.   |
| message   | string        | optional    | Not allowed | [1, 1024] | Message returned when a request is denied. |

Both `whitelist` and `blacklist` are optional and can be used together. They are evaluated
in this order: whitelist, then blacklist, then the default well-known User-Agent list; the
first match wins. For example, a User-Agent that matches both lists is allowed, because the
whitelist is checked first.

The deny message can be user-defined.

## How To Enable

Create a Route or Service object and enable the `bot-restriction` plugin. The `whitelist` and `blacklist` entries are PCRE regular expressions, so backslashes must be escaped in JSON:

```shell
curl http://127.0.0.1:9080/apisix/admin/routes/1 -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d '
{
"uri": "/index.html",
"upstream": {
"type": "roundrobin",
"nodes": {
"127.0.0.1:1980": 1
}
},
"plugins": {
"bot-restriction": {
"whitelist": [
"my-bot1",
"(Baiduspider)/(\\d+)\\.(\\d+)"
],
"blacklist": [
"my-bot2",
"(Twitterspider)/(\\d+)\\.(\\d+)"
]
}
}
}'
```

By default, `{"message":"Not allowed"}` is returned when a request is rejected. You can configure a custom message in the plugin section:

```json
"plugins": {
"bot-restriction": {
"blacklist": [
"my-bot2",
"(Twitterspider)/(\\d+)\\.(\\d+)"
],
"message": "Do you want to do something bad?"
}
}
```
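
With this configuration, a request from a blacklisted User-Agent should receive the custom message (assuming the same route as above):

```shell
$ curl http://127.0.0.1:9080/index.html -i --header 'User-Agent: my-bot2'
HTTP/1.1 403 Forbidden
...
{"message":"Do you want to do something bad?"}
```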

## Test Plugin

Requests from a normal User-Agent:

```shell
$ curl http://127.0.0.1:9080/index.html -i
HTTP/1.1 200 OK
...
```

Requests from a bot User-Agent:

```shell
$ curl http://127.0.0.1:9080/index.html -i --header 'User-Agent: Twitterspider/2.0'
HTTP/1.1 403 Forbidden
```
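
A User-Agent matching the configured whitelist should still be allowed, since the whitelist is checked before the blacklist and the built-in bot list (illustrative, against the example route above):

```shell
$ curl http://127.0.0.1:9080/index.html -i --header 'User-Agent: Baiduspider/2.0'
HTTP/1.1 200 OK
...
```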

## Disable Plugin

To disable the `bot-restriction` plugin, simply remove the corresponding JSON configuration
from the plugin configuration. There is no need to restart the service; the change takes
effect immediately:

```shell
$ curl http://127.0.0.1:9080/apisix/admin/routes/1 -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d '
{
"uri": "/index.html",
"plugins": {},
"upstream": {
"type": "roundrobin",
"nodes": {
"39.97.63.215:80": 1
}
}
}'
```

The `bot-restriction` plugin is now disabled. The same method works for any other plugin.
1 change: 1 addition & 0 deletions docs/zh/latest/config.json
@@ -71,6 +71,7 @@
"plugins/cors",
"plugins/uri-blocker",
"plugins/ip-restriction",
"plugins/bot-restriction",
"plugins/referer-restriction",
"plugins/consumer-restriction"
]