Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hashset-mode into ACL rules #627

Merged
merged 2 commits into from
Sep 18, 2021
Merged

Conversation

Y0ba
Copy link
Contributor

@Y0ba Y0ba commented Sep 18, 2021

I have about 320000 hostnames rules and with regexps both performance and memory consumption are abysmal. Hashsets are much faster and smaller in my case. Memory decreased from 2.8GB to 35MB. It's possible to increase performance even further - replace default hasher with ahash, but since it requires to add new package I'll leave that decision for you.

I added hashsets' parsing in backward-compatible fashion, so old rules will work without any changes. Also I added an optimization to regexp matcher. Since all dns records are encoded in ASCII, there is no need for unicode support in regexes.

@zonyitoo
Copy link
Collaborator

It's possible to increase performance even further - replace default hasher with ahash, but since it requires to add new package I'll leave that decision for you.

Just do it.

@zonyitoo
Copy link
Collaborator

Shouldn't we support sub-domains? For example:

||example.com

matches example.com and its sub-domains.

@zonyitoo
Copy link
Collaborator

One of the collaborators indicates that ahash may have worse performance than the default harsher in multithreaded environment.

@Y0ba
Copy link
Contributor Author

Y0ba commented Sep 18, 2021

Shouldn't we support sub-domains?

Done. I've added some ad-hoc tree-like structure powered by hashmaps. Not sure how it fast is it relative to specialized collections. But in my (pathologic) case it's hundreds times faster than regexes.

Now |google.com will do exact matching and ||google.com matching with subdomains. Regexes are still here both for backward compatibility and some advanced matching. Also changed the way to disable unicode support. Now it doesn't modify regexes.

One of the collaborators indicates that ahash may have worse performance than the default harsher in multithreaded environment.

Ok. I'll leave this as is. Performance is already good enough for me.

@zonyitoo
Copy link
Collaborator

Excellent.

crates/shadowsocks-service/src/acl/mod.rs Outdated Show resolved Hide resolved
crates/shadowsocks-service/src/acl/mod.rs Outdated Show resolved Hide resolved
@zonyitoo zonyitoo merged commit 4500d46 into shadowsocks:master Sep 18, 2021
@dev4u
Copy link

dev4u commented Sep 18, 2021

这个改动好大,原来的acl文件格式兼容吗?

@dev4u
Copy link

dev4u commented Sep 18, 2021

2021-09-18 22:47:31.817 22378-31220/? E/libsslocal: loading ACL "/data/user_de/0/com.github.shadowsocks/no_backup/custom-rules.acl", [white_list] or [proxy_list] regex error: regex parse error:
2021-09-18 22:47:31.817 22378-31220/? E/libsslocal:     (^|\.)boxun.+\.azurewebsites\.net
2021-09-18 22:47:31.817 22378-31220/? E/libsslocal:                ^
2021-09-18 22:47:31.817 22378-31220/? E/libsslocal: error: pattern can match invalid UTF-8

这个正则有什么问题吗?

@IceCodeNew
Copy link

IceCodeNew commented Sep 18, 2021

这个正则有什么问题吗?

@dev4u

error: pattern can match invalid UTF-8

因为 regex 匹配时关闭了 Unicode 字符的匹配,所以这里不能用 . 来匹配任意字符
我粗略的思考是用 [a-zA-Z0-9-] 来替换 .。如果有失配的情况欢迎补充。

@Y0ba
Copy link
Contributor Author

Y0ba commented Sep 18, 2021

No, regex is fine. That's some quirk of Regex crate I didn't know before. I've already prepared patch and will fix it in followup pull request after some testing.

@dev4u
Copy link

dev4u commented Sep 18, 2021

这个正则有什么问题吗?

@dev4u

error: pattern can match invalid UTF-8

因为 regex 匹配时关闭了 Unicode 字符的匹配,所以这里不能用 . 来匹配任意字符
我粗略的思考是用 [a-zA-Z0-9-] 来替换 .。如果有失配的情况欢迎补充。

谢谢。我想问,那如果开启unicode匹配有什么隐患吗?
我的acl文件,是从网上第三方转换过来的。如果开启unicode匹配不会带来安全隐患,或许考虑一下重新开启,以最大程度兼容网上流传的acl列表。

@IceCodeNew
Copy link

这个正则有什么问题吗?

@dev4u

error: pattern can match invalid UTF-8

因为 regex 匹配时关闭了 Unicode 字符的匹配,所以这里不能用 . 来匹配任意字符
我粗略的思考是用 [a-zA-Z0-9-] 来替换 .。如果有失配的情况欢迎补充。

谢谢。我想问,那如果开启unicode匹配有什么隐患吗?
我的acl文件,是从网上第三方转换过来的。如果开启unicode匹配不会带来安全隐患,或许考虑一下重新开启,以最大程度兼容网上流传的acl列表。

PR 里面已经写得很清楚了,域名中并不会出现 ascii 字符以外的内容(如果要匹配诸如 域名.中国 这样的地址,需要转换为 punycode 再写入列表中

关闭 unicode 字符匹配可以提升匹配速度。

@Y0ba Y0ba deleted the add-acl-set branch September 18, 2021 15:42
@zonyitoo
Copy link
Collaborator

Using unicode in regex matching will definitely waste memory because domain names must be ASCII characters only (punycode).

#629 should fix this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants