Switch from regex crate to regex-lite #4
Heya, thanks for your interest!
That's very interesting.
Sadly they do seem to be: while the prefiltering does cut down significantly on the number of regexes to test, it usually doesn't eliminate them entirely. I still need to set up proper stats tracking (amongst other things), but IIRC on the samples I use with the device regexes from the core project, the number of regexes left after prefiltering is usually 10-15, so the regex engine does get exercised a fair bit. That's what the sole "proper" Rust-level benchmark actually looked at: I needed to know the relative perf difference between matching and capturing, on both failure and success, to decide whether the second step should use matching or whether it could capture directly with a half-redundant successful capture at the end. On the ad-hoc and under-specified benches I've been using to track the differential between re2 and regex-filtered (633 regexes, a sample file of 75158 lines, and 100 iterations, so 7.5 million lines total), on macOS 14.5, using
The gain in memory consumption (and apparently allocations) is nice, but for me not at the cost of a ~15x performance hit and falling completely behind re2, at least not by default. Although, as a minor point of interest, the regex-lite version has twice the IPC (retired) of the other two (re2 is at 3.5, main is at 3.0, regex-lite is at 6.1). Now, I can see that being an opt-in specifically for memory-constrained environments: if your project is already using regex-lite in a context where prefiltering makes sense, then it'll still be an improvement. It is made more palatable by the API being compatible, so this would be a pretty reasonable conditional-compilation thing. But definitely not by default. Furthermore, although UAP only needs ASCII support, and I would like an "ASCII mode" to be available in regex (as in PCRE-style character classes being opt-in ASCII) so that I don't have to rewrite regexes on the fly as I do now, I do think the excellent Unicode support of
That may make the prefilter worse for the same "frontend": larger classes make it more likely for the symbolic evaluator to bail on the atoms-set it's building because it's too large and doesn't provide much predication power anymore. Although that would depend on the actual expression and I've not looked at it in any way. I would mostly expect that to be a minor edge case.
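For readers skimming the thread, here is a rough, std-only sketch of the two-stage idea being discussed: a cheap prefilter on literal atoms, then the expensive matcher on the survivors only. Everything in it is hypothetical (`Entry`, `prefilter`, `first_match`, `sample_entries`, and the plain functions standing in for compiled regexes); it is not how regex-filtered is actually implemented, only an illustration of why the regex engine still gets exercised after prefiltering.

```rust
/// Hypothetical entry: the literal atoms that must all occur in the haystack
/// for the pattern to have a chance of matching, plus a stand-in for the
/// compiled regex (a plain function here, to keep the sketch std-only).
struct Entry {
    atoms: Vec<&'static str>,
    full_match: fn(&str) -> bool,
}

/// Stage 1: keep the indices of entries whose atoms all occur in the haystack.
/// An entry without atoms can never be ruled out, so it always survives.
fn prefilter(entries: &[Entry], haystack: &str) -> Vec<usize> {
    entries
        .iter()
        .enumerate()
        .filter(|(_, e)| e.atoms.iter().all(|a| haystack.contains(a)))
        .map(|(i, _)| i)
        .collect()
}

/// Stage 2: run the expensive matcher on the survivors only and return the
/// index of the first entry that really matches.
fn first_match(entries: &[Entry], haystack: &str) -> Option<usize> {
    prefilter(entries, haystack)
        .into_iter()
        .find(|&i| (entries[i].full_match)(haystack))
}

// Stand-ins for compiled regexes (assumed patterns, for illustration only).

/// Roughly `iPhone.*Safari`.
fn iphone_then_safari(h: &str) -> bool {
    match h.find("iPhone") {
        Some(i) => h[i..].contains("Safari"),
        None => false,
    }
}

/// Roughly `Android [0-9]`.
fn android_version(h: &str) -> bool {
    h.split("Android ")
        .skip(1)
        .any(|rest| rest.starts_with(|c: char| c.is_ascii_digit()))
}

/// Roughly `[0-9]`: no usable literal atom, so it is never prefiltered out.
fn any_digit(h: &str) -> bool {
    h.chars().any(|c| c.is_ascii_digit())
}

fn sample_entries() -> Vec<Entry> {
    vec![
        Entry { atoms: vec!["iPhone", "Safari"], full_match: iphone_then_safari },
        Entry { atoms: vec!["Android "], full_match: android_version },
        Entry { atoms: vec![], full_match: any_digit },
    ]
}
```

Note that a haystack such as "Safari then iPhone" passes the prefilter for the first entry (both atoms are present) but fails the real match, which is the kind of case where the second stage does real work.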
I've also been checking the performance of What may make sense, and what I still have to experiment with, is to disable Unicode support explicitly on the Regex builder (at least as an opt-in); in some preliminary tests this showed a memory consumption of around double that of
With a very hacked together version:
Done by switching to I had to switch to
But that might be fixable without switching to
Yeah, I sought the help of Andrew a few months back to improve the memory use of regex-filtered (without sacrificing runtime): rust-lang/regex#1206, though this was mostly comparing regex to re2. That's what led to the regex rewriting you found, in 29b9195. I considered using If I may, what's your interest / use case: is it an independent use of regex-filtered, or is it uap?
I'm not sure it is, it would at the very least require extremely complicated rewriting rules.
I went down a rabbit hole of optimizing memory consumption in relay, and most of it is due to regexes, which includes the user agent parsing we have bundled. We're using In my testing on how to optimize regexes I also included some tl;dr: interested in
And definitely not future proof.
Do you mean performance or memory consumption here? It only really gets an edge when disabling unicode.
Yeah it will require some fallible
Memory. You can read the discussion thread I linked to; the most massive memory gains I observed back then (compared to string/unicode) had to do with bounded repetitions, which I decided to rewrite to unbounded when the bound was large enough to seem non-semantic.
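To make the bounded-repetition rewrite concrete, here is a minimal std-only sketch. It is not the code from regex-filtered: the function names, the `threshold` parameter, and the exact rules (`{0,n}` to `*`, `{1,n}` to `+`, `{m,n}` to `{m,}`) are assumptions for illustration, and the character-class tracking is deliberately naive (no nested classes, no leading literal `]`).

```rust
/// Parse a `{min,max}` repetition at the start of `chars` (which must begin
/// with `{`). Returns (min, max, number of chars consumed), or None for any
/// other form such as `{3}` or `{3,}`.
fn parse_bounds(chars: &[char]) -> Option<(u32, u32, usize)> {
    let close = chars.iter().position(|&c| c == '}')?;
    let body: String = chars[1..close].iter().collect();
    let (min, max) = body.split_once(',')?;
    Some((min.parse().ok()?, max.parse().ok()?, close + 1))
}

/// Hypothetical rewrite: turn `{min,max}` into an unbounded repetition when
/// `max >= threshold`, on the assumption that such a large bound is not
/// meaningful. Escapes and character-class contents are copied untouched.
fn relax_repetitions(pattern: &str, threshold: u32) -> String {
    let chars: Vec<char> = pattern.chars().collect();
    let mut out = String::with_capacity(pattern.len());
    let mut in_class = false;
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        match c {
            // Copy escapes verbatim so `\{` and `\[` are never misread.
            '\\' if i + 1 < chars.len() => {
                out.push(c);
                out.push(chars[i + 1]);
                i += 2;
                continue;
            }
            '[' if !in_class => in_class = true,
            ']' if in_class => in_class = false,
            '{' if !in_class => {
                if let Some((min, max, len)) = parse_bounds(&chars[i..]) {
                    if max >= threshold {
                        match min {
                            0 => out.push('*'),
                            1 => out.push('+'),
                            _ => out.push_str(&format!("{{{},}}", min)),
                        }
                        i += len;
                        continue;
                    }
                }
            }
            _ => {}
        }
        out.push(c);
        i += 1;
    }
    out
}
```

With a threshold of 50, `\d{1,100}` would become `\d+` while `b{2,3}` is left alone. The rewritten pattern accepts strictly more input than the original, so this only makes sense when the upper bound really is incidental.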
Incidentally, for relay maybe paring down the regexes.yaml would be an option? I think there are a lot of old regexes which might not see much use anymore and may not be of interest, depending on how precisely you want to track agents.
Switches from `regex` to `regex-lite`, which massively improves the memory footprint: from ~180 MiB down to ~10 MiB, while also heavily reducing allocations from a total of ~1.2 GiB down to ~72 MiB.

I did not make the change configurable via a cargo feature since I assume performance should stay roughly the same due to the optimized lookup using `aho_corasick` (also additive features are awkward for this). If you think these assumptions are wrong please let me know.

Note this is technically a breaking change: I removed the `ParseError::RegexTooLarge` variant since `regex_lite` does not expose this information. To not make it breaking I can add the (unused) variant back in. But with the switch of the dependency it may make sense to release a 0.3 anyways.

I also removed the rewriting of `\d` character classes (it seems like this is fine for `regex-filtered`), since `regex-lite` already only matches on ASCII and it does not support nested character classes, which break the regex.

I validated memory consumption and usage using the `dhat` crate:

`tmp/Cargo.toml`:

`tmp/src/main.rs`:

Output from my local system (`cargo run --release`):