Switch from regex crate to regex-lite #4

Dav1dde · 2024-08-12T15:27:16Z

Switches from regex to regex-lite this massively improves the memory footprint. From ~180 MiB down to ~10 MiB while also heavily reducing allocations from a total of ~1.2 GiB down to ~72 MiB.

I did not make the change configurable via a cargo feature since I assume performance should stay roughly the same due to the optimized lookup using aho_corasick (also additive features are awkward for this). If you think these assumptions are wrong please let me know.

Note this is technically a breaking change, I removed the ParseError:: RegexTooLarge variant since regex_lite does not expose this information. To not make it breaking I can add the (unused) variant back in. But with the switch of the dependency it may make sense to release a 0.3 anyways.

I also removed the rewriting of \d character classes (it seems like this is fine for regex-filtered), since regex-lite already only matches on ascii and it does not support nested character classes, which break the regex.

I validated memory consumption and usage using the dhat crate:

`tmp/Cargo.toml`:

[workspace]

[package]
name = "tmp"
version = "0.1.0"
edition = "2021"

[dependencies]
dhat = "0.3"
ua-parser-cratesio = { package = "ua-parser" , version = "0.2" }
ua-parser = { path = "../ua-parser/" }
serde_yaml = "0.8"
humansize = "2"

`tmp/src/main.rs`:

#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

macro_rules! with_mem {
    ($e:expr) => {{
        let before = dhat::HeapStats::get();
        let result = { $e };
        let after = dhat::HeapStats::get();

        println!(
            "used memory `{}`: peak: {} | total: {} | current: {}",
            stringify!($e),
            humansize::format_size(after.max_bytes, humansize::DECIMAL),
            humansize::format_size(after.total_bytes.saturating_sub(before.total_bytes), humansize::DECIMAL),
            humansize::format_size(after.curr_bytes.saturating_sub(before.curr_bytes), humansize::DECIMAL),
        );

        result
    }};
}

fn main() {
    let _profiler = dhat::Profiler::new_heap();

    drop(with_mem!(uap_updated_lite()));
    drop(with_mem!(uap_crates_io_regex()));

    std::mem::forget(_profiler); // We're not interested in the profile.
}

static UA_REGEXES: &str = include_str!("../../ua-parser/uap-core/regexes.yaml");

fn uap_updated_lite() -> ua_parser::Extractor<'static> {
    let regexes: ua_parser::Regexes<'static> = serde_yaml::from_str(UA_REGEXES).unwrap();
    ua_parser::Extractor::try_from(regexes).unwrap()
}

fn uap_crates_io_regex() -> ua_parser_cratesio::Extractor<'static> {
    let regexes: ua_parser_cratesio::Regexes<'static> = serde_yaml::from_str(UA_REGEXES).unwrap();
    ua_parser_cratesio::Extractor::try_from(regexes).unwrap()
}

Output from my local system (cargo run --release):

used memory `uap_updated_lite()`: peak: 10.59 MB | total: 71.25 MB | current: 9.61 MB
used memory `uap_crates_io_regex()`: peak: 180.72 MB | total: 1.27 GB | current: 179.74 MB

masklinn · 2024-08-15T14:38:49Z

Heya, thanks for your interest!

Switches from regex to regex-lite this massively improves the memory footprint. From ~180 MiB down to ~10 MiB while also heavily reducing allocations from a total of ~1.2 GiB down to ~72 MiB.

That's very interesting.

If you think these assumptions are wrong please let me know.

Sadly they do seem to be: while the prefiltering does cut down significantly on the number of regexes to test, it usually doesn't remove it entirely. I still need to setup proper stats tracking (amongst other things) but IIRC on the samples I use with the devices regexes from the core project the number of regexes after prefiltering is usually 10-15, so the regex engine does get exercised a fair bit (that's what the sole "proper" rust-level benchmark actually looked at, I needed to know the relative perf difference between matching and capturing both failure and success to decide whether the second step should use matching or whether it could capture directly with a half-redundant successful capture at the end).

On the ad-hoc and under-specified benches I've been using to track the differential between re2 and regex-filtered (633 regexes, a sample file of 75158 lines, and 100 iterations — so 7.5 million lines total), on macOS 14.5, using time -l for rough tracking:

re2 completes in 47.84 seconds real (157102 per second), with a peak mem of 45057856 (43MB)
main completes in 42.77s (175725 per second), with a peak mem of 140658176 (134MB)
regex-lite completes in 766.31s (9808 per second), with a peak mem of 29230528 (28MB)

The gain in memory consumption (and apparently allocations) is nice, but for me not at the cost of a ~15x performance hit and falling completely behind re2, at least not by default. Although as a minor point of interest the regex-lite version has twice the IPC (retired) of the other two (re2 is at 3.5, main is at 3.0, regex-lite is at 6.1).

Now I can see that being an opt in specifically for memory constrained environment, if your project is already using regex-lite in a context where prefiltering makes sense then it'll still be an improvement. And made more palatable by the API being compatible so this would be a pretty reasonable conditional compilation thing. But definitely not by default.

Furthermore although UAP only needs ascii support, and I would like an "ascii mode" to be available in regex (as in PCRE style characters classes being opt-in ASCII) to not have to rewrite regexes on the fly as I do now, I do think the excellent unicode support of regex is an asset to the ecosystem and should be kept available out of the box for regex-filtered.

it seems like this is fine for regex-filtered

That may make the prefilter worse for the same "frontend": larger classes make it more likely for the symbolic evaluator to bail on the atoms-set it's building because it's too large and doesn't provide much predication power anymore.

Although that would depend on the actual expression and I've not looked at it in any way. I would mostly expect that to be a minor edge case.

Dav1dde · 2024-08-15T15:52:17Z

I've also been checking the performance of regex-lite and I did not expect that much of a difference to the regex crate I must admit. I agree the performance hit here is too large. As opt in it still may make sense, although supporting this may be a bit awkward.

What may make sense and I still have to experiment with is to disable unicode support explicitly on the Regex builder (at least opt-in), this showed in some preliminary tests a memory consumption of around double of regex-lite but this is still quite a bit lower than without. Which honestly was a bit surprising to me since the regex themself don't even match on unicode.

Dav1dde · 2024-08-15T16:14:42Z

With a very hacked together version:

used memory `uap_lite()`: peak: 9.89 MiB | total: 66.18 MiB | current: 8.96 MiB
used memory `uap_no_unicode()`: peak: 26.47 MiB | total: 185.64 MiB | current: 25.52 MiB
used memory `uap()`: peak: 166.76 MiB | total: 1.13 GiB | current: 165.83 MiB

Done by switching to bytes::Regex with unicode(false), which is also a huge improvement.

I had to switch to bytes::Regex because of:

called `Result::unwrap()` on an `Err` value: ParseError(SyntaxError("regex parse error:
    (ArcGIS Pro)(?: ([0-9]+)\\.([0-9]+)\\.([^ ]+)|)
                                             ^^^^
    error: pattern can match invalid UTF-8")
)

But that might be fixable without switching to bytes::Regex.

masklinn · 2024-08-15T17:12:04Z

Yeah I sought the help of Andrew a few months back to improve the memory use of regex-filtered (without sacrificing runtime): rust-lang/regex#1206 though this was mostly comparing regex to re2.

That's what led to the regex rewriting you found, in 29b9195.

I considered using regex::bytes but it seemed like the gain seemed pretty small for the hassle in terms of interface. That is, mostly, having to deal with bytes extracted data. And as you discovered unicode(false) without bytes is not really an option because it breaks many patterns which UAP uses.

If I may, what's your interest / use case, is it an independent use of regex-filtered, or is it uap?

But that might be fixable without switching to bytes::Regex.

I'm not sure it is, it would at the very least require extremely complicated rewriting rules.

Dav1dde · 2024-08-15T17:48:56Z

If I may, what's your interest / use case, is it an independent use of regex-filtered, or is it uap?

I went down a rabbit hole of optimizing memory consumption in relay and most of it is due to regexes which includes the user agent parsing we have bundled. We're using uaparser (not this crate) and while researching a bit what is possible I found this crate (ua-parser) and was delighted by the regex-filtered work (until you replied I assumed this was just a helper crate for ua-parser) and tried switching to this crate and noticed it had the same memory issues.

In my testing on how to optimize regexes I also included some regex-lite tests and it seemed like a very good drop-in replacement solving most of the memory issues. Tried it out on this crate and was a bit blown away by the improvements.

tldr; interested it in uap stayed because of regex-filtered.

I'm not sure it is, it would at the very least require extremely complicated rewriting rules.

And definitely not future proof.

I considered using regex::bytes but it seemed like the gain seemed pretty small for the hassle in terms of interface.

Do you mean performance or memory consumption here? It only really gets an edge when disabling unicode.

That is, mostly, having to deal with bytes extracted data.

Yeah it will require some fallible String::from_utf8's.

masklinn · 2024-08-15T21:11:27Z

Do you mean performance or memory consumption here? It only really gets an edge when disabling unicode.

Memory, you can read the discussion thread I linked to and the most massive memory gains I observed back then (compared to string/unicode) had to do with bounded repetitions. Which I decided to rewrite to unbounded when large enough to seem non-semantic.

masklinn · 2024-08-15T21:14:57Z

Incidentally for relay maybe paring down the regexes.yaml would be an option? I think there’s a lot of old regexes which might not see much use anymore and may not be of interest depending how precisely you want to track agents.

Switch from regex crate to regex-lite

fa6bd76

Dav1dde force-pushed the regex-lite branch from c15e850 to fa6bd76 Compare August 12, 2024 15:40

Dav1dde mentioned this pull request Aug 14, 2024

Switch to or support regex-lite oceanlewis/uap-rs#18

Open

Dav1dde closed this Aug 15, 2024

masklinn mentioned this pull request Aug 17, 2024

Parameterize regex engine #5

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch from regex crate to regex-lite #4

Switch from regex crate to regex-lite #4

Dav1dde commented Aug 12, 2024 •

edited

Loading

masklinn commented Aug 15, 2024 •

edited

Loading

Dav1dde commented Aug 15, 2024

Dav1dde commented Aug 15, 2024

masklinn commented Aug 15, 2024 •

edited

Loading

Dav1dde commented Aug 15, 2024

masklinn commented Aug 15, 2024

masklinn commented Aug 15, 2024

Switch from regex crate to regex-lite #4

Switch from regex crate to regex-lite #4

Conversation

Dav1dde commented Aug 12, 2024 • edited Loading

tmp/Cargo.toml:

tmp/src/main.rs:

masklinn commented Aug 15, 2024 • edited Loading

Dav1dde commented Aug 15, 2024

Dav1dde commented Aug 15, 2024

masklinn commented Aug 15, 2024 • edited Loading

Dav1dde commented Aug 15, 2024

masklinn commented Aug 15, 2024

masklinn commented Aug 15, 2024

Dav1dde commented Aug 12, 2024 •

edited

Loading

`tmp/Cargo.toml`:

`tmp/src/main.rs`:

masklinn commented Aug 15, 2024 •

edited

Loading

masklinn commented Aug 15, 2024 •

edited

Loading