Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switch regex engine from oniguruma to fancy-regex #18

Closed
bminixhofer opened this issue Feb 2, 2021 · 7 comments · Fixed by #36
Closed

switch regex engine from oniguruma to fancy-regex #18

bminixhofer opened this issue Feb 2, 2021 · 7 comments · Fixed by #36
Labels
enhancement New feature or request good first issue Good for newcomers P2 Medium priority

Comments

@bminixhofer
Copy link
Owner

I would like to switch from rust-onig to fancy-regex.

This would probably come with a speedup and remove the last non-Rust dependency. This is nice in general and would enable compiling to WebAssembly.

Changing this in NLPRule would be easy but it is currently blocked by fancy-regex/fancy-regex#59 and fancy-regex/fancy-regex#49.

@bminixhofer bminixhofer added enhancement New feature or request good first issue Good for newcomers P2 Medium priority labels Feb 2, 2021
@robinst
Copy link

robinst commented Feb 15, 2021

This would probably come with a speedup

I'd be curious to see if there's a speedup or not. Keep in mind that Oniguruma is highly-optimized (for its type of regex engine). The situations where I could see fancy-regex win is for patterns that can be delegated to the regex crate whereas onig needs backtracking.

@bminixhofer
Copy link
Owner Author

I did some early comparisons in the meantime and my findings are roughly consistent with trishume/syntect#34. I get a ~15% slowdown from switching tofancy-regex. I'll do some more investigation. The vast majority of regexes nlprule runs should be delegated to the regex crate, there might be a few fancy regexes which take most of the time where the result could be cached.

I also already pre-compute lots of regex matches (making speed pretty much irrelevant for those) but there is room for improvement there too which should decrease the overall % of time spent on regex matching.

I definitely want to switch to fancy-regex. If things don't work out I might implement something like trishume/syntect#270 but in general I'm OK with slightly slower regex matching if the next release is in total faster than the previous release.

@robinst
Copy link

robinst commented Feb 16, 2021

If you have a particular regex that's delegated to regex and is unexpectedly slow, we can also ask burntsushi to have a look, he's very helpful.

@bminixhofer
Copy link
Owner Author

bminixhofer commented Feb 19, 2021

There is now a PR for this: #36. I opted for the solution from trishume/syntect#270 which works quite well. I already had a wrapper around the Regex anyway for serialization so this didn't add a lot of complexity and was quite easy to implement.

At the moment I still have some problems with mismatches between oniguruma and fancy-regex / regex:

  1. Oniguruma has better case folding support
fn main() {
    let regex_fancy = regex::Regex::new(r"(?i)ss|fi").unwrap();
    let regex_onig = onig::Regex::new(r"(?i)ss|fi").unwrap();

    for text in &["ß", "ss", "fi", "fi"] {
        println!(
            "{}\t{}\t{}",
            text,
            regex_fancy.is_match(text),
            regex_onig.is_match(text)
        );
    }
}

prints

ß       false   true
ss      true    true
fi      false   true
fi       true    true
  1. Unicode property classes in a case-insensitive oniguruma regex are still case sensitive:
fn main() {
    let regex_fancy = regex::Regex::new(r"(?i)\p{Lu}").unwrap();
    let regex_onig = onig::Regex::new(r"(?i)\p{Lu}").unwrap();

    for text in &["A", "a"] {
        println!(
            "{}\t{}\t{}",
            text,
            regex_fancy.is_match(text),
            regex_onig.is_match(text)
        );
    }
}

prints

A       true    true
a       true    false

It is enough to reliably detect and disable regexes with 1.) since they are only a few but I need a fix for 2.). I tried to sort of escape the \p{Lu} by adding (?-i) before and and (?i) afterwards but I don't think that works in every case e. g. inside [] sets. I also don't know how to reliably detect 1.) - I don't think that's trivial.

@bminixhofer
Copy link
Owner Author

bminixhofer commented Feb 19, 2021

I think I can get around both of these issues by using regex-syntax to naively construct lowercase regexes by replacing e g. a with [aA] for all literals instead of using the (?i) flag. I am parsing regexes which are also used in a Java project, this behavior would probably be closest to how it behaves there.

@bminixhofer
Copy link
Owner Author

bminixhofer commented Feb 21, 2021

This is now solved by:

  1. a modular regex backend like in Move all regex usage to separate module to add support for fancy-regex trishume/syntect#270
  2. a function from_java_regex which uses regex-syntax to parse the regular expressions, fix errors (e. g. unnecessary escaped chars) and get them to a state in which both fancy-regex and Oniguruma do the same thing (e. g. removing the case-insensitive flag and instead naively making it case insensitive).

In the tests I evaluate approx. 20k regular expressions on ~100k inputs each, it's quite cool that fancy-regex behaves the same as Oniguruma in every case now.

Regarding speed:

I do not see a significant difference in the benchmark between fancy-regex and Oniguruma when running without parallelism. Curiously, with parallelism enabled fancy-regex is 10%-15% slower.

This would warrant some further investigation. I'm not so sure about the quality of the benchmark. It uses the Python bindings which incur some additional overhead and it runs the entire pipeline. To investigate the slowdown I'd have to do the benchmark in Rust and check which part of the pipeline is slower.

But for now, I'm happy with just having both backends with the performance difference being "inconclusive" (and both being fast!).

@robinst
Copy link

robinst commented Feb 22, 2021

Awesome, glad to hear :).

On 1., that's a limitation of regex: https://github.com/rust-lang/regex/blob/master/UNICODE.md#rl15-simple-loose-matches

On 2., that seems to be the desired behavior in regex: https://github.com/rust-lang/regex/blob/d5bf98f293b48174d5378471d01c2e0ef271bbbc/tests/unicode.rs#L12

Note that PCRE agrees with regex there:

$ perl -e 'print "matches" if "a" =~ /(?i)\p{Lu}/'
matches

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers P2 Medium priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants