NonBacktracking Regex optimizations #102655
Conversation
@dotnet-policy-service agree
(Resolved review thread: .../System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/StateFlags.cs)
Forgot the zero-width non-joiner, so this is not correct; it would otherwise almost double the performance. Making the conditional nullability checks cheaper did help by around 20%, though. One thing I didn't figure out: where should tests for SymbolicRegexMatcher go?
IIRC essentially all the tests for the nonbacktracking engine are also run against the other engines, and involve passing the pattern and text in (i.e., they aren't testing the internals directly). Is that what you're looking to do? BTW, is any of this work potentially relevant to the other engines? (I don't own regex, just an observer here.)
@danmoseley Perhaps Compiled might benefit from the same alphabet compression and character-kind lookup. A source-generated variant of the DFA would be very impressive, though; it'd run circles around all the other engines.
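For readers unfamiliar with the technique, here is a rough sketch of what alphabet compression means (all names here are illustrative, not the actual runtime internals): characters the pattern can never tell apart share one equivalence-class id, so the transition table is indexed by `states × classCount` instead of `states × 65536`.

```csharp
// Illustrative sketch of alphabet compression. For a pattern like
// "[a-z]+[0-9]" there are only three classes:
// 0 = everything else, 1 = [a-z], 2 = [0-9].
class AlphabetCompressionSketch
{
    static byte[] BuildClassMap()
    {
        byte[] classOfChar = new byte[char.MaxValue + 1];
        for (char c = 'a'; c <= 'z'; c++) classOfChar[c] = 1;
        for (char c = '0'; c <= '9'; c++) classOfChar[c] = 2;
        return classOfChar;
    }

    // The transition table shrinks to states * classCount entries,
    // and a step is one class lookup plus one table load.
    static int NextState(int[] transitions, int classCount, byte[] classOfChar, int state, char c)
        => transitions[state * classCount + classOfChar[c]];
}
```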
(quoting runtime/src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/AttRegexTests.cs, lines 47–48 at e8e553b)
Running thousands of regex instances with no GC in between, while still maintaining low memory constraints in the WASM tests, is not easily solved; maybe some of the high-memory annotations like … could be applied.
Now the engine has less per-match overhead as well. The ending Z anchor (`\Z`) is a major pain, preventing lots of optimizations. I added a slight indirection to ASCII-only cases to save memory (128 B instead of 65536 B), but the … Other than that, the results look pretty good.
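A sketch of the ASCII-only indirection being described, as I understand it (illustrative, not the actual field layout): keep a 128-entry table for ASCII input and fall back to a slower classifier for the rare non-ASCII character, instead of materializing a full 65536-entry table.

```csharp
class AsciiLookupSketch
{
    private readonly byte[] _asciiClasses = new byte[128]; // 128 B instead of 65536 B

    private byte GetCharClass(char c)
        => c < 128 ? _asciiClasses[c] : LookupNonAscii(c);

    // Hypothetical slow path for non-ASCII, e.g. a BDD or a binary
    // search over character ranges.
    private byte LookupNonAscii(char c) => 0;
}
```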
Thanks, @ieviev! I will take a look soon.
There are still some things to do here.
Sorry for the delay in getting to this. I much appreciate your submitting it.
A couple of comments/questions...
- In our previous conversations, you mentioned you thought inner vectorization would be possible, where we could vectorize within a match rather than just finding the next starting place. I don't see that in this PR. Is that possible? Seems like that'd be an opportunity for some significant wins.
- I understand from your comments that these changes brought significant wins on some patterns, in particular those involving non-ASCII, which is great. I'm a bit concerned, though, that when running this on our own perf test suite, I'm seeing regressions in various places. You can find the tests in https://github.com/dotnet/performance/tree/main/src/benchmarks/micro/libraries/System.Text.RegularExpressions. Some of the more concerning ones were `\p{Sm}` and `.{2,4}(Tom…`, which regressed throughput by ~50%, and `Holmes.{0,25}(...).{0,25}Holmes...`, which regressed throughput by ~25%. Thoughts?
Thanks!
(Resolved review threads: src/libraries/System.Text.RegularExpressions/src/System.Text.RegularExpressions.csproj; .../Symbolic/MatchingState.cs ×4; .../Symbolic/SymbolicRegexMatcher.cs ×2; tests/FunctionalTests/NonBacktrackingTests.cs; tests/FunctionalTests/System.Text.RegularExpressions.Tests.csproj; tests/UnitTests/SymbolicRegexTests.cs)
Yes, this is possible. Any pattern that contains `.*` could be a lot faster with longer matches. It'd be best to start with inner vectorization without anchors; the presence of anchors makes it more complicated and expensive, but I still think it's possible with anchors as well when followed with an anchor context lookup. It also needs a bit of testing to see where the line between match-time speedup and unwanted compile/construction-time overhead is.
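To make the idea concrete, a minimal sketch of "vectorizing within a match" for the anchor-free case (my illustration, not code from this PR; `SearchValues` and `IndexOfAny` are the real .NET 8+ APIs, the surrounding names are hypothetical): when the current DFA state loops back to itself on every character except a few "exit" characters, as in a `.*` segment, the matcher can skip ahead with one vectorized search instead of stepping char by char.

```csharp
using System;
using System.Buffers;

static class InnerVectorizationSketch
{
    static int SkipSelfLoop(ReadOnlySpan<char> input, int pos, SearchValues<char> exitChars)
    {
        int i = input.Slice(pos).IndexOfAny(exitChars);
        return i < 0 ? input.Length : pos + i; // DFA state is unchanged up to the returned index
    }
}
```

For example, for a `.*` without RegexOptions.Singleline the exit set would be just `SearchValues.Create("\n")`.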
I'll definitely profile these as well. There is some overhead from the way edge cases are currently handled. `\p{Sm}` in particular could be made to skip the reversal part entirely, along with other fixed-length patterns. I'll follow up about this once I've done some more testing.
Now I've had a better look at the dotnet/performance benchmarks as well. Turns out I was wrong about the `\p{Sm}` benchmark. Try running the performance tests again; I think it should be faster across the board now.

I thought about inner vectorization as well, but it doesn't seem like there are many benchmarks involving `.*` or other star loops, and I think creating a separate implementation of RegexFindOptimizations for the automata engine would be far more beneficial. Derivatives have a very concise and useful way to infer prefixes that works for any pattern, so this would replace a lot of custom logic:

```csharp
// ... prefix start; looping back to this or to any already-visited state is redundant
SymbolicRegexNode currentState = /* .. */;
var mergedPrefix = solver.Empty;
foreach (var minterm in minterms)
{
    var derivative = createDerivative(currentState, minterm);
    if (isRedundant(derivative))
        continue;

    // O(1) bitwise merge of the transitions relevant to the prefix
    mergedPrefix = solver.Or(mergedPrefix, minterm);
}
// .. continue the breadth-first search for longer prefixes
return mergedPrefix;
```

There are many optimizations that currently go undetected/unused which could enable AVX in more cases. The AVX scanning and the transitions could be combined as well, e.g. when optimizing for a longer prefix. I wonder if the algorithm choices in the match could be replaced with something as simple as:

```csharp
while (position != end)
{
    // check for success
    // get next state
    position++;
}
```

An optimized DFA will give you a very consistent worst-case throughput of somewhere around 500 MB/s on any pattern, which also accounts for UTF-16 losses on UTF-8 input (this is why the English benchmarks are slower). AVX on benchmarks is a bit of a gamble: it sometimes helps a lot and sometimes does not, and high match-count overhead can sometimes dominate a benchmark as well. There are massive gains from AVX in some cases, but overall I'm not exactly sure where the line is, as it depends on the input text as well. The worst-case performance should be very good on both ASCII and Unicode by now; what kinds of prefix optimizations are worth doing depends on the benchmarks used, or on https://github.com/dotnet/runtime-assets/blob/main/src/System.Text.RegularExpressions.TestData/Regex_RealWorldPatterns.json.

I'll refactor and clean up the PR and comments for a bit now.
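(To make the derivative-based prefix idea above concrete, an illustrative example of my own, not from the PR: for a pattern like `(Tom|Tim)`, the first level of the breadth-first search merges all non-redundant transitions into the single set {'T'}, and the second level into {'o', 'i'}. Those per-position minterm sets can then drive a vectorized scan for candidate start positions, with no pattern-shape-specific logic.)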
Ah, I left an exception in to check whether any pattern has over 255 minterms as well; a pattern from RegexPcreTests did reach it, so a fallback mode is definitely necessary. An int array would be a 256K allocation but a couple of times faster than a BDD, so there's a decision there as well. Would pooling the array be an option?
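If pooling is on the table, `ArrayPool<int>.Shared` is the obvious candidate; a minimal sketch, assuming the map really is one int per char (which is where the ~256K figure comes from) and that it's only needed transiently, e.g. during construction:

```csharp
using System;
using System.Buffers;

// Rent the char -> minterm-id map instead of allocating 65536 ints (~256 KB).
int[] map = ArrayPool<int>.Shared.Rent(65536);
try
{
    // Rent can return a larger, uncleared array, so initialize before use.
    Array.Clear(map, 0, 65536);
    // ... populate the map and build the matcher tables from it ...
}
finally
{
    ArrayPool<int>.Shared.Return(map);
}
```

If the matcher instead holds the map for its whole lifetime, renting wouldn't help, since there's no point at which the array could safely be returned.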
(force-pushed from b259937 to 6bb517a)
Simplification, style consistency, dead code deletion, some bounds-check removal, etc.
(force-pushed from 6bb517a to 0a0f409)
I pushed a commit with a last round of changes.
Otherwise, LGTM.
Thanks!
I've got another round of cleanup/simplification (some of the code in this PR, some pre-existing) that I'm working on and will submit as a separate PR.
I will be happy to provide comments or feedback if needed.
@stephentoub @veanes
Here's the initial list of changes I'd make to the nonbacktracking regex engine. I haven't yet documented much, because I made some design choices that I need to discuss.
Rebar baseline before changes:
After changes:
Update, after more optimization:
Essentially, I fixed the things that hurt performance the most; there are many wins of 300% or more already.
Some fundamental design changes:
There's low-hanging fruit in the hot loop as well. I removed the IsDeadend flag entirely, since there is just one dead-end state, and I stored its state ID instead. The nullability and skipping checks got dedicated short-circuit lookups, as having a nullable position at all (a valid match) is usually rare; something like `beq` or `bgt` could be used as well. Some of the if/else branches in the hot loop should be removed and resolved somewhere above.
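A sketch of the kind of short-circuit lookup described (hypothetical names, not the actual matcher fields): one precomputed flag per state saying whether the state can be nullable in any context at all, so the common not-nullable case costs a single array load and branch.

```csharp
class NullabilityShortCircuitSketch
{
    private readonly bool[] _canBeNullable; // precomputed once per state

    public NullabilityShortCircuitSketch(bool[] canBeNullable) => _canBeNullable = canBeNullable;

    public bool IsNullableAt(int stateId, uint context)
        => _canBeNullable[stateId] && IsNullableFor(stateId, context);

    // Hypothetical slow path: the full context-dependent nullability check.
    private bool IsNullableFor(int stateId, uint context) => false;
}
```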
There's a lot of performance still on the table when compared to resharp; dedicated DFA prefilter optimizations could help a lot as well.
Let me know what you think of the changes. I think the nonbacktracking engine could be made a lot faster than Compiled in most scenarios, and the symbolic-automata benefits should be used a lot more aggressively.