-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Captures support for NonBacktracking #65129
Conversation
Re-enable replacement patterns. Fix eager derivative. Fix capture numbering to work with sparsely numbered groups.
Also take any subcaptures out of RegexExperiment intersection and negation tests, as capture semantics with these are not correct yet.
This ensures the correct length matches always.
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsThis adds captures support for The support relies on a new variant of the symbolic version of the Antimirov derivative that only includes the parts of the pattern that the backtracking engine would visit before hitting a nullable path (one that doesn't consume any characters). This in conjunction with maintaining a prioritized set of states in the NFA simulation allows faithful reproduction of the match generation semantics that the backtracking engines have. The captures support does incur a slowdown, which is sometimes significant, since NFA simulation with application of effects is significantly more heavy weight than DFA simulation. Currently this cost is also paid when no subcaptures are needed, which can be avoided in the future by modifying the existing third phase to use the new prioritized "eager" derivative. A Brzozowski variant of it would be relatively straightforward to create. One optimization that is currently missing, is how effects associated with finalizing a match are applied. They are currently interpreted from the SymbolicRegexNode tree. These should either be cached or delayed until the match is actually decided to avoid paying the cost multiple times. I've gone through all the tests and enabled everything relying on subcaptures that I could find.
|
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs
Show resolved
Hide resolved
...stem.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/DerivativeEffect.cs
Outdated
Show resolved
Hide resolved
...stem.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/DerivativeEffect.cs
Outdated
Show resolved
Hide resolved
...stem.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/DfaMatchingState.cs
Show resolved
Hide resolved
...s/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SparseIntMap.cs
Outdated
Show resolved
Hide resolved
...s/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SparseIntMap.cs
Outdated
Show resolved
Hide resolved
...s/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SparseIntMap.cs
Outdated
Show resolved
Hide resolved
...s/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SparseIntMap.cs
Outdated
Show resolved
Hide resolved
...s/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SparseIntMap.cs
Outdated
Show resolved
Hide resolved
...s/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SparseIntMap.cs
Outdated
Show resolved
Hide resolved
....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexBuilder.cs
Outdated
Show resolved
Hide resolved
Thanks, @olsaarik!
Is it a fair assumption with the most recent round of changes, if there aren't any captures / if someone specifies ExplicitCaptures, that additional overhead evaporate? So effectively if someone wanted the old behavior and perf, they can still get it? |
Yes, exactly. I had to add those groups to the Sherlock tests to get the performance hit. |
@veanes Something I found I needed to add in addition to the loop subsumption/combining optimization was ordered deduplication such that entries don't get repeated. I'm guessing that was causing a significant performance hit before I added it (with things blowing up). |
@olsaarik I think getting rid of unordered-or and the related data-structure that uses HashSets and Dictionaries, (as we discussed earlier) and instead normalizes the ordered-or say using hashcodes and removes duplicates would benefit both ordered and unordered cases. In unordered cases simply more aggressive simplifications would be used but e.g. removing duplicates would be done in both cases. The underlying AST node would just be the un-ordered-or. I am not suggesting to do it this time around, but maybe for the next round of edits, because this is a bigger edit. |
This avoids some repeated allocations in the capturing mode.
@veanes I agree, the unordered nodes would just additionally sort with the hash code oslt. The deduplication is already in, so I think making that change is mostly just removing code all around. I agree lets leave it for after this PR. |
src/libraries/System.Text.RegularExpressions/src/System.Text.RegularExpressions.csproj
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs
Outdated
Show resolved
Hide resolved
...stem.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/DfaMatchingState.cs
Show resolved
Hide resolved
...gularExpressions/src/System/Text/RegularExpressions/Symbolic/RegexNodeToSymbolicConverter.cs
Outdated
Show resolved
Hide resolved
.../System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicMatch.cs
Show resolved
Hide resolved
....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexBuilder.cs
Show resolved
Hide resolved
....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs
Outdated
Show resolved
Hide resolved
@@ -220,6 +224,11 @@ bool WithCache(uint context) | |||
is_nullable = _alts.IsNullableFor(context); | |||
break; | |||
|
|||
case SymbolicRegexKind.OrderedOr: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing that is for sure something that should be done outside of this PR, but we should probably should unify with the none-symbolic engines and use RegexNodeKind
enum for these instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good idea. The RegexNodeKind
has some entries that aren't valid for SymbolicRegexNode
, so some extra checks on the symbolic side might be in order, and some new node kinds in RegexNodeKind
would be required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some overlap but also some differences between RegexNodeKind
and SymbolicRegexKind
,
(e.g. Not
, And
, if they are going to remain, but also other cases such as EndAnchorZRev
that is EndAnchorZ
when reversed, WatchDog
, and some cases in RegexNodeKind
that are not distinguished in SymbolicRegexKind
concerning for example different kinds of loops and their loop bodies)
...RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexRunnerFactory.cs
Outdated
Show resolved
Hide resolved
[ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.IsNetCore))] | ||
public void SRMTest_ConjuctionIsMatch() | ||
//[ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.IsNetCore))] | ||
private void SRMTest_ConjuctionIsMatch() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the deal with all of these commented out tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are tests that were added for the unexposed conjunction and negation support, but which the captures support breaks. We're not quite sure what the semantics for conjunction and negation should be with captures, so it was easiest to disable the tests for now.
Thanks, @olsaarik. Let's get this merged, and we can subsequently address style, comments, remaining perf concerns, etc. |
This adds captures support for
RegexOptions.NonBacktracking
by replacing the third phase of the match generation algorithm with an NFA simulation that includes additional effects to record capture group starts and ends on transitions. Multiple captures of a single group are not supported, only the last one is.The support relies on a new variant of the symbolic version of the Antimirov derivative that only includes the parts of the pattern that the backtracking engine would visit before hitting a nullable path (one that doesn't consume any characters). This in conjunction with maintaining a prioritized set of states in the NFA simulation allows faithful reproduction of the match generation semantics that the backtracking engines have.
The captures support does incur a slowdown, which is sometimes significant, since NFA simulation with application of effects is significantly more heavy weight than DFA simulation. Currently this cost is also paid when no subcaptures are needed, which can be avoided in the future by modifying the existing third phase to use the new prioritized "eager" derivative. A Brzozowski variant of it would be relatively straightforward to create.
One optimization that is currently missing, is how effects associated with finalizing a match are applied. They are currently interpreted from the SymbolicRegexNode tree. These should either be cached or delayed until the match is actually decided to avoid paying the cost multiple times.
I've gone through all the tests and enabled everything relying on subcaptures that I could find.