Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Captures support for NonBacktracking #65129

Merged
merged 29 commits into from
Feb 12, 2022

Conversation

olsaarik
Copy link
Contributor

This adds captures support for RegexOptions.NonBacktracking by replacing the third phase of the match generation algorithm with an NFA simulation that includes additional effects to record capture group starts and ends on transitions. Multiple captures of a single group are not supported, only the last one is.

The support relies on a new variant of the symbolic version of the Antimirov derivative that only includes the parts of the pattern that the backtracking engine would visit before hitting a nullable path (one that doesn't consume any characters). This in conjunction with maintaining a prioritized set of states in the NFA simulation allows faithful reproduction of the match generation semantics that the backtracking engines have.

The captures support does incur a slowdown, which is sometimes significant, since NFA simulation with application of effects is significantly more heavy weight than DFA simulation. Currently this cost is also paid when no subcaptures are needed, which can be avoided in the future by modifying the existing third phase to use the new prioritized "eager" derivative. A Brzozowski variant of it would be relatively straightforward to create.

One optimization that is currently missing, is how effects associated with finalizing a match are applied. They are currently interpreted from the SymbolicRegexNode tree. These should either be cached or delayed until the match is actually decided to avoid paying the cost multiple times.

I've gone through all the tests and enabled everything relying on subcaptures that I could find.

@ghost
Copy link

ghost commented Feb 10, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This adds captures support for RegexOptions.NonBacktracking by replacing the third phase of the match generation algorithm with an NFA simulation that includes additional effects to record capture group starts and ends on transitions. Multiple captures of a single group are not supported, only the last one is.

The support relies on a new variant of the symbolic version of the Antimirov derivative that only includes the parts of the pattern that the backtracking engine would visit before hitting a nullable path (one that doesn't consume any characters). This in conjunction with maintaining a prioritized set of states in the NFA simulation allows faithful reproduction of the match generation semantics that the backtracking engines have.

The captures support does incur a slowdown, which is sometimes significant, since NFA simulation with application of effects is significantly more heavy weight than DFA simulation. Currently this cost is also paid when no subcaptures are needed, which can be avoided in the future by modifying the existing third phase to use the new prioritized "eager" derivative. A Brzozowski variant of it would be relatively straightforward to create.

One optimization that is currently missing, is how effects associated with finalizing a match are applied. They are currently interpreted from the SymbolicRegexNode tree. These should either be cached or delayed until the match is actually decided to avoid paying the cost multiple times.

I've gone through all the tests and enabled everything relying on subcaptures that I could find.

Author: olsaarik
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@stephentoub
Copy link
Member

Thanks, @olsaarik!

The performance hit in most cases pretty reasonable

Is it a fair assumption with the most recent round of changes, if there aren't any captures / if someone specifies ExplicitCaptures, that additional overhead evaporate? So effectively if someone wanted the old behavior and perf, they can still get it?

@olsaarik
Copy link
Contributor Author

Yes, exactly. I had to add those groups to the Sherlock tests to get the performance hit.

@olsaarik
Copy link
Contributor Author

@veanes Something I found I needed to add in addition to the loop subsumption/combining optimization was ordered deduplication such that entries don't get repeated. I'm guessing that was causing a significant performance hit before I added it (with things blowing up).

@veanes
Copy link
Contributor

veanes commented Feb 11, 2022

@olsaarik I think getting rid of unordered-or and the related data-structure that uses HashSets and Dictionaries, (as we discussed earlier) and instead normalizes the ordered-or say using hashcodes and removes duplicates would benefit both ordered and unordered cases. In unordered cases simply more aggressive simplifications would be used but e.g. removing duplicates would be done in both cases. The underlying AST node would just be the un-ordered-or. I am not suggesting to do it this time around, but maybe for the next round of edits, because this is a bigger edit.

This avoids some repeated allocations in the capturing mode.
@olsaarik
Copy link
Contributor Author

@veanes I agree, the unordered nodes would just additionally sort with the hash code oslt. The deduplication is already in, so I think making that change is mostly just removing code all around. I agree lets leave it for after this PR.

@@ -220,6 +224,11 @@ bool WithCache(uint context)
is_nullable = _alts.IsNullableFor(context);
break;

case SymbolicRegexKind.OrderedOr:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing that is for sure something that should be done outside of this PR, but we should probably should unify with the none-symbolic engines and use RegexNodeKind enum for these instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea. The RegexNodeKind has some entries that aren't valid for SymbolicRegexNode, so some extra checks on the symbolic side might be in order, and some new node kinds in RegexNodeKind would be required.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some overlap but also some differences between RegexNodeKind and SymbolicRegexKind,
(e.g. Not, And, if they are going to remain, but also other cases such as EndAnchorZRev that is EndAnchorZ when reversed, WatchDog, and some cases in RegexNodeKind that are not distinguished in SymbolicRegexKind concerning for example different kinds of loops and their loop bodies)

[ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.IsNetCore))]
public void SRMTest_ConjuctionIsMatch()
//[ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.IsNetCore))]
private void SRMTest_ConjuctionIsMatch()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the deal with all of these commented out tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are tests that were added for the unexposed conjunction and negation support, but which the captures support breaks. We're not quite sure what the semantics for conjunction and negation should be with captures, so it was easiest to disable the tests for now.

@stephentoub
Copy link
Member

Thanks, @olsaarik. Let's get this merged, and we can subsequently address style, comments, remaining perf concerns, etc.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants