Skip to content

Commit

Permalink
Improve Regex UpdateBumpalong optimization for non-atomic and lazy lo…
Browse files Browse the repository at this point in the history
…ops (#63398)

* Improve handling of UpdateBumpalong for non-atomic loops

For atomic greedy loops, UpdateBumpalong is serving its purpose: upon consuming as much as possible for the loop, base.runtextpos is set to that position so that the next time FindFirstChar runs, it'll start from at least that location.  However, for non-atomic greedy loops, with base.runtextpos being set to the ending position after each backtracking, we end up inadvertently voiding any benefits of the UpdateBumpalong, as we end up overwriting the further position with the shorter position.  A simple tweak to that, only setting the position if it's greater, yields significant benefits, in particular when there's no match.

* Add more tests for lazy loops

These just duplicate the greedy loop tests and tweak them for lazy.

* Insert UpdateBumpalong for lazy loops as well

* Address PR feedback
  • Loading branch information
stephentoub authored Jan 15, 2022
1 parent 513fe28 commit 40047f2
Show file tree
Hide file tree
Showing 5 changed files with 167 additions and 27 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -1815,7 +1815,10 @@ void EmitUpdateBumpalong(RegexNode node)
Debug.Assert(node.Type is RegexNode.UpdateBumpalong, $"Unexpected type: {node.Type}");

TransferSliceStaticPosToPos();
writer.WriteLine("base.runtextpos = pos;");
using (EmitBlock(writer, "if (base.runtextpos < pos)"))
{
writer.WriteLine("base.runtextpos = pos;");
}
}

// Emits code for a concatenation
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -1997,11 +1997,19 @@ void EmitUpdateBumpalong(RegexNode node)
{
Debug.Assert(node.Type is RegexNode.UpdateBumpalong, $"Unexpected type: {node.Type}");

// base.runtextpos = pos;
// if (base.runtextpos < pos)
// {
// base.runtextpos = pos;
// }
TransferSliceStaticPosToPos();
Ldthisfld(s_runtextposField);
Ldloc(pos);
Label skipUpdate = DefineLabel();
Bge(skipUpdate);
Ldthis();
Ldloc(pos);
Stfld(s_runtextposField);
MarkLabel(skipUpdate);
}

// Emits code for a concatenation
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -1101,10 +1101,17 @@ protected override void Go()
case RegexCode.UpdateBumpalong:
// UpdateBumpalong should only exist in the code stream at such a point where the root
// of the backtracking stack contains the runtextpos from the start of this Go call. Replace
// that tracking value with the current runtextpos value.
runtrack![runtrack.Length - 1] = runtextpos;
advance = 0;
continue;
// that tracking value with the current runtextpos value if it's greater.
{
Debug.Assert(!_rightToLeft, "UpdateBumpalongs aren't added for RTL");
ref int trackingpos = ref runtrack![runtrack.Length - 1];
if (trackingpos < runtextpos)
{
trackingpos = runtextpos;
}
advance = 0;
continue;
}

default:
Debug.Fail($"Unimplemented state: {_operator:X8}");
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -380,8 +380,11 @@ internal RegexNode FinalOptimize()
// we can only consider unbounded loops, as to be able to start at the end of the loop we need the loop to have consumed all possible matches;
// otherwise, you could end up with a pattern like "a{1,3}b" matching against "aaaabc", which should match, but if we pre-emptively stop consuming
// after the first three a's and re-start from that position, we'll end up failing the match even though it should have succeeded. We can also
// apply this optimization to non-atomic loops. Even though backtracking could be necessary, such backtracking would be handled within the processing
// of a single starting position.
// apply this optimization to non-atomic loops: even though backtracking could be necessary, such backtracking would be handled within the processing
// of a single starting position. Lazy loops similarly benefit, as a failed match will result in exploring the exact same search space as with
// a greedy loop, just in the opposite order (and a successful match will overwrite the bumpalong position); we need to avoid atomic lazy loops,
// however, as they will only end up as a repeater for the minimum length and thus will effectively end up with a non-infinite upper bound, which
// we've already outlined is problematic.
{
RegexNode node = rootNode.Child(0); // skip implicit root capture node
while (true)
Expand All @@ -394,6 +397,7 @@ internal RegexNode FinalOptimize()
continue;

case Oneloop or Oneloopatomic or Notoneloop or Notoneloopatomic or Setloop or Setloopatomic when node.N == int.MaxValue:
case Onelazy or Notonelazy or Setlazy when node.N == int.MaxValue && !node.IsAtomicByParent():
RegexNode? parent = node.Next;
if (parent != null && parent.Type == Concatenate)
{
Expand Down
Loading

0 comments on commit 40047f2

Please sign in to comment.