Several improvements / simplifications in Regex (#100315)
* Several improvements / simplifications in Regex
This started out as a small improvement for one thing and grew to be something else.
Initially, my intent was just to improve how `SearchValues<char>` applies to character classes with subtraction. Character class subtraction isn't frequently used, but it is a convenient way to express removing subsets of ranges, e.g. all ASCII other than digits: `[\u0000-\u007F-[0-9]]`. Currently, when we go to enumerate the characters in a char class, for perf reasons we only do the enumeration for sets we can enumerate, and only up to the max space provided, in order to keep the time down. We immediately give up if the char class has subtraction, but given that we've already limited how many values we're enumerating, if there is subtraction we can afford to query for just those chars that would otherwise pass in order to apply the subtraction. So, with this PR, we can now support using SearchValues in this manner: **this means that whereas previously we would have generated an IndexOfAny for any of the ASCII characters or anything non-ASCII, with a fallback for when we hit something non-ASCII, now we'll just create an IndexOfAny for the full set**.
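As a rough illustration of the idea (not the actual Regex internals), the sketch below enumerates the base range of `[\u0000-\u007F-[0-9]]`, drops the chars removed by the subtracted class, and hands the full remaining set to `SearchValues.Create`. The type `SubtractionSketch` and its members are hypothetical names introduced only for this example.

```csharp
using System;
using System.Buffers;
using System.Text.RegularExpressions;

public static class SubtractionSketch
{
    // The class from the text: all ASCII characters other than the digits.
    public static readonly Regex AsciiNonDigit = new Regex(@"[\u0000-\u007F-[0-9]]");

    // The full enumerated set, built once and reused for searching.
    public static readonly SearchValues<char> Values = BuildAsciiNonDigitValues();

    // Hypothetical helper (an assumption, not the library's implementation):
    // walk the bounded base range and keep only the chars that the
    // subtracted class [0-9] does not remove.
    public static SearchValues<char> BuildAsciiNonDigitValues()
    {
        Span<char> buffer = stackalloc char[128];
        int count = 0;
        for (int c = 0x00; c <= 0x7F; c++)
        {
            bool subtracted = c >= '0' && c <= '9';
            if (!subtracted)
            {
                buffer[count++] = (char)c;
            }
        }

        ReadOnlySpan<char> chars = buffer.Slice(0, count);
        return SearchValues.Create(chars);
    }

    // A single IndexOfAny over the full set, with no separate non-ASCII fallback.
    public static int IndexOfFirstMatch(ReadOnlySpan<char> input) => input.IndexOfAny(Values);
}
```

For the input `"123a45"`, both `SubtractionSketch.IndexOfFirstMatch` and `SubtractionSketch.AsciiNonDigit.Match(...).Index` should agree on the first non-digit ASCII char.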
However, that triggered a (then defunct) assert which led me to see that we have a bunch of duplicated logic around asserts: we'd frequently be checking to see if a set contained at most 5 chars (in support of a time when we didn't have SearchValues and only optimized IndexOfAny for up to 5 chars) and then subsequently would see if it contained only ASCII. We no longer need that separation, especially since SearchValues will now both vectorize probabilistic map searches and will first do a search for the ASCII portion (or anything non-ASCII). **This then means we can delete a variety of duplicated code while also expanding what we recognize for use with SearchValues.**
This then led to seeing that in a variety of places we compute the set of chars in a set and then check whether it could instead be satisfied just as a range, but not if the set of chars is small. The set computation is more expensive than the range check, but we were doing it first, presumably in order to be able to do the set size check as part of the range check. However, we don't need it for that, as a single subtraction gives us the size of the range, **so we can just do the range check first and skip the more expensive set check if it's not needed.**
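The ordering can be sketched with a simplified model in which a set is a list of inclusive ranges. Everything here is hypothetical and for illustration only: `RangeFirstSketch`, `SearchStrategy`, and the `SmallSetThreshold` value are assumptions, not names or numbers taken from the Regex source.

```csharp
using System;
using System.Collections.Generic;

public enum SearchStrategy { Range, SetOfChars }

public static class RangeFirstSketch
{
    // Assumed threshold for illustration; the real value is not stated in the text.
    public const int SmallSetThreshold = 3;

    public static SearchStrategy Choose(IReadOnlyList<(char Low, char High)> ranges)
    {
        // Cheap check first: a single range's size is one subtraction away,
        // so no enumeration of the set's chars is needed to decide this.
        if (ranges.Count == 1)
        {
            int rangeSize = ranges[0].High - ranges[0].Low + 1;
            if (rangeSize > SmallSetThreshold)
            {
                return SearchStrategy.Range;
            }
        }

        // Only now would the more expensive enumeration of set chars be done.
        return SearchStrategy.SetOfChars;
    }
}
```

Under these assumptions, `[a-z]` is treated as a range, while `[a-b]` (two chars) and `[ax]` (not a single range) fall through to the set path.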
That then led to seeing that we're not using range-based searching in the interpreter or non-backtracking engines. **This adds that support, such that the interpreter/non-backtracking engines will now search for the next starting location using IndexOfAny{Except}InRange if appropriate.**
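For reference, this is what the range-based span helpers look like when called directly. The wrapper type `RangeSearchSketch` and its method names are hypothetical; only `IndexOfAnyInRange` and `IndexOfAnyExceptInRange` are the real `MemoryExtensions` methods. How the engines decide when a range is "appropriate" is not shown here.

```csharp
using System;

public static class RangeSearchSketch
{
    // Next position that could start a match of a class like [0-9].
    public static int FindNextDigit(ReadOnlySpan<char> input) =>
        input.IndexOfAnyInRange('0', '9');

    // Next position that could start a match of a negated class like [^0-9].
    public static int FindNextNonDigit(ReadOnlySpan<char> input) =>
        input.IndexOfAnyExceptInRange('0', '9');
}
```

Both helpers return -1 when no char in the input qualifies, which an engine can treat as "no further starting location".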
* Update src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com>
---------
Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com>
subsequent?.FindStartingLiteral(4) is RegexNode.StartingLiteralData literal && // 5 == max efficiently optimized by IndexOfAny, and we need to reserve 1 for node.Ch
(literal.Range.LowInclusive <= node.Ch && node.Ch <= literal.Range.HighInclusive))) // for ranges, only allow when the range overlaps with the target, since there's no accelerated way to search for the union
{
@@ -3546,18 +3547,6 @@ literal.SetChars is not null ||
(literal.Range.LowInclusive <= node.Ch && node.Ch <= literal.Range.HighInclusive))) // for ranges, only allow when the range overlaps with the target, since there's no accelerated way to search for the union
{
@@ -3660,18 +3659,6 @@ literal.SetChars is not null ||
break;
}
}
- else if (literal.AsciiChars is not null) // set of only ASCII characters
Debug.Fail("We should never get here. This method should only be called if CanEmitIndexOf returned true, and all of the same cases should be covered.");