-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Several improvements / simplifications in Regex #100315
Conversation
This started out as a small improvement for one thing and grew to be something else. Initially, my intent was just to improve how `SearchValues<char>` applies to character classes with subtraction. Character class subtraction isn't frequently used, but it is a convenient way to express removing subsets of ranges, e.g. all ASCII other than digits `[\u0000-\u007F-[0-9]]`. Currently when we go to enumerate the characters in a char class, for perf reasons we only do the enumeration if we can enumerate sets and up to the max space provided, in order to keep the time down. We immediately give up if the char class has subtraction, but given that we've already limited how many values we're enumerating, if there is subtraction we can afford to query for just those chars that would otherwise pass in order to enable the subtraction. So, with this PR, we can now support using SearchValues in this manner: **this means that whereas previously we would have generated an IndexOfAny for any of the ASCII characters or anything non-ASCII, then with a fallback for if we hit something non-ASCII, now we'll just create an IndexOfAny for the full set**. However, that triggered a (then defunct) assert which led me to see that we have a bunch of duplicated logic around asserts: we'd frequently be checking to see if a set contained at most 5 chars (in support of a time when we didn't have SearchValues and only optimized IndexOfAny for up to 5 chars) and then subsequently would see if it contained only ASCII. We no longer need that separation, especially since SearchValues will now both vectorize probabilistic map searches and will first do a search for the ASCII portion (or anything non-ASCII). **This then means we can delete a variety of duplicated code while also expanding what we recognize for use with SearchValues.** This then lead to seeing that in a variety of places we compute the set of chars in a set and then check whether it could instead be satisfied just as a range but not if the set of chars is small. The former check is more expensive than the latter, but we were doing the first one first presumably in order to be able to do the set size check as part of the latter. However, we don't need it for that, as a single subtraction gives us the size of the range, **so we can just do the range check first and skip the more expensive set check if it's not needed.** That then led to seeing that we're not using range-based searching in the interpreter or non-backtracking engines. **This adds that support, such that the interpreter/non-backtracking engines will now search for the next starting location using IndexOfAny{Except}InRange if appropriate.**.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
since SearchValues will now both vectorize probabilistic map searches and will first do a search for the ASCII portion
The probabilistic map currently only has a vectorized path for IndexOfAny
.
We could add LastIndexOfAny
pretty easily (the interesting code would be shared with IndexOfAny
). I take it this change might make that more important?
We don't, however, have a vectorized fallback for AnyExcept
after falling off the Ascii fast path (no "inverse bloom filter").
Currently, we'll fall back to an O(i * m)
loop for that. We could make it O(i)
, but it'd still be char-by-char.
Is that a concern here?
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Outdated
Show resolved
Hide resolved
…r.Emitter.cs Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com>
Yeah, that'd be useful to add in, in particular if we can do it without adding in much code.
Why would it be |
The single-character check when using the probabilistic map is currently |
Ok, I hadn't realized that (or maybe I knew it at one point and forgot). It'd be good to switch it to using a set lookup. |
I just wanted y'all to know I love reading commit notes like this! |
Possible related Mono regressions: |
The regressions mostly look to be places we're now using while ((uint)iteration < (uint)slice.Length && ((ch = slice[iteration]) < 128 ? ("\0\0怀Ͽ\ufffe蟿\ufffe\u07ff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\f\0-/0:A[_`a{KÅ")))
{
iteration++;
} is now generated to be: int iteration = slice.IndexOfAnyExcept(Utilities.s_nonAscii_2D303132333435363738394142434445464748494A4B4C4D4E4F505152535455565758595A6162636465666768696A6B6C6D6E6F707172737475767778797AE284AA);
if (iteration < 0)
{
iteration = slice.Length;
}
...
internal static readonly SearchValues<char> s_nonAscii_2D303132333435363738394142434445464748494A4B4C4D4E4F505152535455565758595A6162636465666768696A6B6C6D6E6F707172737475767778797AE284AA = SearchValues.Create("-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK"); @MihaZupan, will this hit problematic paths? (Note the non-ASCII Kelvin sign at the end there.) |
Yes, If have have explicit support for ISAs ( I've been playing around with a few options here (e.g. perfect hash / lookup table ...), I'll still have to throw a few more hours at it though. |
E.g. for
|
"pr" here is with your upcoming fixes? |
Yeah, should be on that order of magnitude |
* Several improvements / simplifications in Regex This started out as a small improvement for one thing and grew to be something else. Initially, my intent was just to improve how `SearchValues<char>` applies to character classes with subtraction. Character class subtraction isn't frequently used, but it is a convenient way to express removing subsets of ranges, e.g. all ASCII other than digits `[\u0000-\u007F-[0-9]]`. Currently when we go to enumerate the characters in a char class, for perf reasons we only do the enumeration if we can enumerate sets and up to the max space provided, in order to keep the time down. We immediately give up if the char class has subtraction, but given that we've already limited how many values we're enumerating, if there is subtraction we can afford to query for just those chars that would otherwise pass in order to enable the subtraction. So, with this PR, we can now support using SearchValues in this manner: **this means that whereas previously we would have generated an IndexOfAny for any of the ASCII characters or anything non-ASCII, then with a fallback for if we hit something non-ASCII, now we'll just create an IndexOfAny for the full set**. However, that triggered a (then defunct) assert which led me to see that we have a bunch of duplicated logic around asserts: we'd frequently be checking to see if a set contained at most 5 chars (in support of a time when we didn't have SearchValues and only optimized IndexOfAny for up to 5 chars) and then subsequently would see if it contained only ASCII. We no longer need that separation, especially since SearchValues will now both vectorize probabilistic map searches and will first do a search for the ASCII portion (or anything non-ASCII). **This then means we can delete a variety of duplicated code while also expanding what we recognize for use with SearchValues.** This then lead to seeing that in a variety of places we compute the set of chars in a set and then check whether it could instead be satisfied just as a range but not if the set of chars is small. The former check is more expensive than the latter, but we were doing the first one first presumably in order to be able to do the set size check as part of the latter. However, we don't need it for that, as a single subtraction gives us the size of the range, **so we can just do the range check first and skip the more expensive set check if it's not needed.** That then led to seeing that we're not using range-based searching in the interpreter or non-backtracking engines. **This adds that support, such that the interpreter/non-backtracking engines will now search for the next starting location using IndexOfAny{Except}InRange if appropriate.**. * Update src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com> --------- Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com>
This started out as a small improvement for one thing and grew to be something else.
Initially, my intent was just to improve how
SearchValues<char>
applies to character classes with subtraction. Character class subtraction isn't frequently used, but it is a convenient way to express removing subsets of ranges, e.g. all ASCII other than digits[\u0000-\u007F-[0-9]]
. Currently when we go to enumerate the characters in a char class, for perf reasons we only do the enumeration if we can enumerate sets and up to the max space provided, in order to keep the time down. We immediately give up if the char class has subtraction, but given that we've already limited how many values we're enumerating, if there is subtraction we can afford to query for just those chars that would otherwise pass in order to enable the subtraction. So, with this PR, we can now support using SearchValues in this manner: this means that whereas previously we would have generated an IndexOfAny for any of the ASCII characters or anything non-ASCII, then with a fallback for if we hit something non-ASCII, now we'll just create an IndexOfAny for the full set.However, that triggered a (then defunct) assert which led me to see that we have a bunch of duplicated logic around asserts: we'd frequently be checking to see if a set contained at most 5 chars (in support of a time when we didn't have SearchValues and only optimized IndexOfAny for up to 5 chars) and then subsequently would see if it contained only ASCII. We no longer need that separation, especially since SearchValues will now both vectorize probabilistic map searches and will first do a search for the ASCII portion (or anything non-ASCII). This then means we can delete a variety of duplicated code while also expanding what we recognize for use with SearchValues.
This then lead to seeing that in a variety of places we compute the set of chars in a set and then check whether it could instead be satisfied just as a range but not if the set of chars is small. The former check is more expensive than the latter, but we were doing the first one first presumably in order to be able to do the set size check as part of the latter. However, we don't need it for that, as a single subtraction gives us the size of the range, so we can just do the range check first and skip the more expensive set check if it's not needed.
That then led to seeing that we're not using range-based searching in the interpreter or non-backtracking engines. This adds that support, such that the interpreter/non-backtracking engines will now search for the next starting location using IndexOfAny{Except}InRange if appropriate..