Enable regex source gen / compiler vectorization of all sets #83992
When emitting TryFindNextPossibleStartingPosition based on a set that's a fixed distance from the start of the pattern, today we're able to vectorize a variety of cases. In .NET 7, this was largely limited to very small sets that contained only 1-5 characters. Earlier in .NET 8, we expanded this to include ranges (e.g. `[0-9]`) and, more impactfully, sets containing any number of only ASCII values (e.g. `[A-Za-z0-9]`). That still, however, leaves sets that are non-contiguous and contain anything non-ASCII, which is extremely common. For example, the set `[a-z]` with `RegexOptions.IgnoreCase` actually gets expanded to `[A-Za-z\u212A]` because the Kelvin sign is considered case-equivalent with `k`, and that means for such a set we wouldn't employ any vectorized helper and would instead just walk character by character.

With this PR, we'll now employ some vectorization for that remaining class of sets as well in the compiler and source generator. In particular, we vectorize a search for the ASCII portion of the set along with anything non-ASCII. If the found result is ASCII, we're done. If the found result is non-ASCII, then we proceed to walk character by character as was done previously.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Net.Http;
using System.Text.RegularExpressions;

public partial class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    // Input text is downloaded once, when the type is initialized.
    private static readonly string s_input = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;

    [GeneratedRegex(@"\s+")] private static partial Regex FindWhitespace();
    [GeneratedRegex(@"\d+")] private static partial Regex FindNumbers();
    [GeneratedRegex(@"\b\w+\b")] private static partial Regex FindWords();
    [GeneratedRegex(@"\p{P}")] private static partial Regex FindPunctuation();

    [Benchmark] public int CountWhitespace() => FindWhitespace().Count(s_input);
    [Benchmark] public int CountNumbers() => FindNumbers().Count(s_input);
    [Benchmark] public int CountWords() => FindWords().Count(s_input);
    [Benchmark] public int CountPunctuation() => FindPunctuation().Count(s_input);
}
```

Before (source gen):
After (source gen):
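To make the two-phase search from the description concrete, here is a minimal sketch for the set `[A-Za-z\u212A]`. It is an illustration under assumptions, not the code the source generator emits: the names `SetSearchSketch`, `IndexOfSetMember`, `IsInSet`, and `s_asciiNonMembers` are hypothetical, and it assumes .NET 8 for `SearchValues<char>`. The vectorized step skips every ASCII character that is not in the set, so it stops at the first ASCII member or the first non-ASCII character of any kind.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class SetSearchSketch
{
    // Hypothetical helper data: every ASCII character that is NOT in [A-Za-z\u212A].
    private static readonly SearchValues<char> s_asciiNonMembers =
        SearchValues.Create(BuildAsciiNonMembers());

    private static string BuildAsciiNonMembers()
    {
        var sb = new StringBuilder();
        for (char c = (char)0; c < 128; c++)
        {
            if (!IsInSet(c)) sb.Append(c);
        }
        return sb.ToString();
    }

    // Scalar membership test for the full set, including the Kelvin sign.
    private static bool IsInSet(char c) => char.IsAsciiLetter(c) || c == '\u212A';

    public static int IndexOfSetMember(ReadOnlySpan<char> span)
    {
        // Vectorized step: stop at the first ASCII member of the set
        // or at the first non-ASCII character, whichever comes first.
        int i = span.IndexOfAnyExcept(s_asciiNonMembers);
        if ((uint)i < (uint)span.Length)
        {
            // An ASCII hit is necessarily a member of the set, so we're done.
            if (char.IsAscii(span[i])) return i;

            // A non-ASCII hit may or may not be a member: fall back to
            // checking character by character from here on.
            do
            {
                if (IsInSet(span[i])) return i;
                i++;
            }
            while (i < span.Length);
        }
        return -1;
    }
}
```

With this sketch, `SetSearchSketch.IndexOfSetMember("123 abc")` returns 4 from the vectorized step alone, while `SetSearchSketch.IndexOfSetMember("12\u00E9\u212A")` stops at the `é`, falls back to the scalar walk, and returns 3 for the Kelvin sign. The trade-off is that once a non-ASCII character is seen, the rest of the search is scalar, so the benefit is largest on mostly-ASCII input.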
This is great!
5dd0384 to 75c6b2d
All feedback addressed.
LGTM. This is great @stephentoub, Thanks!