
Enable regex source gen / compiler vectorization of all sets #83992


Merged: 2 commits, Mar 28, 2023


stephentoub

When emitting TryFindNextPossibleStartingPosition based on a set that's a fixed distance from the start of the pattern, today we're able to vectorize a variety of cases. In .NET 7, this was largely limited to very small sets that contained only 1-5 characters. Earlier in .NET 8, we expanded this to include ranges (e.g. [0-9]) and, more impactfully, sets containing any number of only ASCII values (e.g. [A-Za-z0-9]). That still, however, leaves sets that are non-contiguous and contain anything non-ASCII, which is extremely common. For example, the set [a-z] with RegexOptions.IgnoreCase actually gets expanded to [A-Za-z\u212A] because the Kelvin sign is considered case-equivalent with k. For such a set, we wouldn't employ any vectorized helper and would instead just walk character by character.
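A small sketch of the Kelvin sign case described above. The class and method names are hypothetical and exist only for illustration; RegexOptions.CultureInvariant is added here as an assumption so the check does not depend on the machine's current culture. Per the description, passing the Kelvin sign (U+212A) should report a match, since it is treated as case-equivalent with k.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the PR: checks whether a single-character
// input matches [a-z] once IgnoreCase has expanded the set.
public static class KelvinCaseSketch
{
    public static bool MatchesLowercaseRangeIgnoringCase(string input) =>
        Regex.IsMatch(
            input,
            "^[a-z]$",
            // CultureInvariant is an assumption of this sketch, used so the
            // case-equivalence lookup does not vary with the current culture.
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
}
```

Calling KelvinCaseSketch.MatchesLowercaseRangeIgnoringCase("\u212A") exercises the non-ASCII member of the expanded set, which is the character that previously prevented vectorization.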

With this PR, we'll now employ some vectorization for that remaining class of sets as well, in both the compiler and the source generator. In particular, we vectorize a search for either a character in the ASCII portion of the set or any non-ASCII character. If the found result is ASCII, we're done. If the found result is non-ASCII, then we proceed to walk character by character as was done previously.
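The strategy in the paragraph above can be sketched roughly as follows for the set [A-Za-z\u212A]. This is an illustration under stated assumptions, not the code the compiler or source generator actually emits: the class, field, and method names are hypothetical, and the sketch relies on the .NET 8 APIs SearchValues.Create and IndexOfAnyExcept.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical sketch of "vectorize for ASCII members or anything non-ASCII,
// then fall back to a scalar walk". Not the PR's generated code.
public static class SetSearchSketch
{
    // Every ASCII character that is NOT in the set. Skipping past these with
    // IndexOfAnyExcept stops at an ASCII set member or at any non-ASCII char.
    private static readonly SearchValues<char> s_asciiNonMembers =
        SearchValues.Create(BuildAsciiNonMembers());

    private static string BuildAsciiNonMembers()
    {
        var sb = new StringBuilder();
        for (char c = '\0'; c < 128; c++)
        {
            if (!char.IsAsciiLetter(c)) sb.Append(c);
        }
        return sb.ToString();
    }

    // Full membership test for [A-Za-z\u212A], used only on the slow path.
    private static bool IsInSet(char c) => char.IsAsciiLetter(c) || c == '\u212A';

    public static int IndexOfSetMember(ReadOnlySpan<char> span)
    {
        // Vectorized step: first char that is an ASCII member or non-ASCII.
        int i = span.IndexOfAnyExcept(s_asciiNonMembers);
        if (i < 0) return -1;

        // ASCII hit: it must be a set member, so we're done.
        if (char.IsAscii(span[i])) return i;

        // Non-ASCII hit: walk character by character from here.
        for (; i < span.Length; i++)
        {
            if (IsInSet(span[i])) return i;
        }
        return -1;
    }
}
```

For mostly-ASCII input the vectorized call skips long runs of non-matching characters; the scalar loop only starts once a non-ASCII character is encountered, so text with many non-ASCII characters would see less benefit.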

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Net.Http;
using System.Text.RegularExpressions;

public partial class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private static readonly string s_input = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;

    [GeneratedRegex(@"\s+")] private static partial Regex FindWhitespace();
    [GeneratedRegex(@"\d+")] private static partial Regex FindNumbers();
    [GeneratedRegex(@"\b\w+\b")] private static partial Regex FindWords();
    [GeneratedRegex(@"\p{P}")] private static partial Regex FindPunctuation();

    [Benchmark] public int CountWhitespace() => FindWhitespace().Count(s_input);
    [Benchmark] public int CountNumbers() => FindNumbers().Count(s_input);
    [Benchmark] public int CountWords() => FindWords().Count(s_input);
    [Benchmark] public int CountPunctuation() => FindPunctuation().Count(s_input);
}

Before (source gen):

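| Method           |      Mean |    Error |   StdDev |
|----------------- |----------:|---------:|---------:|
| CountWhitespace  | 125.06 ms | 0.407 ms | 0.340 ms |
| CountNumbers     |  15.25 ms | 0.027 ms | 0.025 ms |
| CountWords       | 145.53 ms | 2.324 ms | 2.174 ms |
| CountPunctuation |  37.27 ms | 0.336 ms | 0.263 ms |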

After (source gen):

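| Method           |       Mean |     Error |    StdDev |
|----------------- |-----------:|----------:|----------:|
| CountWhitespace  |  98.698 ms | 0.7504 ms | 0.6652 ms |
| CountNumbers     |   2.520 ms | 0.0158 ms | 0.0124 ms |
| CountWords       | 146.225 ms | 2.9665 ms | 2.7748 ms |
| CountPunctuation |  21.003 ms | 0.0400 ms | 0.0354 ms |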

ghost commented Mar 27, 2023

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: stephentoub
Assignees: -
Labels: area-System.Text.RegularExpressions, tenet-performance

Milestone: 8.0.0


@MihaZupan MihaZupan left a comment


This is great!


@stephentoub stephentoub left a comment


All feedback addressed.


@joperezr joperezr left a comment


LGTM. This is great @stephentoub, Thanks!
