Enable regex source gen / compiler vectorization of all sets #83992

stephentoub · 2023-03-27T19:52:38Z

When emitting TryFindNextPossibleStartingPosition based on a set that's a fixed distance from the start of the pattern, today we're able to vectorize a variety of cases. In .NET 7, this was largely limited to very small sets that contained only 1-5 characters. Earlier in .NET 8, we expanded this to include ranges (e.g. [0-9]) and, more impactfully, sets containing any number of only ASCII values (e.g. [A-Za-z0-9]). That still, however, leaves sets that are non-contiguous and contain anything non-ASCII, which is extremely common. For example, the set [a-z] with RegexOptions.IgnoreCase actually gets expanded to be [A-Za-z\u212A] because the Kelvin sign is considered case-equivalent with k, and that means for such a set, we wouldn't employ any vectorized helper and would instead just walk character by character.

With this PR, we'll now employ some vectorization for that remaining class of sets as well, in both the compiler and source generator. In particular, we do a vectorized search for the ASCII portion of the set along with anything non-ASCII. If the found result is ASCII, we're done. If the found result is non-ASCII, then we proceed to walk character by character as was done previously.
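
As a rough illustration of that two-phase search (not the code the source generator actually emits, which isn't shown here), the idea can be sketched for the set [A-Za-z\u212A]. The class and method names below are hypothetical, and the use of .NET 8's SearchValues<char> with IndexOfAnyExcept is an assumption about one reasonable way to express "the ASCII portion of the set along with anything non-ASCII": skip every ASCII character that is not in the set, so the search stops at either an ASCII member or any non-ASCII character.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical sketch; names are illustrative, not taken from the PR.
public static class SetSearchSketch
{
    // Every ASCII character that is NOT in [A-Za-z\u212A].
    private static readonly SearchValues<char> s_asciiNotInSet =
        SearchValues.Create(BuildAsciiComplement());

    private static string BuildAsciiComplement()
    {
        var sb = new StringBuilder();
        for (char c = '\0'; c < 128; c++)
        {
            if (!char.IsAsciiLetter(c)) sb.Append(c);
        }
        return sb.ToString();
    }

    // Full membership check for the set, used only on the slow path.
    private static bool IsInSet(char c) => char.IsAsciiLetter(c) || c == '\u212A';

    public static int IndexOfSetMember(ReadOnlySpan<char> span)
    {
        // Vectorizable step: stops at the first char that is either an
        // ASCII member of the set or any non-ASCII char.
        int i = span.IndexOfAnyExcept(s_asciiNotInSet);
        if (i < 0) return -1;

        // ASCII hit: it must be a member of the set, so we're done.
        if (char.IsAscii(span[i])) return i;

        // Non-ASCII hit: fall back to walking character by character.
        for (; i < span.Length; i++)
        {
            if (IsInSet(span[i])) return i;
        }
        return -1;
    }
}
```

For mostly-ASCII input the slow path is rarely reached, which is where a speedup would be expected to come from under this sketch's assumptions.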

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Net.Http;
using System.Text.RegularExpressions;

public partial class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private static readonly string s_input = new HttpClient().GetStringAsync("https://www.gutenberg.org/cache/epub/3200/pg3200.txt").Result;

    [GeneratedRegex(@"\s+")] private static partial Regex FindWhitespace();
    [GeneratedRegex(@"\d+")] private static partial Regex FindNumbers();
    [GeneratedRegex(@"\b\w+\b")] private static partial Regex FindWords();
    [GeneratedRegex(@"\p{P}")] private static partial Regex FindPunctuation();

    [Benchmark] public int CountWhitespace() => FindWhitespace().Count(s_input);
    [Benchmark] public int CountNumbers() => FindNumbers().Count(s_input);
    [Benchmark] public int CountWords() => FindWords().Count(s_input);
    [Benchmark] public int CountPunctuation() => FindPunctuation().Count(s_input);
}

Before (source gen):

Method	Mean	Error	StdDev
CountWhitespace	125.06 ms	0.407 ms	0.340 ms
CountNumbers	15.25 ms	0.027 ms	0.025 ms
CountWords	145.53 ms	2.324 ms	2.174 ms
CountPunctuation	37.27 ms	0.336 ms	0.263 ms

After (source gen):

Method	Mean	Error	StdDev
CountWhitespace	98.698 ms	0.7504 ms	0.6652 ms
CountNumbers	2.520 ms	0.0158 ms	0.0124 ms
CountWords	146.225 ms	2.9665 ms	2.7748 ms
CountPunctuation	21.003 ms	0.0400 ms	0.0354 ms

ghost · 2023-03-27T19:52:50Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `tenet-performance`
Milestone:	8.0.0

MihaZupan

This is great!

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

stephentoub

All feedback addressed.

joperezr

LGTM. This is great @stephentoub, Thanks!

stephentoub added area-System.Text.RegularExpressions tenet-performance Performance related issue labels Mar 27, 2023

stephentoub added this to the 8.0.0 milestone Mar 27, 2023

stephentoub requested review from joperezr and MihaZupan March 27, 2023 19:52

ghost assigned stephentoub Mar 27, 2023

MihaZupan approved these changes Mar 27, 2023

build-analysis bot mentioned this pull request Mar 27, 2023

CI failure STRICT_JS doesn't work with MODULARIZE or EXPORT_ES6 #83986

Closed

MihaZupan reviewed Mar 27, 2023

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs (resolved)

joperezr reviewed Mar 27, 2023

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs (outdated, resolved)

build-analysis bot mentioned this pull request Mar 27, 2023

WasmTestOnBrowser-System.* test failures in CI #83655

Closed

stephentoub added 2 commits March 27, 2023 20:48

Address PR feedback

75c6b2d

stephentoub force-pushed the regexindexofset branch from 5dd0384 to 75c6b2d Compare March 28, 2023 01:23

stephentoub commented Mar 28, 2023

joperezr approved these changes Mar 28, 2023

stephentoub merged commit 2fcd592 into dotnet:main Mar 28, 2023

stephentoub deleted the regexindexofset branch March 28, 2023 10:18

stephentoub mentioned this pull request Mar 29, 2023

[API Proposal]: MemoryExtensions.IndexOfAny{Except}WhiteSpace #77959

Open

This was referenced Mar 30, 2023

[Perf] Windows/arm64: 3 Improvements on 3/28/2023 10:24:48 AM dotnet/perf-autofiling-issues#14628

Closed

[Perf] Windows/arm64: 6 Improvements on 3/28/2023 10:24:48 AM dotnet/perf-autofiling-issues#14621

Closed

This was referenced Mar 30, 2023

Regex should explicitly target IndexOfAnyExcept(char {, char, char}) when handling non-ASCII vectorized sets #84150

Closed

Regressions in System.Text.RegularExpressions.Tests.Perf_Regex_Common #84139

Closed

MihaZupan mentioned this pull request Mar 31, 2023

Improve IndexOfAnyValues throughput for needles with 0 #84184

Merged

ghost locked as resolved and limited conversation to collaborators Apr 27, 2023
