System.Text.RegularExpressions performance regressions #65187
Comments
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions

Issue Details

The latest manual perf run (.NET 6 vs. .NET 7 Preview 1) discovered multiple regressions in System.Text.RegularExpressions. Since most of them were not reported previously, I am opening this issue so the owners of System.Text.RegularExpressions can triage them accordingly.

I've uploaded a full report here. It's sorted descending, from biggest regression to biggest improvement.

@performanceautofiler has not reported these regressions because they were most likely introduced before the benchmarks were added in November 2021 (historical data).

Sample repro:

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 net7.0 --filter System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count

Sample results:

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "Sherlock|Holmes|Watson|Irene|Adler|John|Baker", Options: Compiled)
System.Text.RegularExpressions.Tests.Perf_Regex_Common.CtorInvoke(Options: Compiled)

(cc @stephentoub)
But overall the benchmark hits the weakest spot in the new …
While I might enjoy passing the blame to you for this one @EgorBo 😜, I don't think this is at all your fault.

First, for anyone reading, I'll call out that while this one specific test regressed, overall things have improved dramatically (and that's beyond all the previous improvements in .NET 5). Looking at @adamsitnik's cited report above, searching for "Slower" yields 474 results, "Same" yields 1,555 results, and "Faster" yields 2,318 results, so many more benchmarks have improved than regressed. And some of those improvements are staggering.

Regex is one of those areas where it's difficult to look at just one result and make claims about improvements or regressions: many optimizations play a guessing game as to which scenarios are most common and optimize for those, potentially at the expense of others (as with many optimizations, something common improves significantly in exchange for something expected to be less common regressing a bit).

In this particular case, the pattern "Sherlock|Holmes|Watson|Irene|Adler|John|Baker" suffers from a change that is typically beneficial but is a step backwards here. To find a match, we want to skip ahead as quickly as possible to the next position that could possibly match. Previously, we would do that by looking at the first character of each branch ('S', 'H', 'W', 'I', 'A', 'J', and 'B'). That set is too large to be vectorized by IndexOfAny as it stands today, so we instead build a bitmap lookup and walk each character of the input looking for the first character that has a bit set. However, .NET 7 now recognizes that, although there are 7 branches, other positions have overlap: if it were to instead focus on the second letter of each branch rather than the first, that would yield the set 'h', 'o', 'a', 'r', 'd', which is only 5 characters, and IndexOfAny does vectorize when there are 5 inputs.
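The position-picking idea above can be sketched with a tiny model. Python is used purely for illustration; the function names, the threshold value, and the "first small-enough set" rule are assumptions for the sketch, not the actual engine's heuristics:

```python
# Simplified, hypothetical model of the prefix-position heuristic described
# above: for an alternation, find a character position whose set of distinct
# characters across all branches is small enough to vectorize with IndexOfAny.
VECTORIZABLE_SET_LIMIT = 5  # assumed threshold; the real engine's limits differ

def chars_at_position(branches, pos):
    """Distinct characters appearing at index `pos` in each branch."""
    return {b[pos] for b in branches}

def pick_search_position(branches):
    """Return (position, charset) for the first position whose character set
    is small enough to vectorize, scanning only positions every branch covers."""
    min_len = min(len(b) for b in branches)
    for pos in range(min_len):
        charset = chars_at_position(branches, pos)
        if len(charset) <= VECTORIZABLE_SET_LIMIT:
            return pos, charset
    return None  # no vectorizable position; fall back to a bitmap scan

branches = "Sherlock|Holmes|Watson|Irene|Adler|John|Baker".split("|")
print(chars_at_position(branches, 0))  # 7 distinct first letters: too many
print(pick_search_position(branches))  # position 1 yields only 5 letters
```

Running this on the Sherlock pattern reproduces the analysis in the comment: 7 distinct first letters, but only 5 distinct second letters ('h', 'o', 'a', 'r', 'd').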
So in .NET 7, rather than walking the input character by character, it instead does an IndexOfAny("adhor"). This is typically a good choice, because it can process 8 or 16 characters at a time rather than 1 at a time. But in this particular case, those letters are very common in the input, so the vectorized loop ends up stopping frequently to examine whether it has found a good place to attempt a full match, and that ends up costing us. Hence the regression.

I knew this at the time I made the change, which improved other things as much as, and frequently more than, it hurt this particular case. There's very likely tweaking that could be done in the heuristics to improve this, but it would still be a guessing game around all the tradeoffs involved. It'd be great if IndexOfAny could be improved further, as Regex now makes very heavy use of it. It'd also be really cool if someone could explore a .NET implementation of Teddy (#62447).
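To see why a vectorizable set can still lose, here is a hypothetical sketch (not the engine's code) that counts how often each strategy would have to stop and attempt a full match on a sample sentence; the sample text and helper are invented for illustration:

```python
# Hypothetical sketch contrasting the two scan strategies by counting how
# often each would pause the fast scan to verify a potential match.
FIRST_CHARS = set("SHWIAJB")   # first letter of each alternation branch
SECOND_CHARS = set("hoard")    # second letters: small enough to vectorize

def candidate_stops(text, charset, offset):
    """Positions where a scan for `charset` would pause to verify a match.
    `offset` is how far into each branch the scanned character sits, so the
    verifier backs up by that much before attempting the full match."""
    return [i - offset for i, ch in enumerate(text) if ch in charset and i >= offset]

text = "Sherlock Holmes and Dr. Watson walked to Baker Street."
first = candidate_stops(text, FIRST_CHARS, 0)
second = candidate_stops(text, SECOND_CHARS, 1)
# The second-letter set is vectorizable, but common letters like 'a'/'o'/'r'
# make it stop at many positions that turn out not to be real matches.
print(len(first), len(second))  # 5 stops vs. 15 stops
```

Even in this short sentence, the vectorized "adhor" scan pauses three times as often as the first-letter scan; each pause is a failed (or successful) match attempt, which is exactly the cost the comment describes.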
Thanks for the analysis and for the explanation of the specific case that was called out, @stephentoub. Also, thanks @adamsitnik for running these tests and for the report.

It looks like, at least for the 2 specific tests called out above, we probably don't want to fix those, since as @stephentoub points out, these were less common cases that regressed a bit in order to provide a much larger win for more common cases.

@stephentoub, is it feasible to plan a quick look at the biggest regressions among those 474 and ensure that they are all cases of "optimizing for a more common case" as opposed to accidental regressions? Or are we confident enough that this is the case, and that for most accidental regressions we already have a bug logged? I mostly ask in order to know what to do with this issue.
Sure. We can also probably invest in reducing Regex ctor overhead. We've driven down execution time in part by doing more analysis during construction, and we've considered that a worthwhile tradeoff in general, but we haven't spent much time seeing whether we could do the same analysis for less, or whether we might want to skip some of the analysis when using the interpreter, etc.
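As a side note on the construction-cost tradeoff mentioned above: compile-time analysis only pays off if the constructed object is reused. A minimal sketch, using Python's `re` module as an analogy for the .NET Regex ctor (the caching helper is hypothetical, not an API from either library):

```python
# Sketch of amortizing pattern-construction cost by caching the compiled
# object, so the up-front analysis is paid once rather than per call.
import re
from functools import lru_cache

@lru_cache(maxsize=256)
def compiled(pattern: str) -> re.Pattern:
    """Compile `pattern` once and reuse the compiled object across calls."""
    return re.compile(pattern)

def count_matches(pattern: str, text: str) -> int:
    return len(compiled(pattern).findall(text))

text = "Sherlock Holmes and Dr. Watson"
n = count_matches("Sherlock|Holmes|Watson|Irene|Adler|John|Baker", text)
# A second lookup hits the cache: same compiled object, no re-analysis.
same = compiled("Sherlock|Holmes|Watson|Irene|Adler|John|Baker") is compiled(
    "Sherlock|Holmes|Watson|Irene|Adler|John|Baker")
print(n, same)  # 3 True
```

The same pattern applies in .NET: holding on to a constructed Regex instance (or using a cached static instance) keeps the heavier constructor analysis off the hot path.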