Huge performance regression in Regex.Replace(string, string) #44808
Comments
Tagging subscribers to this area: @eerhardt, @pgovind, @jeffhandley
|
How many replacements are being performed? If we're seeing tons of |
Also, can you put together a standalone, simple C# repro focused on direct use of Regex.Replace? Such investigations are much easier when they come in such a form. |
For the PerfView results I used a data file with 100000 lines (the first lines of the original file shared by the author in the PowerShell issue), and the test script applies 54 replace operators per line. Each line is shorter than 70 characters. I could try to prepare C# code, but only tomorrow (my time zone is UTC+5). |
That'd be helpful. Thanks. |
It's calling Regex.Replace per line, so the input to each Replace call is at most 70 characters? If so this is very surprising. Is there any parallelism being employed, or is all use of regex sequential? |
Here are the statistics for the input file:
|
Thanks, but to clarify my question, it's invoking Regex.Replace once per input line per replacement pattern, and every string input to Regex.Replace is going to be relatively short, e.g. less than a few hundred characters? I suspect that's not actually the case, or if it is, something is very wrong. |
The script reads the data file line by line. For each line it applies 54 replace operations, and each operation is applied to the result of the previous one. |
@stephentoub The repro script calls
# Removes junk lines of text
$data = Get-Content $aioTextFile -Encoding UTF8
# Removes some substrings within the data
$data = $data | ForEach-Object { `
$_ -replace '127\.0\.0\.1', '' -replace '0\.0\.0\.0', '' `
-replace ' ', '' -replace '\t', '' -replace '\|\|', '' `
-replace '\^', '' -replace 'www\.', '' `
-replace '#Shock-Porn', '' -replace '#AAAEcommerce', '' `
-replace '#(Vietnam)', '' -replace '#Acoustic', '' `
-replace '#Albert', '' -replace '#Admedo', '' `
-replace '#AdTriba', '' -replace '#AgentStudio', '' `
-replace '#Algolia', '' -replace '#Race', '' `
-replace '#Email,SMS,Popups', '' -replace '#Appier', '' `
-replace '#ZineOne', '' -replace '#DeadBabies(Anti-Abortion)', '' `
-replace '#AdobeDigitalMarketingSuite', '' -replace '#Race', '' `
-replace '#Attentive', '' -replace '#Auctiva', '' `
-replace '#Beelert', '' -replace '#Granify', '' `
-replace '#Gore', '' -replace '#ExtremistGroup(Uncertain)', '' `
-replace '#SixBit', '' -replace '#DigitalMarketing', '' `
-replace '#Race/Gender(SimonSheppard)', '' -replace '#BoldCommerce', '' `
-replace '#BridgeWell', '' -replace '#BuckSense', '' `
-replace '#Candid.io', '' -replace '#SailThru', '' `
-replace '#PCM,Inc.', '' -replace '#Frooition', '' `
-replace '#RichContext', '' -replace '#SearchSpring', '' `
-replace '#SendPulse', '' -replace '#Kibo', '' `
-replace '#Satanism', '' -replace '#ClearLink', '' `
-replace '#CleverTap', '' -replace '#ClickFirstMarketing', '' `
-replace '\&visitor=\*\&referrer=', '' -replace '://', '' `
-replace '\:\:1ip6-localhostip6-loopbacklocalhost6localhost6\.localdomain6', '' `
-replace '\:\:1localhost', '' -replace '\?action=log_promo_impression', '' `
-replace '\[hidethisheader(viewthelistonitsown)', '' `
-replace '\[viewthelistinadifferentformat\|aboutthislist\|viewtheblocklistofIPaddresses', ''
} |
Looking at the patterns, most of them (except where parens are involved) seem to be looking for exact string literals? I wonder if a possible optimization could be for Regex to realize that the input pattern doesn't contain anything "interesting" from a regex matching perspective and to forward down to the normal |
That would also be a case where it would be nice to have an API to avoid allocating intermediate strings. |
Yeah, something was very wrong: when there were no replacements to be made, we were neglecting to return a buffer to the pool. I'll put up a PR. |
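The class of bug described here (a rented buffer that is not returned on the "nothing to replace" path) can be illustrated with a small sketch. This is not the runtime's code: `Remove-Literal` is a hypothetical helper, it assumes PowerShell 7 (where `System.Buffers.ArrayPool<char>` is available), and it only shows how a `finally` block keeps the early exit from skipping the return to the pool.

```powershell
# Hypothetical helper, for illustration only: removes every occurrence of a
# literal from a string, using a pooled char[] as scratch space.
function Remove-Literal {
    param([string]$Text, [string]$Literal)

    if ([string]::IsNullOrEmpty($Literal)) { return $Text }

    $pool   = [System.Buffers.ArrayPool[char]]::Shared
    $buffer = $pool.Rent([Math]::Max($Text.Length, 1))
    try {
        $written = 0
        $start   = 0
        $found   = $false
        while (($hit = $Text.IndexOf($Literal, $start, [StringComparison]::Ordinal)) -ge 0) {
            # Copy the text between the previous match and this one.
            $Text.CopyTo($start, $buffer, $written, $hit - $start)
            $written += $hit - $start
            $start    = $hit + $Literal.Length
            $found    = $true
        }

        # Early exit when nothing matched: hand back the original string.
        # Without the finally block below, this path would leak the buffer.
        if (-not $found) { return $Text }

        # Copy the tail after the last match and build the result.
        $Text.CopyTo($start, $buffer, $written, $Text.Length - $start)
        $written += $Text.Length - $start
        return [string]::new($buffer, 0, $written)
    }
    finally {
        # Runs on every path, including the early return above.
        $pool.Return($buffer)
    }
}
```

A buffer that is never returned is not a leak in the unmanaged sense, but the pool then has to allocate a fresh array on later rents, which shows up as extra GC pressure when most calls replace nothing.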
Fix is at #44833 |
It's possible, if the regex parse tree contains just a single character or a "multi" (just a text string), if the options are configured appropriately (default would be appropriate), and with the string-based overload (rather than the callback-based overload). I did a quick hack to see what it would save, and for this specific repro, it would approximately double throughput. |
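Until such a fast path exists inside Regex, a script author who only needs literal text removed can get a similar effect on the caller side. The following is a minimal sketch, not something proposed in this thread: the sample line is made up, and it uses three of the literal patterns from the repro with the `String.Replace(string, string)` method instead of `-replace`.

```powershell
# Made-up sample line; the patterns are taken from the repro script.
$line = '0.0.0.0 www.example.com #Acoustic'

# Regex-based: every -replace goes through Regex.Replace.
$viaRegex = $line -replace '0\.0\.0\.0', '' -replace 'www\.', '' -replace '#Acoustic', ''

# Literal-based: String.Replace needs no regex escaping and no regex engine.
$viaString = $line.Replace('0.0.0.0', '').Replace('www.', '').Replace('#Acoustic', '')

# Both produce the same text for this input.
$viaRegex -ceq $viaString
```

The two forms are not interchangeable in general: `-replace` is case-insensitive by default, while `String.Replace(string, string)` is an ordinal, case-sensitive comparison.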
I was going through this just yesterday. This would go down the Boyer-Moore path, right? I'm guessing the speedup is because vectorized string search is much better than non-vectorized BM. I'm wondering if our BM implementation can be optimized/vectorized further. Anyway, this is just an aside, I'm looking at your PR now :) |
It does, yes.
That's likely a contributor, though for many of these the advance distance is pretty large, so I suspect it's not the bulk of the difference. There's non-trivial code to traverse through to get from the public Regex.Replace down to the inner Boyer-Moore search loop in FindFirstChar; if I had to guess, I'd wager a non-trivial chunk of the difference came from my just replacing Regex.Replace with string.Replace in my hack, and thus skipping all of that overhead just getting to the place where the search happens. |
@stephentoub I see you found and fixed one problem. Thanks! |
What test? |
I mean PowerShell/PowerShell#14087 (comment) - the same test on all preview versions ~112sec, on GA ~307sec. |
Does PowerShell execute the regex exactly as provided, or does it prepend anything to it, like .* ? |
I don't see in PowerShell code any modifications of input regex string for replace operator. |
In that case I don't know why there would be any difference between previews and release. If you can come up with a repro for a different regression from 3.1, please open a new issue. Thanks. |
@stephentoub When can we get #44833? Will it be in 6.0 Preview 1? After that we could re-run the test in PowerShell and give feedback if needed. |
Returning to my original comment -- if this pattern of heavily chained replacements is a PowerShell idiom, I could imagine PowerShell could recognize it and use some API optimized for this purpose (to avoid intermediates). Perhaps that would look like part of #23602 which we already discussed. |
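No such API is shown in this thread, but a script can already avoid most of the intermediate strings by folding several literal removals into one regex. The sketch below is an assumption-laden illustration, not the optimization being discussed: the sample input is made up, and a single-pass alternation is not strictly equivalent to a sequential chain, because a chain can match text that only appears after an earlier removal.

```powershell
# A few of the literals from the repro, written unescaped.
$literals = '127.0.0.1', '0.0.0.0', 'www.', '#Acoustic'

# Escape each literal and join them into one alternation.
$pattern  = ($literals | ForEach-Object { [regex]::Escape($_) }) -join '|'

# IgnoreCase mirrors the default case-insensitivity of -replace.
$combined = [regex]::new($pattern, 'IgnoreCase')

# One Replace call, one result string, instead of one per literal.
$result = $combined.Replace('0.0.0.0 www.example.com #Acoustic', '')
```

Building the `Regex` object once and reusing it for every line also avoids repeated cache lookups for the pattern.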
@danmosemsft Most PowerShell users are not experienced developers (not even in PowerShell) and can create the most incredible constructs. Of course, a chain of 54 replace operators is not a typical case, but in my experience such chains are used; this is one of the simple and pleasant features of PowerShell. If we consider that this is how users process modern logs, which are tens of gigabytes in size, then performance is undoubtedly important. I'd speculate that Regex in general is too complex for end users, and most of the Regex patterns PowerShell users create are very simple. If we look at the 54 patterns, they are all simple. I guess the .NET team could use this fact to create allocation-free Regex optimizations. |
The allocation here is the string returned from Regex.Replace; if anything is replaced, it has to return a new string (it returns the original string if nothing is replaced). |
My note was about the chained scenario. I can confirm the fix works. (I did not compile PowerShell against the nightly .NET 6.0 build; I only copied in the DLLs. After that, the original script works as fast as with .NET 5.0.) |
Description
In the PowerShell repository we got a report that a user script runs significantly slower on PowerShell 7.1 (based on .NET 5.0) than on PowerShell 7.0 (based on .NET Core 3.1).
After some investigation we see that the script makes many calls to the Regex.Replace(string, string) method and chokes the GC with ReadOnlyMemory.
Please see PowerShell/PowerShell#14087 for details. PerfView files are attached there.
Regression?
Yes, it is a regression in .NET 5.0 from .NET Core 3.1.