Huge performance regression in Regex.Replace(string, string) #44808

iSazonov · 2020-11-17T16:56:18Z

Description

In PowerShell repository we get a report that an user script works significant slower on PowerShell 7.1 (based on .Net 5.0) vs PowerShell 7.0 (based on .Net 3.1).

After some investigations we see the script makes many calls of Regex.Replace(string, string) method and chokes GC with ReadOnlyMemory.

Please see PowerShell/PowerShell#14087 for details. PerfView files are attached there.

Regression?

Yes, it is regression in .Net 5.0 from .Net 3.1.

ghost · 2020-11-17T16:56:22Z

Tagging subscribers to this area: @eerhardt, @pgovind, @jeffhandley
See info in area-owners.md if you want to be subscribed.

Issue Details

Description:	Description In PowerShell repository we get a report that an user script works significant slower on PowerShell 7.1 (based on .Net 5.0) vs PowerShell 7.0 (based on .Net 3.1). After some investigations we see the script makes many calls of Regex.Replace(string, string) method and chokes GC with ReadOnlyMemory. Please see PowerShell/PowerShell#14087 for details. PerfView files are attached there. Regression? Yes, it is regression in .Net 5.0 from .Net 3.1.
Author:	iSazonov
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `tenet-performance`, `untriaged`
Milestone:	-

stephentoub · 2020-11-17T16:58:25Z

How many replacements are being performed? If we're seeing tons of ReadOnlyMemory<char>[] being created from Regex.Replace, that suggests the size of the array is too big for ArrayPool to cache, which suggests over a million replacements are being performed as part of a single Regex.Replace?

stephentoub · 2020-11-17T17:01:37Z

Also, can you put together a standalone, simple C# repro focused on direct use of Regex.Replace? Such investigations are much easier when they come in such a form.

iSazonov · 2020-11-17T17:22:58Z

For the PerfVew results I used a data file with 100000 lines (it is first lines from original file shared in PowerShell issue by author) and the test script has 54 replace operators called per line. Line size is < 70 chars.

I could try to prepare C# code but only tomorrow (my time zone is UTC+5).

stephentoub · 2020-11-17T17:23:39Z

I could try to prepare C# code but only tomorrow (my time zone is UTC+5).

That'd be helpful. Thanks.

stephentoub · 2020-11-17T17:24:18Z

the test script has 54 replace operators called per line. Line size is < 70 chars

It's calling Regex.Replace per line, so the input to each Replace call is at most 70 characters? If so this is very surprising. Is there any parallelism being employed, or is all use of regex sequential?

iSazonov · 2020-11-17T17:31:13Z

the test script has 54 replace operators called per line. Line size is < 70 chars

It's calling Regex.Replace per line, so the input to each Replace call is at most 70 characters?

Here statistics for the input file:

Count Length
----- ----
    1 0
    2 1
   14 5
  135 6
  226 7
  539 8
 1029 9
 1487 10
 2058 11
 3904 12
 4415 13
 4255 14
 5490 15
 4947 16
 5059 17
 4703 18
 5490 19
 6525 20
 4354 21
 7114 22
 3285 23
 4032 24
 2443 25
 1532 26
 8881 27
 2661 28
 1747 29
 1498 30
 1179 31
 2989 32
  684 33
  917 34
 1195 35
  502 36
  439 37
  313 38
  364 39
  296 40
  237 41
  147 42
  405 43
   85 44
  185 45
  263 46
   73 47
   83 48
   85 49
   62 50
   54 51
   99 52
  221 53
   40 54
   33 55
   90 56
   95 57
   22 58
   41 59
  106 60
  275 61
    8 62
   24 63
   10 64
   14 65
   58 66
   14 67
    8 68
   11 69
   12 70
   13 71
    6 72
   76 73
    3 74
   11 75
  252 76
    1 77
    6 78
    2 79
    1 80
    4 81
    5 82
    5 83
    4 84
    3 85
    3 86
    1 88
    1 89
    1 90
    2 91
    1 92
    1 93
    1 94
    1 95
    1 100
    2 105
    2 106
    1 107
    1 108
    1 109
    1 110
    1 111
    2 112
    3 113
    4 114
    1 115
    1 118
    3 121
    1 124
    3 125
    1 126
    1 132
    1 141
    1 253

stephentoub · 2020-11-17T17:34:08Z

Here statistics for the input file:

Thanks, but to clarify my question, it's invoking Regex.Replace once per input line per replacement pattern, and every string input to Regex.Replace is going to be relatively short, e.g. less than a few hundred characters? I suspect that's not actually the case, or if it is, something is very wrong.

iSazonov · 2020-11-17T17:44:08Z

The script read the data file a line by line, for each line it applies 54 replace operations - each operation is applied to the result of the previous one.
Most of lines is short (max size 253 for 1 line).

daxian-dbw · 2020-11-17T17:53:10Z

@stephentoub The repro script calls -replace (which calls to regex.Replace(string, string)) 54 times per input line with different replacement patterns.

        # Removes junk lines of text
        $data = Get-Content $aioTextFile -Encoding UTF8

        # Removes some substrings within the data
        $data = $data | ForEach-Object { `
                $_ -replace '127\.0\.0\.1', '' -replace '0\.0\.0\.0', '' `
                -replace ' ', '' -replace '\t', '' -replace '\|\|', '' `
                -replace '\^', '' -replace 'www\.', '' `
                -replace '#Shock-Porn', '' -replace '#AAAEcommerce', '' `
                -replace '#(Vietnam)', '' -replace '#Acoustic', '' `
                -replace '#Albert', '' -replace '#Admedo', '' `
                -replace '#AdTriba', '' -replace '#AgentStudio', '' `
                -replace '#Algolia', '' -replace '#Race', '' `
                -replace '#Email,SMS,Popups', '' -replace '#Appier', '' `
                -replace '#ZineOne', '' -replace '#DeadBabies(Anti-Abortion)', '' `
                -replace '#AdobeDigitalMarketingSuite', '' -replace '#Race', '' `
                -replace '#Attentive', '' -replace '#Auctiva', '' `
                -replace '#Beelert', '' -replace '#Granify', '' `
                -replace '#Gore', '' -replace '#ExtremistGroup(Uncertain)', '' `
                -replace '#SixBit', '' -replace '#DigitalMarketing', '' `
                -replace '#Race/Gender(SimonSheppard)', '' -replace '#BoldCommerce', '' `
                -replace '#BridgeWell', '' -replace '#BuckSense', '' `
                -replace '#Candid.io', '' -replace '#SailThru', '' `
                -replace '#PCM,Inc.', '' -replace '#Frooition', '' `
                -replace '#RichContext', '' -replace '#SearchSpring', '' `
                -replace '#SendPulse', '' -replace '#Kibo', '' `
                -replace '#Satanism', '' -replace '#ClearLink', '' `
                -replace '#CleverTap', '' -replace '#ClickFirstMarketing', '' `
                -replace '\&visitor=\*\&referrer=', '' -replace '://', '' `
                -replace '\:\:1ip6-localhostip6-loopbacklocalhost6localhost6\.localdomain6', '' `
                -replace '\:\:1localhost', '' -replace '\?action=log_promo_impression', '' `
                -replace '\[hidethisheader(viewthelistonitsown)', '' `
                -replace '\[viewthelistinadifferentformat\|aboutthislist\|viewtheblocklistofIPaddresses', ''
        }

GrabYourPitchforks · 2020-11-17T18:20:19Z

Looking at the patterns, most of them (except where parens are involved) seem to be looking for exact string literals? I wonder if a possible optimization could be for Regex to realize that the input pattern doesn't contain anything "interesting" from a regex matching perspective and to forward down to the normal string.Replace API in those cases.

danmoseley · 2020-11-17T19:10:20Z

That would also be a case where it would be nice to have an API to avoid allocating intermediate strings.

stephentoub · 2020-11-17T20:40:59Z

if it is, something is very wrong.

Yeah, something was very wrong: when there were no replacements to be made, we were neglecting to return a buffer to the pool. I'll put up a PR.

stephentoub · 2020-11-17T21:26:03Z

Fix is at #44833

stephentoub · 2020-11-17T21:37:59Z

Looking at the patterns, most of them (except where parens are involved) seem to be looking for exact string literals? I wonder if a possible optimization could be for Regex to realize that the input pattern doesn't contain anything "interesting" from a regex matching perspective and to forward down to the normal string.Replace API in those cases.

It's possible, if the regex parse tree contains just a single character or a "multi" (just a text string), if the options are configured appropriately (default would be appropriate), and with the string-based overload (rather than the callback-based overload). I did a quick hack to see what it would save, and for this specific repro, it would appx double throughput.

pgovind · 2020-11-17T21:45:59Z

if the regex parse tree contains just a single character or a "multi"

I was going through this just yesterday. This would go down the Boyer Moore path right? I'm guessing the speedup is because vectorized string search is much better than non-vectorized BM. I'm wondering if our BM implementation can be optimized/vectorized further. Anyway, this is just an aside, I'm looking at your PR now :)

stephentoub · 2020-11-17T22:45:01Z

I was going through this just yesterday. This would go down the Boyer Moore path right?

It does, yes.

I'm guessing the speedup is because vectorized string search is much better than non-vectorized BM.

That's likely a contributor, though for many of these the advance distance is pretty large, so I suspect it's not the bulk of the difference. There's non-trivial code to traverse through to get from the public Regex.Replace down to the inner Boyer-Moore search loop in FindFirstChar; if I had to guess, I'd wager a non-trivial chunk of the difference came from my just replacing Regex.Replace with string.Replace in my hack, and thus skipping all of that overhead just getting to the place where the search happens.

iSazonov · 2020-11-18T03:39:43Z

@stephentoub I see you found and fixed one problem. Thanks!
In my original test, the result was 112 seconds in all 5.0 preview versions, but in the final version it was over 300 seconds. I'm afraid there was something else changed that we still haven't covered here.

stephentoub · 2020-11-18T11:05:49Z

In my original test, the result was 112 seconds in all 5.0 preview versions, but in the final version it was over 300 seconds. I'm afraid there was something else changed that we still haven't covered here.

What test?

iSazonov · 2020-11-18T12:38:33Z

I mean PowerShell/PowerShell#14087 (comment) - the same test on all preview versions ~112sec, on GA ~307sec.

stephentoub · 2020-11-18T12:47:08Z

Does PowerShell execute the regex exactly as provided, or does it prepend anything to it, like .* ?

iSazonov · 2020-11-18T14:23:51Z

I don't see in PowerShell code any modifications of input regex string for replace operator.

stephentoub · 2020-11-18T14:28:22Z

In that case I don't know why there would be any difference btw previews and release. If you can come up with a repro for a different regression from 3.1, please open a new issue. Thanks.

iSazonov · 2020-11-18T14:49:58Z

@stephentoub When can we get # 44833? Will it be in 6.0 Preview1? After that we could re-run the test in PowerShell and feedback if needed.

stephentoub · 2020-11-18T14:52:29Z

@iSazonov, it's already merged into master, so any 6.0 preview would have it; it should also be in nightly builds starting today if you wanted to try running with one of those. And #44837 is for possibly getting it into 5.0.x servicing.

danmoseley · 2020-11-18T16:58:39Z

Returning to my original comment -- if this pattern of heavily chained replacements is a Powershell idiom, I could imagine Powershell could recognize it and use some API optimized for this purpose (to avoid intermediates). Perhaps that would look like part of #23602 which we already discussed.

iSazonov · 2020-11-18T17:38:22Z

@danmosemsft Most PowerShell users are not experienced developers (not even PowerShell) and can create the most incredible constructs. Of course, a chain of 54 substitution operators is not a typical case, but from my experience I would say that such chains are used - this is one of the simple and pleasant features of PowerShell. If we think that this is how users process modern logs, which are tens of gigabytes in size, then performance is undoubtedly important.
If there will be a new API that allows chaining, then PowerShell will not be able to immediately benefit from it - we will need to either improve the PowerShell engine or the operator itself. Also we could think about search a string by one pattern and then replace in the result string by another pattern.

I'd speculate that Regex in common is too complex for end users and most of Regex pattern PowerShell users create is very, very simple. If we look the 54 patterns they all is simple. I guess .Net team could use the fact to create allocationless Regex optimizations.

stephentoub · 2020-11-18T17:40:04Z

I guess .Net team could use the fact to create allocationless Regex optimizations.

The allocation here is the string returned from Regex.Replace; if anything is replaced, it has to return a new string (it returns the original string if nothing is replaced).

iSazonov · 2020-11-23T05:08:22Z

I guess .Net team could use the fact to create allocationless Regex optimizations.

The allocation here is the string returned from Regex.Replace; if anything is replaced, it has to return a new string (it returns the original string if nothing is replaced).

My note was for chained scenario.

I can confirm the fix works. (I don't compile PowerShell with nightly .Net 6.0 build and only copy-paste the dlls - after that the original script works as fast as with .Net 5.0.)

iSazonov added the tenet-performance Performance related issue label Nov 17, 2020

Dotnet-GitSync-Bot added area-System.Text.RegularExpressions untriaged New issue has not been triaged by the area owner labels Nov 17, 2020

danmoseley added the regression-from-last-release label Nov 17, 2020

stephentoub self-assigned this Nov 17, 2020

stephentoub removed the untriaged New issue has not been triaged by the area owner label Nov 17, 2020

stephentoub added this to the 6.0.0 milestone Nov 17, 2020

stephentoub added the bug label Nov 17, 2020

stephentoub mentioned this issue Nov 17, 2020

Return pooled arrays in Regex.Replace when no replacements #44833

Merged

stephentoub mentioned this issue Nov 17, 2020

[release/5.0] Return pooled arrays in Regex.Replace when no replacements #44837

Merged

stephentoub closed this as completed Nov 20, 2020

daxian-dbw mentioned this issue Dec 2, 2020

Reduced Performance in PowerShell 7.1? PowerShell/PowerShell#14087

Closed

ghost locked as resolved and limited conversation to collaborators Dec 23, 2020

stephentoub modified the milestones: 6.0.0, 5.0.1 Jan 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Huge performance regression in Regex.Replace(string, string) #44808

Huge performance regression in Regex.Replace(string, string) #44808

iSazonov commented Nov 17, 2020

ghost commented Nov 17, 2020

Description

Regression?

stephentoub commented Nov 17, 2020

stephentoub commented Nov 17, 2020 •

edited

Loading

iSazonov commented Nov 17, 2020

stephentoub commented Nov 17, 2020

stephentoub commented Nov 17, 2020 •

edited

Loading

iSazonov commented Nov 17, 2020

stephentoub commented Nov 17, 2020

iSazonov commented Nov 17, 2020

daxian-dbw commented Nov 17, 2020

GrabYourPitchforks commented Nov 17, 2020

danmoseley commented Nov 17, 2020

stephentoub commented Nov 17, 2020

stephentoub commented Nov 17, 2020

stephentoub commented Nov 17, 2020

pgovind commented Nov 17, 2020

stephentoub commented Nov 17, 2020

iSazonov commented Nov 18, 2020

stephentoub commented Nov 18, 2020

iSazonov commented Nov 18, 2020

stephentoub commented Nov 18, 2020

iSazonov commented Nov 18, 2020

stephentoub commented Nov 18, 2020

iSazonov commented Nov 18, 2020

stephentoub commented Nov 18, 2020

danmoseley commented Nov 18, 2020

iSazonov commented Nov 18, 2020

stephentoub commented Nov 18, 2020

iSazonov commented Nov 23, 2020

Huge performance regression in Regex.Replace(string, string) #44808

Huge performance regression in Regex.Replace(string, string) #44808

Comments

iSazonov commented Nov 17, 2020

Description

Regression?

ghost commented Nov 17, 2020

Description

Regression?

stephentoub commented Nov 17, 2020

stephentoub commented Nov 17, 2020 • edited Loading

iSazonov commented Nov 17, 2020

stephentoub commented Nov 17, 2020

stephentoub commented Nov 17, 2020 • edited Loading

iSazonov commented Nov 17, 2020

stephentoub commented Nov 17, 2020

iSazonov commented Nov 17, 2020

daxian-dbw commented Nov 17, 2020

GrabYourPitchforks commented Nov 17, 2020

danmoseley commented Nov 17, 2020

stephentoub commented Nov 17, 2020

stephentoub commented Nov 17, 2020

stephentoub commented Nov 17, 2020

pgovind commented Nov 17, 2020

stephentoub commented Nov 17, 2020

iSazonov commented Nov 18, 2020

stephentoub commented Nov 18, 2020

iSazonov commented Nov 18, 2020

stephentoub commented Nov 18, 2020

iSazonov commented Nov 18, 2020

stephentoub commented Nov 18, 2020

iSazonov commented Nov 18, 2020

stephentoub commented Nov 18, 2020

danmoseley commented Nov 18, 2020

iSazonov commented Nov 18, 2020

stephentoub commented Nov 18, 2020

iSazonov commented Nov 23, 2020

stephentoub commented Nov 17, 2020 •

edited

Loading

stephentoub commented Nov 17, 2020 •

edited

Loading