Performance of certain build-in functions, esp the ones related to Array manipulation #9390

Closed
abelbraaksma opened this issue Jun 3, 2020 · 28 comments
Labels: Area-Library (Issues for FSharp.Core not covered elsewhere), Feature Request

Comments

abelbraaksma (Contributor) commented Jun 3, 2020

I've noticed for some time now that quite a few functions that are called in many places are suboptimal in terms of performance. For instance, Array.create, Array.init, Array.concat and Array.filter can be improved upon, sometimes by about 1.2-1.5x, sometimes even manyfold (I got Array.replicate times source |> Array.concat 25x faster with a dedicated Array.repeat, so that was a bit of cheating).
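
For illustration, a dedicated Array.repeat of the kind mentioned above could look roughly like this. This is a hypothetical sketch (the name repeat and its exact shape are mine, not the code that was benchmarked): it allocates the result once and grows the copied region by doubling with Array.Copy, instead of building count intermediate arrays and concatenating them.

/// Hypothetical sketch of a dedicated Array.repeat (not the benchmarked code)
let repeat (count: int) (source: 'T[]) : 'T[] =
    if count < 0 then invalidArg "count" "must be non-negative"
    let len = source.Length
    let target = Array.zeroCreate (len * count)
    if len > 0 && count > 0 then
        // seed the first block
        Array.Copy(source, 0, target, 0, len)
        let mutable copied = len
        // keep doubling the filled region while it still fits
        while copied * 2 <= target.Length do
            Array.Copy(target, 0, target, copied, copied)
            copied <- copied * 2
        // copy whatever remains (always fewer than `copied` items)
        Array.Copy(target, 0, target, copied, target.Length - copied)
    target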

Sometimes such improvements relate specifically to large arrays, sometimes specifically to small ones. In line with that, the String.xxx functions that use a StringBuilder improve by about 20-30% when ToCharArray() is used instead.

Before I start making a PR, which would certainly include showing the timings with BenchmarkDotNet for a variety of inputs, would such a PR be welcome? While nothing I tried is really complex, and what I tried is provably correct, as always with optimizations, they will increase the complexity of the code.

To get some idea, here's String.map vs a very simple optimization, str.ToCharArray() |> Array.map mapper |> String, which in many cases wins by about 40% (this needs comparisons over more input lengths, though; the run below used a 12-char string):

| Method    | Job        | Jit       | Platform | Runtime    | Mean     | Error   | StdDev  | Min      | Max      | Ratio | Rank | Gen 0  | Gen 1 | Gen 2 | Allocated |
|-----------|------------|-----------|----------|------------|----------|---------|---------|----------|----------|-------|------|--------|-------|-------|-----------|
| fastMap   | Job-MACJMZ | LegacyJit | X64      | .NET 4.8   | 106.91ns | 1.069ns | 1.000ns | 105.31ns | 108.28ns | 0.65  | 1    | 0.0263 | -     | -     | 168 B     |
| stringMap | Job-MACJMZ | LegacyJit | X64      | .NET 4.8   | 164.29ns | 2.743ns | 2.291ns | 160.02ns | 169.34ns | 1.00  | 2    | 0.0315 | -     | -     | 201 B     |
| fastMap   | Job-CZIGBL | LegacyJit | X86      | .NET 4.8   | 118.16ns | 1.094ns | 0.970ns | 116.88ns | 120.14ns | 0.64  | 1    | 0.0219 | -     | -     | 120 B     |
| stringMap | Job-CZIGBL | LegacyJit | X86      | .NET 4.8   | 184.93ns | 1.647ns | 1.540ns | 181.82ns | 187.86ns | 1.00  | 2    | 0.0233 | -     | -     | 128 B     |
| fastMap   | Job-FZSSVS | RyuJit    | X64      | .NET 4.8   | 95.61ns  | 0.974ns | 0.911ns | 94.33ns  | 97.43ns  | 0.63  | 1    | 0.0261 | -     | -     | 168 B     |
| stringMap | Job-FZSSVS | RyuJit    | X64      | .NET 4.8   | 151.22ns | 1.639ns | 1.368ns | 149.13ns | 154.09ns | 1.00  | 2    | 0.0315 | -     | -     | 201 B     |
| fastMap   | Job-OXSCJZ | LegacyJit | X86      | .NET 4.6.1 | 117.96ns | 1.202ns | 1.065ns | 116.12ns | 119.48ns | 0.62  | 1    | 0.0227 | -     | -     | 120 B     |
| stringMap | Job-OXSCJZ | LegacyJit | X86      | .NET 4.6.1 | 191.32ns | 3.016ns | 2.673ns | 188.10ns | 196.17ns | 1.00  | 2    | 0.0237 | -     | -     | 128 B     |
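
For reference, the kind of BenchmarkDotNet harness behind a table like the one above could be set up roughly as follows. This is a minimal sketch: the benchmark class, the Char.ToUpper mapper and the input lengths are illustrative assumptions, not the exact setup used for the numbers shown.

open System
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

[<MemoryDiagnoser>]
type StringMapBenchmarks() =
    member val Input = "" with get, set

    // illustrative lengths; the real runs covered many more
    [<Params(12, 100)>]
    member val Length = 0 with get, set

    [<GlobalSetup>]
    member this.Setup() = this.Input <- String('x', this.Length)

    // the current FSharp.Core function, used as baseline
    [<Benchmark(Baseline = true)>]
    member this.stringMap() = String.map (fun c -> Char.ToUpper c) this.Input

    // the ToCharArray-based variant described above
    [<Benchmark>]
    member this.fastMap() =
        this.Input.ToCharArray()
        |> Array.map (fun c -> Char.ToUpper c)
        |> String

[<EntryPoint>]
let main _ =
    BenchmarkRunner.Run<StringMapBenchmarks>() |> ignore
    0
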
cartermp (Contributor) commented Jun 3, 2020

@abelbraaksma great observation. Would you be interested in submitting a PR to adjust these functions?

abelbraaksma (Contributor, Author) commented:

@cartermp, sure thing, I just wasn't certain how much "performance sugar" is allowed. I mean, I would assume using Span is not possible because we still support older .NET versions, but heavily-used functions like Array.create benefit tremendously from some loop-unrolling and block-alignment, for instance.

I'll submit a PR (will probably mark it WIP, as this may need some iterations) and have you comment on whether certain performance doping is allowed or not.

abelbraaksma (Contributor, Author) commented Jun 3, 2020

@cartermp, one other question: for the perf comparison, would it be enough to just run RyuJIT with .NET 4.8 on x64, plus one LegacyJit run on x86, or should I add more system configurations (.NET Core, Standard; I don't have Mono installed...)? It should be enough to focus on the 80% user group, right, and not focus too much on niche configurations?

cartermp (Contributor) commented Jun 3, 2020

I think .NET Core (latest) and .NET Framework (any version) would be good enough to run some benchmarks against.

Regarding performance sugar: yes, Span can't be used, because that would make FSharp.Core inaccessible anywhere except .NET Core and Xamarin. But anything else should be fine, so long as it doesn't change the public API or the end result of the functions.

TIHan (Contributor) commented Jun 4, 2020

I think a PR would be fine.

abelbraaksma (Contributor, Author) commented Jun 8, 2020

After playing around a little with the various ways String.map can be optimized, I'm pasting the summarized results here. I expect to be able to use the lessons learned here for the other functions, so that I can settle on fewer variants to test.

What I tried:

  • Removing StringBuilder, as we know the size of the string beforehand.
  • Inlining the loop (as opposed to using Array.map), plus loop unrolling. The latter proved not very effective, as can be seen below (x2 means 2 unrolls, x4 means 4).
  • while vs for, which should mostly be indistinguishable. On .NET Core that's true, but somehow on .NET Framework (not shown here) while was significantly slower.
  • several ways of allocating the new string
  • an approach with fixed and String.Copy. These are the fastest by far and use the least memory, but may not be "admissible", as it makes the functions unsafe.
  • for comparison, a few approaches with Span and String.Create. These cannot be used, as they require an extra dependency on .NET Framework. They are a good alternative to fixed: also little memory use, and faster than the rest, except on small strings.
  • optimizing for small strings, which is smallMap in the overview. This is the winner on the low range, but slightly slower for larger input (though this may be measurement deviation; the code is the same as fasterMap, except for a few leading ifs).

My own verdict:

  • If possible, I would go for fixed without loop-unrolling. It is simple enough, uses the least memory, and is 3x to even 4x faster than the current implementation. The larger the input, the bigger the difference.
  • Otherwise, the choice should be between fasterMap (simple and straightforward, and generally the fastest without "doping") and smallMap (if we want top performance on strings up to 8 chars; above that, they're mostly equal).

Note that every alternative I tried was faster than the current implementation for average input, and almost all of them have lower GC pressure and memory use.

I've also tested with very large inputs, up to 100MB; there the perf difference usually shifts further in favor of the new algorithms, because much less memory is used. With 100MB, fixed is almost 5x faster than current.

Benchmarks were run with BDN, using 0.01 max relative error. Originally each test was run with a variety of mapper functions, but that made little difference, so I left those variations out.

  • bold is the fastest of the likely candidates
  • italics is the fastest of the fixed versions
  • I didn't bother to tag Span and Alt, as I don't think they are possible candidates; they are here for comparison.
  • "current" means String.map as it is now in the FSharp.Core library; it is repeated for each block. Candidate means simplest to adopt. Alt means also simple, but slightly slower on .NET Framework than shown here, so I don't prefer them. Fixed is fixed, and Span is Span.
BenchmarkDotNet=v0.12.1, OS=Windows 7 SP1 (6.1.7601.0)
Intel Xeon CPU X5680 3.33GHz, 2 CPU, 12 logical and 12 physical cores
Frequency=3247119 Hz, Resolution=307.9653 ns, Timer=TSC
.NET Core SDK=3.1.300-preview-015135
  [Host]        : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT DEBUG
  .NET Core 3.1 : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT

Job=.NET Core 3.1  Runtime=.NET Core 3.1  

0 and 1 chars

Method Categories Value Mean Error StdDev Median Ratio Rank Gen 0 Allocated
current Candidate 0 12.86ns 0.101ns 0.089ns 12.89ns 1.00 1 0.0038 24 B
fastMap Candidate 0 13.07ns 0.078ns 0.069ns 13.06ns 1.02 2 0.0038 24 B
fasterMap Candidate 0 12.75ns 0.071ns 0.067ns 12.76ns 0.99 1 0.0038 24 B
smallMap Candidate 0 16.57ns 0.097ns 0.091ns 16.53ns 1.29 3 0.0038 24 B
 
current Alt 0 12.76ns 0.173ns 0.154ns 12.76ns 1.00 1 0.0038 24 B
whileMap Alt 0 12.88ns 0.073ns 0.065ns 12.89ns 1.01 1 0.0038 24 B
allocateMap Alt 0 12.86ns 0.161ns 0.151ns 12.89ns 1.01 1 0.0038 24 B
unrollMapX2 Alt 0 15.88ns 0.213ns 0.199ns 15.78ns 1.25 3 0.0038 24 B
unrollMapX4 Alt 0 15.43ns 0.082ns 0.076ns 15.43ns 1.21 2 0.0038 24 B
 
current Fixed 0 12.59ns 0.172ns 0.161ns 12.51ns 1.00 1 0.0038 24 B
fixedCopy Fixed 0 28.35ns 0.361ns 0.338ns 28.19ns 2.25 2 0.0076 48 B
fixedCopyx2 Fixed 0 28.36ns 0.212ns 0.177ns 28.42ns 2.25 2 0.0076 48 B
fixedCopyx4 Fixed 0 28.51ns 0.300ns 0.281ns 28.57ns 2.26 2 0.0076 48 B
 
current Span 0 12.90ns 0.126ns 0.118ns 12.87ns 1.00 1 0.0038 24 B
spanCopy Span 0 28.60ns 0.169ns 0.141ns 28.65ns 2.22 2 0.0076 48 B
spanStrCreate Span 0 35.44ns 0.395ns 0.350ns 35.37ns 2.75 3 0.0178 112 B
spanStrCreateAlt Span 0 48.54ns 0.936ns 0.919ns 48.79ns 3.76 4 0.0229 144 B
spanStrCreateUnx4 Span 0 47.61ns 0.984ns 1.171ns 47.65ns 3.70 4 0.0229 144 B
 
current Candidate 1 69.09ns 0.852ns 0.756ns 69.10ns 1.00 4 0.0254 160 B
fastMap Candidate 1 52.83ns 0.654ns 0.612ns 52.60ns 0.76 3 0.0178 112 B
fasterMap Candidate 1 45.80ns 0.482ns 0.451ns 45.85ns 0.66 2 0.0127 80 B
smallMap Candidate 1 24.11ns 0.419ns 0.372ns 24.11ns 0.35 1 0.0076 48 B
 
current Alt 1 67.27ns 0.505ns 0.473ns 67.23ns 1.00 4 0.0254 160 B
whileMap Alt 1 46.82ns 0.499ns 0.467ns 46.75ns 0.70 3 0.0127 80 B
allocateMap Alt 1 43.11ns 0.635ns 0.563ns 42.95ns 0.64 2 0.0127 80 B
unrollMapX2 Alt 1 23.80ns 0.247ns 0.231ns 23.87ns 0.35 1 0.0076 48 B
unrollMapX4 Alt 1 23.59ns 0.148ns 0.138ns 23.64ns 0.35 1 0.0076 48 B
 
current Fixed 1 69.15ns 1.132ns 1.059ns 69.30ns 1.00 3 0.0254 160 B
fixedCopy Fixed 1 30.94ns 0.520ns 0.486ns 31.12ns 0.45 2 0.0076 48 B
fixedCopyx2 Fixed 1 30.36ns 0.396ns 0.371ns 30.20ns 0.44 1 0.0076 48 B
fixedCopyx4 Fixed 1 31.12ns 0.271ns 0.226ns 31.07ns 0.45 2 0.0076 48 B
 
current Span 1 68.61ns 0.356ns 0.333ns 68.62ns 1.00 5 0.0254 160 B
spanCopy Span 1 31.36ns 0.529ns 0.494ns 31.22ns 0.46 1 0.0076 48 B
spanStrCreate Span 1 51.92ns 1.075ns 1.056ns 51.75ns 0.76 2 0.0216 136 B
spanStrCreateAlt Span 1 62.80ns 0.841ns 0.787ns 63.05ns 0.92 3 0.0267 168 B
spanStrCreateUnx4 Span 1 65.55ns 0.716ns 0.670ns 65.73ns 0.96 4 0.0267 168 B

5 and 10 chars

Method Categories Value Mean Error StdDev Median Ratio Rank Gen 0 Allocated
current Candidate 5 98.39ns 0.599ns 0.531ns 98.15ns 1.00 4 0.0280 176 B
fastMap Candidate 5 64.59ns 1.170ns 1.094ns 64.16ns 0.66 3 0.0216 136 B
fasterMap Candidate 5 54.99ns 0.511ns 0.478ns 54.95ns 0.56 1 0.0153 96 B
smallMap Candidate 5 61.75ns 1.051ns 0.983ns 61.69ns 0.63 2 0.0153 96 B
 
current Alt 5 104.15ns 2.107ns 2.342ns 105.23ns 1.00 4 0.0280 176 B
whileMap Alt 5 55.53ns 0.439ns 0.367ns 55.42ns 0.53 2 0.0153 96 B
allocateMap Alt 5 52.50ns 0.750ns 0.701ns 52.49ns 0.50 1 0.0153 96 B
unrollMapX2 Alt 5 55.48ns 0.780ns 0.729ns 55.24ns 0.53 2 0.0153 96 B
unrollMapX4 Alt 5 59.18ns 0.919ns 0.860ns 59.39ns 0.57 3 0.0153 96 B
current Fixed 5 103.54ns 1.900ns 1.866ns 103.56ns 1.00 3 0.0280 176 B
fixedCopy Fixed 5 37.75ns 0.698ns 0.619ns 37.71ns 0.36 1 0.0089 56 B
fixedCopyx2 Fixed 5 37.71ns 0.408ns 0.382ns 37.75ns 0.36 1 0.0089 56 B
fixedCopyx4 Fixed 5 39.36ns 0.488ns 0.407ns 39.38ns 0.38 2 0.0089 56 B
current Span 5 102.48ns 2.076ns 2.977ns 101.57ns 1.00 4 0.0280 176 B
spanCopy Span 5 40.11ns 0.787ns 0.966ns 39.84ns 0.39 1 0.0089 56 B
spanStrCreate Span 5 63.09ns 1.292ns 2.086ns 62.85ns 0.62 2 0.0229 144 B
spanStrCreateAlt Span 5 74.00ns 1.426ns 1.334ns 73.76ns 0.71 3 0.0280 176 B
spanStrCreateUnx4 Span 5 74.88ns 0.924ns 0.864ns 74.88ns 0.72 3 0.0280 176 B
 
 10 chars
current Candidate 10 137.76ns 1.850ns 1.640ns 137.51ns 1.00 4 0.0317 200 B
fastMap Candidate 10 81.41ns 0.725ns 0.678ns 81.29ns 0.59 3 0.0267 168 B
fasterMap Candidate 10 69.10ns 0.469ns 0.438ns 69.18ns 0.50 1 0.0191 120 B
smallMap Candidate 10 75.97ns 0.810ns 0.718ns 76.21ns 0.55 2 0.0191 120 B
 
current Alt 10 135.96ns 1.211ns 1.011ns 135.73ns 1.00 3 0.0317 200 B
whileMap Alt 10 70.25ns 1.253ns 1.110ns 70.04ns 0.52 2 0.0191 120 B
allocateMap Alt 10 67.91ns 0.863ns 0.721ns 67.95ns 0.50 1 0.0191 120 B
unrollMapX2 Alt 10 70.09ns 0.635ns 0.594ns 69.92ns 0.52 2 0.0191 120 B
unrollMapX4 Alt 10 68.89ns 0.695ns 0.650ns 68.92ns 0.51 1 0.0191 120 B
 
current Fixed 10 137.42ns 2.783ns 3.809ns 135.49ns 1.00 3 0.0317 200 B
fixedCopy Fixed 10 47.57ns 0.471ns 0.394ns 47.66ns 0.34 1 0.0114 72 B
fixedCopyx2 Fixed 10 48.22ns 0.301ns 0.282ns 48.20ns 0.35 1 0.0114 72 B
fixedCopyx4 Fixed 10 49.94ns 0.730ns 0.609ns 49.83ns 0.36 2 0.0114 72 B
 
current Span 10 136.76ns 1.311ns 1.226ns 136.86ns 1.00 5 0.0317 200 B
spanCopy Span 10 51.29ns 0.400ns 0.355ns 51.43ns 0.37 1 0.0114 72 B
spanStrCreate Span 10 75.43ns 0.913ns 0.763ns 75.27ns 0.55 2 0.0254 160 B
spanStrCreateAlt Span 10 87.43ns 0.599ns 0.531ns 87.50ns 0.64 4 0.0305 192 B
spanStrCreateUnx4 Span 10 85.80ns 1.017ns 0.901ns 85.81ns 0.63 3 0.0305 192 B

31 and 100 chars

Method Categories Value Mean Error StdDev Median Ratio Rank Gen 0 Allocated
current Candidate 31 296.99ns 5.675ns 12.217ns 294.12ns 1.00 4 0.0443 280 B
fastMap Candidate 31 139.43ns 1.113ns 1.041ns 139.34ns 0.46 3 0.0458 288 B
fasterMap Candidate 31 122.91ns 0.920ns 0.816ns 123.21ns 0.41 1 0.0317 200 B
smallMap Candidate 31 135.52ns 1.314ns 1.165ns 135.52ns 0.45 2 0.0317 200 B
 
current Alt 31 299.02ns 5.659ns 5.558ns 297.41ns 1.00 3 0.0443 280 B
whileMap Alt 31 120.16ns 2.376ns 3.005ns 117.99ns 0.41 1 0.0317 200 B
allocateMap Alt 31 129.14ns 0.867ns 0.811ns 129.41ns 0.43 2 0.0317 200 B
unrollMapX2 Alt 31 118.54ns 1.121ns 0.994ns 118.43ns 0.40 1 0.0317 200 B
unrollMapX4 Alt 31 119.67ns 1.613ns 1.430ns 119.46ns 0.40 1 0.0317 200 B
 
current Fixed 31 292.63ns 3.647ns 3.412ns 291.12ns 1.00 4 0.0443 280 B
fixedCopy Fixed 31 90.22ns 0.522ns 0.488ns 90.32ns 0.31 1 0.0178 112 B
fixedCopyx2 Fixed 31 92.90ns 0.598ns 0.559ns 92.94ns 0.32 2 0.0178 112 B
fixedCopyx4 Fixed 31 100.58ns 1.755ns 1.642ns 100.22ns 0.34 3 0.0178 112 B
 
current Span 31 295.64ns 1.518ns 1.267ns 295.66ns 1.00 5 0.0443 280 B
spanCopy Span 31 101.58ns 1.860ns 1.826ns 101.79ns 0.34 1 0.0178 112 B
spanStrCreate Span 31 132.14ns 2.560ns 2.395ns 132.83ns 0.45 2 0.0317 200 B
spanStrCreateAlt Span 31 146.46ns 2.965ns 3.530ns 147.03ns 0.49 4 0.0370 232 B
spanStrCreateUnx4 Span 31 140.53ns 2.264ns 2.007ns 141.06ns 0.47 3 0.0370 232 B
 
 100 chars
current Candidate 100 796.34ns 15.934ns 14.125ns 788.09ns 1.00 3 0.0877 552 B
fastMap Candidate 100 347.12ns 5.982ns 7.986ns 345.38ns 0.44 2 0.1106 696 B
fasterMap Candidate 100 311.80ns 1.806ns 1.601ns 312.26ns 0.39 1 0.0749 472 B
smallMap Candidate 100 344.51ns 2.428ns 2.027ns 345.05ns 0.43 2 0.0749 472 B
 
current Alt 100 802.38ns 14.461ns 13.527ns 803.31ns 1.00 5 0.0877 552 B
whileMap Alt 100 310.82ns 3.155ns 2.951ns 310.92ns 0.39 3 0.0749 472 B
allocateMap Alt 100 334.36ns 1.520ns 1.421ns 334.56ns 0.42 4 0.0749 472 B
unrollMapX2 Alt 100 283.68ns 1.994ns 1.865ns 284.27ns 0.35 2 0.0749 472 B
unrollMapX4 Alt 100 276.16ns 3.283ns 2.910ns 276.29ns 0.34 1 0.0749 472 B
 
current Fixed 100 804.66ns 5.603ns 4.679ns 805.33ns 1.00 4 0.0877 552 B
fixedCopy Fixed 100 226.07ns 1.051ns 0.877ns 226.04ns 0.28 1 0.0393 248 B
fixedCopyx2 Fixed 100 234.08ns 2.609ns 2.313ns 234.07ns 0.29 2 0.0391 248 B
fixedCopyx4 Fixed 100 245.54ns 2.454ns 2.175ns 245.17ns 0.31 3 0.0391 248 B
 
current Span 100 804.95ns 8.385ns 7.844ns 805.59ns 1.00 5 0.0877 552 B
spanCopy Span 100 257.32ns 1.536ns 1.283ns 257.22ns 0.32 1 0.0391 248 B
spanStrCreate Span 100 315.95ns 1.891ns 1.677ns 315.88ns 0.39 3 0.0534 336 B
spanStrCreateAlt Span 100 344.50ns 6.737ns 6.302ns 342.70ns 0.43 4 0.0587 368 B
spanStrCreateUnx4 Span 100 294.39ns 2.441ns 2.283ns 293.91ns 0.37 2 0.0587 368 B

256 and more chars

Method Categories Value Mean Error StdDev Median Ratio Rank Gen 0 Gen 1 Allocated
current Candidate 256 1,952.41ns 11.159ns 9.892ns 1,951.66ns 1.00 3 0.1869 - 1176 B
fastMap Candidate 256 800.65ns 7.703ns 6.829ns 802.18ns 0.41 2 0.2594 - 1632 B
fasterMap Candidate 256 739.04ns 9.457ns 8.847ns 735.24ns 0.38 1 0.1745 - 1096 B
smallMap Candidate 256 811.24ns 3.172ns 2.967ns 811.42ns 0.42 2 0.1745 - 1096 B
 
current Alt 256 1,950.68ns 16.943ns 15.849ns 1,948.46ns 1.00 5 0.1869 - 1176 B
whileMap Alt 256 730.78ns 7.339ns 6.506ns 731.13ns 0.37 3 0.1745 - 1096 B
allocateMap Alt 256 791.70ns 7.619ns 5.949ns 790.93ns 0.41 4 0.1745 - 1096 B
unrollMapX2 Alt 256 663.65ns 3.941ns 3.687ns 663.96ns 0.34 2 0.1745 - 1096 B
unrollMapX4 Alt 256 641.95ns 6.201ns 5.497ns 640.22ns 0.33 1 0.1745 - 1096 B
 
current Fixed 256 1,975.77ns 24.444ns 22.865ns 1,977.10ns 1.00 4 0.1869 - 1176 B
fixedCopy Fixed 256 551.65ns 7.620ns 7.127ns 549.79ns 0.28 1 0.0887 - 560 B
fixedCopyx2 Fixed 256 569.96ns 5.247ns 4.908ns 568.99ns 0.29 2 0.0887 - 560 B
fixedCopyx4 Fixed 256 589.82ns 4.786ns 3.997ns 590.19ns 0.30 3 0.0887 - 560 B
 
current Span 256 1,969.66ns 19.996ns 18.705ns 1,970.26ns 1.00 5 0.1869 - 1176 B
spanCopy Span 256 632.78ns 6.331ns 5.922ns 634.22ns 0.32 1 0.0887 - 560 B
spanStrCreate Span 256 732.16ns 5.446ns 5.094ns 733.44ns 0.37 3 0.1030 - 648 B
spanStrCreateAlt Span 256 772.54ns 5.266ns 4.668ns 773.87ns 0.39 4 0.1078 - 680 B
spanStrCreateUnx4 Span 256 665.30ns 10.455ns 9.779ns 665.44ns 0.34 2 0.1078 - 680 B
 
 1024 chars
current Candidate 1024 7,629.39ns 60.810ns 53.907ns 7,625.62ns 1.00 3 0.6714 - 4248 B
fastMap Candidate 1024 3,082.48ns 26.745ns 25.017ns 3,070.76ns 0.40 2 0.9918 0.0038 6240 B
fasterMap Candidate 1024 2,815.90ns 11.005ns 8.592ns 2,815.05ns 0.37 1 0.6638 - 4168 B
smallMap Candidate 1024 2,804.66ns 35.226ns 31.227ns 2,801.53ns 0.37 1 0.6638 - 4168 B
 
current Alt 1024 7,653.67ns 66.666ns 62.360ns 7,651.03ns 1.00 5 0.6714 - 4248 B
whileMap Alt 1024 2,820.49ns 16.317ns 15.263ns 2,820.79ns 0.37 3 0.6638 - 4168 B
allocateMap Alt 1024 3,046.43ns 17.272ns 14.423ns 3,047.76ns 0.40 4 0.6638 - 4168 B
unrollMapX2 Alt 1024 2,690.30ns 26.892ns 22.456ns 2,687.61ns 0.35 2 0.6638 - 4168 B
unrollMapX4 Alt 1024 2,530.67ns 14.308ns 13.383ns 2,531.89ns 0.33 1 0.6638 - 4168 B
 
current Fixed 1024 7,647.49ns 51.204ns 47.897ns 7,635.90ns 1.00 4 0.6714 - 4248 B
fixedCopy Fixed 1024 2,121.33ns 29.708ns 27.789ns 2,118.47ns 0.28 1 0.3319 - 2096 B
fixedCopyx2 Fixed 1024 2,164.05ns 11.677ns 9.117ns 2,164.54ns 0.28 2 0.3319 - 2096 B
fixedCopyx4 Fixed 1024 2,263.88ns 23.129ns 20.504ns 2,255.20ns 0.30 3 0.3319 - 2096 B
 
current Span 1024 7,669.55ns 67.241ns 62.898ns 7,679.77ns 1.00 4 0.6714 - 4248 B
spanCopy Span 1024 2,397.11ns 16.097ns 14.269ns 2,398.97ns 0.31 1 0.3319 - 2096 B
spanStrCreate Span 1024 2,814.25ns 43.848ns 41.016ns 2,815.72ns 0.37 3 0.3471 - 2184 B
spanStrCreateAlt Span 1024 2,865.30ns 22.238ns 20.801ns 2,866.39ns 0.37 3 0.3510 - 2216 B
spanStrCreateUnx4 Span 1024 2,448.91ns 35.363ns 31.348ns 2,452.51ns 0.32 2 0.3510 - 2216 B
 
 8192 chars
current Candidate 8192 61,304.72ns 350.006ns 327.396ns 61,285.96ns 1.00 3 5.1270 0.2441 32920 B
fastMap Candidate 8192 25,785.01ns 305.116ns 285.405ns 25,831.80ns 0.42 2 7.8125 0.3052 49248 B
fasterMap Candidate 8192 23,305.23ns 121.598ns 113.743ns 23,323.38ns 0.38 1 5.2185 0.1526 32840 B
smallMap Candidate 8192 23,291.14ns 125.680ns 117.561ns 23,337.55ns 0.38 1 5.2185 0.1526 32840 B
 
current Alt 8192 61,967.77ns 536.539ns 501.879ns 61,816.82ns 1.00 5 5.1270 0.2441 32920 B
whileMap Alt 8192 23,592.00ns 328.376ns 307.163ns 23,607.09ns 0.38 3 5.2185 0.1526 32840 B
allocateMap Alt 8192 24,950.07ns 201.715ns 178.815ns 24,959.22ns 0.40 4 5.2185 0.1526 32840 B
unrollMapX2 Alt 8192 22,003.28ns 116.477ns 97.264ns 22,005.69ns 0.36 2 5.2185 0.1526 32840 B
unrollMapX4 Alt 8192 21,074.46ns 333.249ns 295.416ns 21,083.36ns 0.34 1 5.2185 0.1526 32840 B
 
current Fixed 8192 61,401.46ns 508.529ns 450.797ns 61,337.03ns 1.00 4 5.1270 0.2441 32920 B
fixedCopy Fixed 8192 16,758.64ns 118.942ns 111.258ns 16,766.96ns 0.27 1 2.5940 - 16432 B
fixedCopyx2 Fixed 8192 17,569.22ns 151.468ns 141.683ns 17,581.75ns 0.29 2 2.5940 - 16432 B
fixedCopyx4 Fixed 8192 18,269.65ns 119.090ns 105.570ns 18,253.54ns 0.30 3 2.5940 - 16432 B
 
current Span 8192 62,410.59ns 785.822ns 735.058ns 62,375.61ns 1.00 4 5.1270 0.2441 32920 B
spanCopy Span 8192 19,228.95ns 113.409ns 94.702ns 19,240.35ns 0.31 1 2.5940 - 16432 B
spanStrCreate Span 8192 22,603.78ns 168.023ns 157.169ns 22,597.73ns 0.36 3 2.6245 - 16520 B
spanStrCreateAlt Span 8192 21,961.11ns 321.794ns 301.006ns 21,942.30ns 0.35 2 2.6245 - 16552 B
spanStrCreateUnx4 Span 8192 18,994.96ns 241.445ns 214.034ns 18,928.00ns 0.30 1 2.6245 - 16552 B

The code:

// Note: the snippets below assume these opens; #nowarn "9" is needed for the NativePtr/fixed code in StringMapUnsafe.
open System
open System.Runtime.InteropServices
open Microsoft.FSharp.NativeInterop

#nowarn "9"

module StringMap =

    /// Basic String.map replacement, Array.map over CharArray
    /// dotnet mem: 6x(!) string size
    let fastMap mapper str =
        if String.IsNullOrEmpty str then String.Empty
        else
            str.ToCharArray()
            |> Array.map mapper
            |> String

    /// A faster String.map replacement, inlining the loop instead of Array.map. Uses CharArray.
    /// NOTES: used as baseline for StringMapUnsafe
    /// dotnet mem: 4x string size
    let fasterMap mapper str =
        if String.IsNullOrEmpty str then String.Empty
        else
            let result = str.ToCharArray()
            let mutable i = 0
            for c in result do
                result.[i] <- mapper c
                i <- i + 1

            String result

    /// Instead of for..in uses while. Otherwise same as fasterMap
    /// dotnet mem: 4x string size
    let whileMap mapper str =
        if String.IsNullOrEmpty str then String.Empty
        else
            let result = str.ToCharArray()
            let mutable i = 0
            while i < result.Length do
                result.[i] <- mapper result.[i]
                i <- i + 1

            String result

    /// Like fasterMap, but uses Array.zeroCreate with string-length
    /// dotnet mem: 4x string data
    let allocateMap mapper str =
        if String.IsNullOrEmpty str then String.Empty
        else
            let result = Array.zeroCreate str.Length
            let mutable i = 0
            for c in str do
                result.[i] <- mapper c
                i <- i + 1

            String result

    /// String.map equiv. optimized for very small maps. The break-even point is around 4 chars.
    let smallMap (mapper: char -> char) (str: string) =
        let len = String.length str
        if len > 4 then
            let result = str.ToCharArray()
            let mutable i = 0
            for c in result do
                result.[i] <- mapper c
                i <- i + 1

            String result

        elif len > 2 then
            if len = 3 
            then String [|mapper str.[0]; mapper str.[1]; mapper str.[2]|]
            else String [|mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3]|]

        elif len = 2 then
            String [|mapper str.[0]; mapper str.[1]|]

        elif len = 1 then
            (mapper str.[0]).ToString()

        else
            String.Empty


    /// String.map equiv. that uses safe unrolling. This is only 2-5% faster than String.fasterMap above. It suffers from non-elimination of range-checks.
    let unrollMapx2 (mapper: char -> char) (str: string) =
        let len = String.length str

        if len > 4 then
            let result = str.ToCharArray()
            let mutable i = 0
            while i < len - len % 2 do
                result.[i] <- mapper result.[i]
                result.[i + 1] <- mapper result.[i + 1]
                i <- i + 2

            // the remainder
            if len % 2 = 1 then
                result.[i] <- mapper result.[i]
                //i <- i + 1

            String result

        elif len > 2 then
            if len = 3 
            then String [|mapper str.[0]; mapper str.[1]; mapper str.[2]|]
            else String [|mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3]|]

        elif len = 2 then
            String [|mapper str.[0]; mapper str.[1]|]

        elif len = 1 then
            (mapper str.[0]).ToString()

        else
            String.Empty


    /// String.map equiv. that uses safe unrolling. This is only 2-5% faster than String.fasterMap above. 
    /// It suffers from non-elimination of range-checks.
    let unrollMapx4 (mapper: char -> char) (str: string) =
        let len = String.length str

        if len > 4 then
            let result = str.ToCharArray()
            let mutable i = 0
            while i < len - len % 4 do
                result.[i] <- mapper result.[i]
                result.[i + 1] <- mapper result.[i + 1]
                result.[i + 2] <- mapper result.[i + 2]
                result.[i + 3] <- mapper result.[i + 3]
                i <- i + 4

            // the remainder
            while i < len do
                result.[i] <- mapper result.[i]
                i <- i + 1

            String result

        elif len > 2 then
            if len = 3 
            then String [|mapper str.[0]; mapper str.[1]; mapper str.[2]|]
            else String [|mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3]|]

        elif len = 2 then
            String [|mapper str.[0]; mapper str.[1]|]

        elif len = 1 then
            (mapper str.[0]).ToString()

        else
            String.Empty


module StringMapUnsafe =
    /// Unrollx2. Read from the source as str.[i], use NativePtr for the result, which is created with String.Copy + fixed; return the modified string
    let fixedCopyx2 (mapper: char -> char) (str: string) =

        let len = String.length str
        let result = String.Copy str
        use shadow = fixed result
        let mutable i = 0
        while i < len - len % 2 do
            NativePtr.set shadow i (mapper str.[i])
            NativePtr.set shadow (i + 1) (mapper str.[i + 1])
            i <- i + 2

        while i < len do
            NativePtr.set shadow i (mapper str.[i])
            i <- i + 1

        result

    /// Unrollx4. Read from the source as str.[i], use NativePtr for the result, which is created with String.Copy + fixed; return the modified string
    let fixedCopyx4 (mapper: char -> char) (str: string) =
        let len = String.length str
        
        let result = String.Copy str
        use shadow = fixed result
        let mutable i = 0
        while i < len - len % 4 do
            NativePtr.set shadow i (mapper str.[i])
            NativePtr.set shadow (i + 1) (mapper str.[i + 1])
            NativePtr.set shadow (i + 2) (mapper str.[i + 2])
            NativePtr.set shadow (i + 3) (mapper str.[i + 3])
            i <- i + 4

        while i < len do
            NativePtr.set shadow i (mapper str.[i])
            i <- i + 1

        result

    /// Unrollx4. Create copy with String.Copy + fixed and use NativePtr to read and write from it, return modified string.
    let fixedCopy (mapper: char -> char) (str: string) =
        let len = String.length str

        /// a copy to use as the result (we don't want to overwrite the input string!)
        let result = String.Copy str

        /// pointer to the result-string
        use shadow = fixed result

        // loop-unroll
        let mutable i = 0
        while i < len - len % 4 do
            // NOTES:
            // - reading/writing the same memory address appears to be somewhat faster than reading from the string by index, as with str.[i]
            // - although fewer IL instructions are generated by calculating the address ourselves and then using NativePtr.read/write, the approach used here appears to be faster (and safer).
            NativePtr.set shadow i (mapper (NativePtr.get shadow i))
            NativePtr.set shadow (i + 1) (mapper (NativePtr.get shadow (i + 1)))
            NativePtr.set shadow (i + 2) (mapper (NativePtr.get shadow (i + 2)))
            NativePtr.set shadow (i + 3) (mapper (NativePtr.get shadow (i + 3)))
            i <- i + 4

        // process the remainder
        while i < len do
            NativePtr.set shadow i (mapper (NativePtr.get shadow i))
            i <- i + 1

        result

    /// Unrollx4. Create copy with String.Copy, ReadOnlySpan over it, marshal-as read/write Span<char> over same copy. Return modified underlying string.
    /// Performance: ~20-30% faster than fasterMap.
    let spanCopy (mapper: char -> char) (str: string) =
        let len = String.length str

        let resultStr = String.Copy str
        let roSpan = resultStr.AsSpan()   // span over the copy; writing via the marshalled span below must not touch the original str
        let writeSpan = MemoryMarshal.CreateSpan(&MemoryMarshal.GetReference roSpan, len)
        
        // loop-unroll
        let mutable i = 0
        while i < len - len % 4 do
            writeSpan.[i] <- mapper roSpan.[i]
            writeSpan.[i + 1] <- mapper roSpan.[i + 1]
            writeSpan.[i + 2] <- mapper roSpan.[i + 2]
            writeSpan.[i + 3] <- mapper roSpan.[i + 3]
            i <- i + 4

        // process the remainder
        while i < len do
            writeSpan.[i] <- mapper roSpan.[i]
            i <- i + 1

        resultStr

    /// No unroll, uses String.Create with Span.
    /// Performance: 0-20% faster than fasterMap, but fast only on larger (64kb) input
    let spanCreate1 (mapper: char -> char) (str: string) =

        let len = String.length str
        String.Create(len, str, fun buf v -> 
            let mutable i = 0
            for c in v do
                buf.[i] <- mapper c
                i <- i + 1)

    /// No unroll, uses String.Create with Span, no closure
    /// Performance: 0-22% faster than fasterMap; the larger the input, the bigger the gain (sometimes slower!)
    let spanCreate2 (mapper: char -> char) (str: string) =

        let len = String.length str
        String.Create(len, (str, mapper), fun buf (str, f) -> 
            let mutable i = 0
            for c in str do
                buf.[i] <- f c
                i <- i + 1)

    /// Unrollx4, uses String.Create with Span. All indexed operations on spans.
    /// Performance: 15-32% (!) faster than fasterMap, the larger the input, the bigger the gain
    let spanCreateUnrollx4 (mapper: char -> char) (str: string) =

        let len = String.length str
        String.Create(len, (str, mapper), fun target (str, f) -> 
            let mutable i = 0
            let source = str.AsSpan()
            while i < len - len % 4 do
                target.[i] <- f source.[i]
                target.[i + 1] <- f source.[i + 1]
                target.[i + 2] <- f source.[i + 2]
                target.[i + 3] <- f source.[i + 3]
                i <- i + 4

            // process the remainder
            while i < len do
                target.[i] <- f source.[i]
                i <- i + 1)

abelbraaksma (Contributor, Author) commented:

@dsyme or @cartermp, can I get a confirmation that using fixed and pointer arithmetic is off limits (or not)? If it's allowed, I'd gladly use it, as it gives a dramatic reduction in memory usage and increases performance, but obviously it requires more careful coding and it will automatically mark certain functions as unsafe, which I think is a "no go". Also, it may be considered fine on arrays but not on strings; I'd totally understand that.

There are already dramatic perf improvements using other methods, but fixed trumps all. And before I get to the array functions, I'd like to limit my scope (consider the above an exercise in what's possible).

I know Span cannot be used, otherwise that would be a good alternative and gets close to fixed in terms of memory and speed.

cartermp (Contributor) commented Jun 9, 2020

Yes, spans are off-limits (unless we decide to move FSharp.Core to also bring in a dependency on System.Memory). I'd actually love to do that but I know that @dsyme is quite allergic to them.

I think we should strive for consistency in implementation here. Array and list internals tend not to use pointers to accomplish things. I'd like for us to eke out as much perf and memory as possible if we can, but I think that ultimately the best way forward is with Span, when it's appropriate to add it to FSharp.Core. It strikes the right balance of perf, readability, and safety. So the second-best implementation is probably sufficient, and a clear upgrade anyway.

abelbraaksma (Contributor, Author) commented:

> unless we decide to move FSharp.Core to also bring in a dependency on System.Memory

If, for ImmutableArray, we decide to add that dependency, then I believe that includes System.Memory:

[screenshot: package dependency list]

(which in turn gives us also System.Buffers and System.Numerics.Vectors, each of which open new ways of potentially improving certain key performance areas)

> Array and list internals tend not to use pointers to accomplish things.
> <snip/> but I think that ultimately the best way forward is with Span, when it's appropriate to add it to FSharp.Core. It strikes the right balance of perf, readability, and safety.

Agreed. Yeah, Span would give us wings in some areas :).

Ok, I'll move forward without the "extreme" optimizations then.

abelbraaksma (Contributor, Author) commented Jun 9, 2020

Next up: String.concat. This one does nothing more than call into String.Join, but since it takes a sequence for the concatenation and doesn't special-case the empty separator string, I figured: let's see what happens if I special-case for array and list. (If anyone knows of other types that should be special-cased, please let me know. For instance, I wonder if there's a safe way for library functions to get at the underlying array or list when the sequence is created with Seq.ofArray or Seq.ofList, so that people won't get punished for using that instead of casting.)

TLDR conclusions:

  • Code changes for this one are simple and straightforward
  • Speedup of up to ~50% for arrays and up to ~20% for lists. Seq stays the same.
  • Mem decrease: almost 60% less mem use for arrays, and ~40% for lists
  • GC pressure drops by up to 50%, except for very small lists and arrays

I think these numbers are to be expected, considering that String.Join uses an optimized, mutable string under the hood with a pre-calculated size for Concat and when an array is passed, whereas String.Join over an IEnumerable uses a more resource-hungry StringBuilder.

Below: bold is fastest. I didn't select a champion if the results were within the margin of error.

BenchmarkDotNet=v0.12.1, OS=Windows 7 SP1 (6.1.7601.0)
Intel Xeon CPU X5680 3.33GHz, 2 CPU, 12 logical and 12 physical cores
Frequency=3247119 Hz, Resolution=307.9653 ns, Timer=TSC
.NET Core SDK=3.1.300-preview-015135
  [Host]     : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT DEBUG
  Job-FRSYQM : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT

MaxRelativeError=0.02  Runtime=.NET Core 3.1  MaxIterationCount=10  
MaxWarmupIterationCount=5  MinIterationCount=5  MinWarmupIterationCount=2  
Method Sep CollSize CollType Mean Error StdDev Ratio Rank Gen 0 Gen 1 Allocated
current Small Array 484.4ns 7.61ns 1.98ns 1.00 2 0.0467 - 296 B
concatOpt Small Array 256.3ns 4.80ns 1.25ns 0.53 1 0.0420 - 264 B
 
current Small List 554.0ns 8.99ns 3.21ns 1.00 2 0.0477 - 304 B
concatOpt Small List 391.7ns 7.23ns 2.58ns 0.71 1 0.0710 - 448 B
 
current Small Seq 541.7ns 10.36ns 7.49ns 1.00 1 0.0467 - 296 B
concatOpt Small Seq 556.4ns 9.02ns 2.34ns 1.02 2 0.0467 - 296 B
 
current Medium Array 10,715.8ns 148.95ns 38.68ns 1.00 2 2.1667 0.0916 13656 B
concatOpt Medium Array 4,674.7ns 92.60ns 72.29ns 0.43 1 0.7553 0.0076 4784 B
 
current Medium List 12,422.7ns 188.90ns 29.23ns 1.00 2 2.1667 0.0916 13664 B
concatOpt Medium List 7,951.7ns 152.82ns 175.98ns 0.64 1 1.2665 0.0153 8008 B
 
current Medium Seq 12,905.6ns 435.52ns 651.86ns 1.00 2 2.1667 0.0916 13656 B
concatOpt Medium Seq 11,461.7ns 212.56ns 94.38ns 0.88 1 2.1667 0.0916 13656 B
 
current &|& Small Array 594.5ns 11.54ns 14.17ns 1.00 2 0.0648 - 408 B
concatOpt &|& Small Array 402.5ns 7.20ns 2.57ns 0.69 1 0.0596 - 376 B
 
current &|& Small List 654.4ns 12.02ns 8.69ns 1.00 2 0.0658 - 416 B
concatOpt &|& Small List 524.0ns 9.51ns 4.22ns 0.80 1 0.0887 - 560 B
 
current &|& Small Seq 624.0ns 12.35ns 12.13ns 1.00 1 0.0648 - 408 B
concatOpt &|& Small Seq 626.8ns 11.61ns 9.70ns 1.01 1 0.0648 - 408 B
 
current &|& Medium Array 13,245.0ns 255.18ns 133.47ns 1.00 2 2.5482 0.1373 16048 B
concatOpt &|& Medium Array 7,656.2ns 127.40ns 66.63ns 0.58 1 1.1368 0.0229 7176 B
 
current &|& Medium List 14,149.6ns 220.37ns 78.59ns 1.00 2 2.5482 0.1373 16056 B
concatOpt &|& Medium List 10,824.5ns 183.83ns 81.62ns 0.77 1 1.6479 0.0458 10400 B
 
current &|& Medium Seq 13,809.2ns 258.95ns 154.10ns 1.00 1 2.5482 0.1373 16048 B
concatOpt &|& Medium Seq 15,213.7ns 565.78ns 846.83ns 1.07 2 2.5330 0.1221 16048 B

Big inputs, 819200 items, 4874240 chars total, ~9.3MB

Method Sep CollType Mean Error StdDev Ratio Rank Gen 0 Gen 1 Gen 2 Allocated
current Array 27.00ms 0.351ms 0.091ms 1.00 2 1656.2500 718.7500 125.0000 18.65 MB
concatOpt Array 10.27ms 0.155ms 0.069ms 0.38 1 - - - 9.3 MB
 
current List 32.12ms 0.201ms 0.052ms 1.00 2 1656.2500 718.7500 125.0000 18.65 MB
concatOpt List 22.57ms 0.682ms 0.357ms 0.71 1 - - - 15.55 MB
 
current Seq 29.01ms 0.140ms 0.036ms 1.00 1 1656.2500 718.7500 125.0000 18.65 MB
concatOpt Seq 30.62ms 0.600ms 0.314ms 1.06 2 62.5000 62.5000 62.5000 31.55 MB
 
current &|& Array 32.07ms 0.707ms 0.109ms 1.00 2 2250.0000 875.0000 187.5000 24.92 MB
concatOpt &|& Array 16.51ms 0.259ms 0.115ms 0.51 1 - - - 12.42 MB
 
current &|& List 36.85ms 0.505ms 0.131ms 1.00 2 2285.7143 928.5714 214.2857 24.92 MB
concatOpt &|& List 28.60ms 0.564ms 0.250ms 0.78 1 - - - 18.67 MB
 
current &|& Seq 33.47ms 0.580ms 0.151ms 1.00 1 2266.6667 933.3333 200.0000 24.92 MB
concatOpt &|& Seq 33.34ms 0.453ms 0.118ms 1.00 1 2266.6667 933.3333 200.0000 24.92 MB

The source code:

let private concatArray sep (strings: string []) =
    match String.length sep with
    | 0 -> String.Concat strings
    | 1 -> String.Join(sep.[0], strings, 0, strings.Length)
    | _ -> String.Join(sep, strings, 0, strings.Length)

let concatOpt sep (strings: seq<string>) =
    match strings with
    | :? array<string> as arr -> concatArray sep arr
    | :? list<string> as lst -> lst |> List.toArray |> concatArray sep
    | _ ->
        // special-casing Seq for sep = String.Empty is *not* faster due to the involved 
        // overhead of Seq.toArray and the already well-optimized String.Join
        String.Join(sep, strings)

abelbraaksma (Contributor, Author) commented Jun 11, 2020

I think using screenshots is a little bit easier to read; those tables take up too much space horizontally and vertically, which makes them hard to read.

String.iter, unrolled

There was little that could be improved for String.iter, but unrolling showed a ~25% perf gain:
[screenshot: benchmark results]
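
The unrolled variant itself isn't shown in the screenshot; following the same pattern as unrollMapx4 above, it could look roughly like this (a sketch, not necessarily the exact code that was measured):

/// Sketch of a 4x-unrolled String.iter (same pattern as unrollMapx4 above)
let iterUnrolled (action: char -> unit) (str: string) =
    let len = String.length str
    let mutable i = 0
    while i < len - len % 4 do
        action str.[i]
        action str.[i + 1]
        action str.[i + 2]
        action str.[i + 3]
        i <- i + 4

    // the remainder
    while i < len do
        action str.[i]
        i <- i + 1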

String.iteri, unrolled

Basically same story:

[screenshot: benchmark results]

String.mapi

Performance characteristics were, unsurprisingly, much the same as for String.map above. An easy gain of 2x performance, and a reduction in mem and GC pressure.

Real advancement can be made here with fixed, or with Span when the time comes. Just like with String.map, unrolling had limited effect, though it was more significant than for String.map. I suggest taking the unrollx2 variant (x4 and x8 differed by only ~2%, not worth the effort). This would lead to a 2.5x performance gain.

[screenshot: benchmark results]
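
For reference, the unrollx2 variant of String.mapi suggested above could look roughly like this (a sketch following the unrollMapx2 pattern from the String.map experiments, not necessarily the exact benchmarked code):

/// Sketch of a 2x-unrolled String.mapi over a char array
let mapiUnrollx2 (mapper: int -> char -> char) (str: string) =
    if String.IsNullOrEmpty str then String.Empty
    else
        let result = str.ToCharArray()
        let len = result.Length
        let mutable i = 0
        while i < len - len % 2 do
            result.[i] <- mapper i result.[i]
            result.[i + 1] <- mapper (i + 1) result.[i + 1]
            i <- i + 2

        // the remainder
        if len % 2 = 1 then
            result.[i] <- mapper i result.[i]

        String result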

String.filter

Since Array.filter is highly optimized with a bit-mask method, I decided to try an approach where I simply use Array.filter on the string's characters. This improved perf dramatically, by some 1.7-2.5x, depending on whether the predicate returns true for "all" characters, "some", or "none" (false for all).

The buildAndCut method pre-allocates the target array with Array.zeroCreate and uses a String constructor overload to create a new string from only the filtered characters. I thought this would cause more mem pressure, but the result was better in both mem and speed.

let filterByArray predicate (str: string) =
    if String.IsNullOrEmpty str then String.Empty
    else
        str.ToCharArray()
        |> Array.filter predicate
        |> String

let filterBuildAndCut predicate (str: string) =
    let len = String.length str

    if len = 0 then String.Empty
    else
        let target = Array.zeroCreate len
        let mutable i = 0
        for c in str do
            if predicate c then 
                target.[i] <- c
                i <- i + 1

        String(target, 0, i)

[screenshot: benchmark results]

Still TODO:

  • String.collect
  • String.init
  • String.replicate
  • String.forall
  • String.exists
  • String.length

abelbraaksma (Contributor, Author) commented Jun 14, 2020

Next set:

String.length

At first it seems not much can be done here, but on closer inspection it calls String.get_Length twice on the happy path. Changing String.IsNullOrEmpty to isNull gives a 33% perf improvement. Since this is such a small function, it'll only show up when used in tight loops. Anyway, it's as simple as changing the current impl. to this:

let nullOnly (str:string) =
    if isNull str then 0
    else str.Length
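
For comparison, the existing implementation has roughly the following shape (a sketch of the pattern described above, not a verbatim copy of the FSharp.Core source): String.IsNullOrEmpty already reads Length for the emptiness test, and then Length is read again for the result, which is where the two get_Length calls come from.

// roughly the current shape (sketch, not verbatim FSharp.Core source)
let current (str: string) =
    if String.IsNullOrEmpty str then 0
    else str.Length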

Run in a loop to make it measurable, adjusted with OperationsPerInvoke:
[screenshot: benchmark results]

(Note that this is not a pure measurement, as there's auxiliary code present, but the only difference between the two is the body of the String.length function.)

It's interesting to take a peek at the disassembly here. If we zoom in on the body of the loop, which is inlined by the JIT, this is what the original String.length gives us. As can be seen, it does two pointer dereferences to calculate the length:

; FSharp.Perf.BenchLength.current()
       sub       rsp,28
       mov       ecx,[rcx+8]
       call      FSharp.Perf.Data.get(Int32)
       xor       edx,edx
M00_L00:  // loop body, i.e., String.length
       test      rax,rax              // rax & rax, in other words: null-check, sets flags
       je        short M00_L01        // if zero (null), skip the length read
       cmp       dword ptr [rax+8],0  // compare string-length to 0
       jbe       short M00_L01        // if zero, skip the length read
       mov       ecx,[rax+8]          // otherwise, read the string length and fall through
M00_L01:
       inc       edx
       cmp       edx,2711
       jl        short M00_L00
       add       rsp,28
       ret
; Total bytes of code 43

When we compare that to the disassembly where we only use isNull, the extra pointer dereference is gone completely and we have only three short instructions left in the body:

; FSharp.Perf.BenchLength.nullOnly()
       sub       rsp,28
       mov       ecx,[rcx+8]
       call      FSharp.Perf.Data.get(Int32)
       xor       edx,edx
M00_L00:  // loop body, i.e., nullOnly method
       test      rax,rax        // rax & rax, for null-check, sets flags
       je        short M00_L01  // if zero (null), skip the length read
       mov       ecx,[rax+8]    // read the string-length from the string and fall through
M00_L01:
       inc       edx
       cmp       edx,2711
       jl        short M00_L00
       add       rsp,28
       ret
; Total bytes of code 37

As you can see, we are down from five assembly instructions (incl two pointer dereferences) to only three assembly instructions.

Maybe a bit much of an analysis for such a small function, but it was fun to check what goes on under the hood. Note that the assembly can look different in other scenarios because of JIT optimizations.

It's also good to find out that the JIT does not do an extra null-check for the callvirt IL instruction, as it already appears to know that the reference cannot be null at that point. In fact, the whole call to get_Length disappears and essentially boils down to a single mov ecx, [rax+8].

String.exists && String.forall

This function takes a predicate that operates on a char. There's no direct equivalent in the .NET stdlibs. I've tried several alternative ways of coding this, esp. since at first sight it seems the original is not tail-recursive, but the recursion is recognized by the JIT anyway and it becomes an efficient loop.

The same is true for String.forall.

The only change to make here is removing the redundant call to get_Length, which only has an impact when searching very short strings.
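
As an illustration, reading the length once outside the loop instead of on every iteration could look like this (a sketch of the change described, not the exact FSharp.Core code):

/// Sketch: String.exists with get_Length called only once
let existsOpt (predicate: char -> bool) (str: string) =
    if isNull str then false
    else
        let len = str.Length  // read the length once, outside the loop
        let rec check i =
            i < len && (predicate str.[i] || check (i + 1))
        check 0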

tbc...

abelbraaksma (Contributor, Author) commented Jun 15, 2020

Next up:

String.replicate

There are some serious optimizations possible here, esp. if we were to open up the use of cpblk or initblk (they've been optimized, see dotnet/coreclr#7198; we'll maybe get there when we start on the array functions), but for this round I'll stick to optimizing with a simple Array.Copy, which internally calls an optimized native method already.

This gives us between 1.2x and 8x (!) performance improvements. Occasionally, the Array.Copy method can be a few percent slower, but this seems to be influenced by the LOH when doing this on really large strings and counts. Since the general case is likely small strings, this greatly improves the general case.

The number of iterations and memcopies is now O(log n) instead of O(n), and there's no redundant memory allocated (other than for the required copy of the string at the end).

The optimization in code:

    let replicateOpt count (str:string) =
        if isNull str then
            String.Empty
        else
            let len = str.Length
            let target = Array.zeroCreate (len * count)
            let source = str.ToCharArray()
            Array.Copy(source, 0, target, 0, len)
            let mutable i = len
            while i * 2 < target.Length do
                Array.Copy(target, 0, target, i, i)
                i <- i * 2

            Array.Copy(target, 0, target, i, target.Length - i)
            String target

Speed:
[screenshot: speed results]

Memory:
[screenshot: memory results]

String.init

This one proves to be tricky to optimize, as the performance distribution of the new array-based version and of the existing StringBuilder-based implementation is not uniform, for either memory or speed. This makes comparison hard. When count (the number of iterations) is small, the new version is the clear winner, but at larger counts, esp. if the returned strings are small (range 1-2 chars), the original is much faster and uses less memory.

When it comes to a random distribution of strings in the range 0-200 chars, the difference is negligible*:

[screenshot: benchmark results]

* EDIT: the "empty" category wasn't actually empty, but by accident the same as "5or16". And "random" used Random::Next, which caused too much noise. You can disregard those rows. Oh, and replicateOpt really is initOpt; I just missed it while renaming.

Unless I can think of something better (unrolling the original had no effect either), this one should probably remain as-is, since we won't be able to predict the average result of the initializer function.

The code:

let initOpt count (initializer: int -> string) =
    let target = Array.zeroCreate count
    for i=0 to count - 1 do 
        target.[i] <- initializer i

    String.Concat target

tbc (only String.collect remains)

abelbraaksma (Contributor, Author) commented Jun 16, 2020

String.init revisited

I couldn't let go. From the prev. measurements it was clear that using the highly optimized String.Concat was significantly faster on small counts, so I leveraged that, and now anything below count = 8 is between 12x (!!) and 1.1x faster, most of it over 2x faster, depending on the size of count and the size of the input (larger strings mean greater improvement).

Memory use and GC pressure also drop by 2x-8x.

[screenshot: benchmark results]

The code (above count = 4, String.Concat takes a params array; also note that the count > 8 case will be put up top by the compiler):

/// Using ParamArray as an optimized version of Array gives much better performance on small counts (up to 10x faster).
let initOpt (count:int) (initializer: int-> string) =
    match count with
    | 0 -> String.Empty
    | 1 -> initializer 0
    | 2 -> String.Concat(initializer 0, initializer 1)
    | 3 -> String.Concat(initializer 0, initializer 1, initializer 2)
    | 4 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3)
    | 5 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4)
    | 6 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4, initializer 5)
    | 7 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4, initializer 5, initializer 6)
    | 8 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4, initializer 5, initializer 6, initializer 7)
    | _ when count > 8 ->
        let res = StringBuilder count
        for i = 0 to count - 1 do 
            res.Append(initializer i) |> ignore
        res.ToString()
    | _ ->
        failwith "error"  // will become invalidArgInputMustBeNonNegative 

abelbraaksma (Contributor, Author) commented Jun 16, 2020

String.init revisited (2) and final

It turned out that, through copy and pasting, I made mistakes in the first measurements, wrongly believing that the array-based approach was that bad. I stabilized "random" between runs, so that all measurements can be compared, and I fixed the code where the wrong functions were called.

  • The fix now special-cases very small counts and unrolls the loop a little
  • Special-casing was slower than the array-based approach above count = 4, so I dropped it there
  • If the initializer function only returns very small strings of 1-3 chars, there's a performance degradation
  • If the output of the initializer is 4 or 5 chars, it evens out; if it is random, larger than that, or zero chars, the array-based version wins in all cases
  • Memory pressure is, on average, 60% less than before; GC pressure is oftentimes even 4x better.

Let's get the numbers:

[screenshot: benchmark results]

The code is now slightly more complex, but still not by much:

/// 2-10x faster on small counts, about 20% faster for very large count, for all cases, less memory pressure.
let initWithArray count (initializer: int -> string) =
    if count < 5 then
            match count with
            | 0 -> String.Empty
            | 1 -> initializer 0
            | 2 -> String.Concat(initializer 0, initializer 1)
            | 3 -> String.Concat(initializer 0, initializer 1, initializer 2)
            | 4 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3)
            | _ -> failwith "error" // arg error
    else
        let target = Array.zeroCreate count
        let mutable i = 0
        while i < count - count % 4 do 
            target.[i] <- initializer i
            target.[i + 1] <- initializer(i + 1)
            target.[i + 2] <- initializer(i + 2)
            target.[i + 3] <- initializer(i + 3)
            i <- i + 4

        while i < count do
            target.[i] <- initializer i
            i <- i + 1

        String.Concat target

If we zoom in on the break-even point, where the old version was faster than the new version, it is here. And as you can see, there's still less GC pressure:

[screenshot: benchmark results]

forki (Contributor) commented Jun 16, 2020

Are you planning to send fixes?

abelbraaksma (Contributor, Author) commented Jun 16, 2020

@forki, absolutely, as soon as the string functions are done (one left), I'll send in a PR. I figured it made more sense to first do the exercise, timings etc., so that the reasoning behind the fixes, which isn't always obvious, is at least logged.

forki (Contributor) commented Jun 16, 2020 via email

abelbraaksma (Contributor, Author) commented Jun 16, 2020

> Please do it as separate PRs so that it's easier to get those accepted

@forki I was planning to do one PR for the string stuff (and others for array or whatever else comes my way), which is really just a few lines in a single file. But if you think it is better to do it function-by-function, I can do that; I just thought that would make it harder instead of easier...

@cartermp, @TIHan, @dsyme, any preference here, so that I don't have to do the PRs twice?

forki (Contributor) commented Jun 16, 2020 via email

cartermp (Contributor) commented:

Yep, the smaller/more contained the better

abelbraaksma (Contributor, Author) commented:

Ok, got it, I'll make a separate PR for each function then, and I'll link each to the relevant comment.

abelbraaksma (Contributor, Author) commented Jun 17, 2020

@KevinRansom, an example that uses the buffered StringBuilder, as we discussed, is the String.concat function. On sequences it performs better than the other methods. It uses System.String::Join, which internally uses the buffered SB (but even the BCL does not use that if the input is an array). See the timings here: #9390 (comment)

KevinRansom (Member) commented:

@abelbraaksma, thanks, I will look at it and try to understand why.

cartermp (Contributor) commented:

@abelbraaksma how far along do you think this one is in terms of getting to closure?

abelbraaksma (Contributor, Author) commented Dec 22, 2020

@cartermp, I think there are quite a few improvements already, but I've yet to check in the array function improvements. I might take another stab at it during the holidays (won't be going anywhere anyway, it's still 2020 :))

dsyme (Contributor) commented Mar 30, 2022

I'm going to close this out. @abelbraaksma, this was an amazing analysis - can you add new PRs or issues for any other specific small improvements you still see as possible?

dsyme closed this as completed Mar 30, 2022
abelbraaksma (Contributor, Author) commented:

@dsyme, sure! I forgot about this one. The thread is still useful for other areas of optimization, but I agree that they should go into smaller, more contained issues and PRs.
