Performance of certain build-in functions, esp the ones related to Array manipulation #9390

Closed
abelbraaksma opened this issue Jun 3, 2020 · 28 comments
Labels: Area-Library (Issues for FSharp.Core not covered elsewhere), Feature Request

Comments

abelbraaksma (Contributor) commented Jun 3, 2020

I've noticed for some time now that quite a few functions that are called in many places are suboptimal in terms of performance. For instance, Array.create, Array.init, Array.concat and Array.filter can be improved upon, sometimes by about 1.2-1.5x, sometimes even manyfold (I got Array.replicate times source |> Array.concat 25x faster with a dedicated Array.repeat, so that was a bit of cheating).
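
For illustration, a dedicated Array.repeat of the kind mentioned above could look roughly like this. This is a hypothetical sketch (the name repeat and its exact shape are mine, not the code that was benchmarked): it allocates the result once and grows the copied region by doubling with Array.Copy, instead of building count intermediate arrays and concatenating them.

/// Hypothetical sketch of a dedicated Array.repeat (not the benchmarked code)
let repeat (count: int) (source: 'T[]) : 'T[] =
    if count < 0 then invalidArg "count" "must be non-negative"
    let len = source.Length
    let target = Array.zeroCreate (len * count)
    if len > 0 && count > 0 then
        // seed the first block
        Array.Copy(source, 0, target, 0, len)
        let mutable copied = len
        // keep doubling the filled region while it still fits
        while copied * 2 <= target.Length do
            Array.Copy(target, 0, target, copied, copied)
            copied <- copied * 2
        // copy whatever remains (always fewer than `copied` items)
        Array.Copy(target, 0, target, copied, target.Length - copied)
    target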

Sometimes such improvements relate specifically to large arrays, sometimes specifically to small ones. In line with that, the String.xxx functions that use a StringBuilder improve by about 20-30% when ToCharArray() is used instead.

Before I start making a PR, which would certainly include showing the timings with BenchmarkDotNet for a variety of inputs, would such a PR be welcome? While nothing I tried is really complex, and what I tried is provably correct, as always with optimizations, they will increase the complexity of the code.

To get some idea, here's String.map vs a very simple optimization, str.ToCharArray() |> Array.map mapper |> String, which in many cases wins by about 40% (this needs comparisons over more input lengths, though; the run below used a 12-char string):

| Method    | Job        | Jit       | Platform | Runtime    | Mean     | Error   | StdDev  | Min      | Max      | Ratio | Rank | Gen 0  | Gen 1 | Gen 2 | Allocated |
|-----------|------------|-----------|----------|------------|----------|---------|---------|----------|----------|-------|------|--------|-------|-------|-----------|
| fastMap   | Job-MACJMZ | LegacyJit | X64      | .NET 4.8   | 106.91ns | 1.069ns | 1.000ns | 105.31ns | 108.28ns | 0.65  | 1    | 0.0263 | -     | -     | 168 B     |
| stringMap | Job-MACJMZ | LegacyJit | X64      | .NET 4.8   | 164.29ns | 2.743ns | 2.291ns | 160.02ns | 169.34ns | 1.00  | 2    | 0.0315 | -     | -     | 201 B     |
| fastMap   | Job-CZIGBL | LegacyJit | X86      | .NET 4.8   | 118.16ns | 1.094ns | 0.970ns | 116.88ns | 120.14ns | 0.64  | 1    | 0.0219 | -     | -     | 120 B     |
| stringMap | Job-CZIGBL | LegacyJit | X86      | .NET 4.8   | 184.93ns | 1.647ns | 1.540ns | 181.82ns | 187.86ns | 1.00  | 2    | 0.0233 | -     | -     | 128 B     |
| fastMap   | Job-FZSSVS | RyuJit    | X64      | .NET 4.8   | 95.61ns  | 0.974ns | 0.911ns | 94.33ns  | 97.43ns  | 0.63  | 1    | 0.0261 | -     | -     | 168 B     |
| stringMap | Job-FZSSVS | RyuJit    | X64      | .NET 4.8   | 151.22ns | 1.639ns | 1.368ns | 149.13ns | 154.09ns | 1.00  | 2    | 0.0315 | -     | -     | 201 B     |
| fastMap   | Job-OXSCJZ | LegacyJit | X86      | .NET 4.6.1 | 117.96ns | 1.202ns | 1.065ns | 116.12ns | 119.48ns | 0.62  | 1    | 0.0227 | -     | -     | 120 B     |
| stringMap | Job-OXSCJZ | LegacyJit | X86      | .NET 4.6.1 | 191.32ns | 3.016ns | 2.673ns | 188.10ns | 196.17ns | 1.00  | 2    | 0.0237 | -     | -     | 128 B     |
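
For reference, the kind of BenchmarkDotNet harness behind a table like the one above could be set up roughly as follows. This is a minimal sketch: the benchmark class, the Char.ToUpper mapper and the input lengths are illustrative assumptions, not the exact setup used for the numbers shown.

open System
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

[<MemoryDiagnoser>]
type StringMapBenchmarks() =
    member val Input = "" with get, set

    // illustrative lengths; the real runs covered many more
    [<Params(12, 100)>]
    member val Length = 0 with get, set

    [<GlobalSetup>]
    member this.Setup() = this.Input <- String('x', this.Length)

    // the current FSharp.Core function, used as baseline
    [<Benchmark(Baseline = true)>]
    member this.stringMap() = String.map (fun c -> Char.ToUpper c) this.Input

    // the ToCharArray-based variant described above
    [<Benchmark>]
    member this.fastMap() =
        this.Input.ToCharArray()
        |> Array.map (fun c -> Char.ToUpper c)
        |> String

[<EntryPoint>]
let main _ =
    BenchmarkRunner.Run<StringMapBenchmarks>() |> ignore
    0
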
cartermp (Contributor) commented Jun 3, 2020

@abelbraaksma great observation. Would you be interested in submitting a PR to adjust these functions?

abelbraaksma (Contributor, Author) commented:

@cartermp, sure thing, I just wasn't certain how much "performance sugar" is allowed. I mean, I would assume using Span is not possible because we still support older .NET versions, but heavily-used functions like Array.create benefit tremendously from some loop-unrolling and block-alignment, for instance.

I'll submit a PR (will probably mark it WIP, as this may need some iterations) and have you comment on whether certain performance doping is allowed or not.

abelbraaksma (Contributor, Author) commented Jun 3, 2020

@cartermp, one other question: for the perf comparison, would it be enough to just run RyuJIT with .NET 4.8 on x64, plus one LegacyJit run on x86, or should I add more system configurations (.NET Core, Standard; I don't have Mono installed...)? It should be enough to focus on the 80% user group, right, and not focus too much on niche configurations?

cartermp (Contributor) commented Jun 3, 2020

I think .NET Core (latest) and .NET Framework (any version) would be good enough to run some benchmarks against.

Regarding performance sugar: yes, Span can't be used, because that would make FSharp.Core inaccessible anywhere except .NET Core and Xamarin. But anything else should be fine, so long as it doesn't change the public API or the end result of the functions.

TIHan (Contributor) commented Jun 4, 2020

I think a PR would be fine.

abelbraaksma (Contributor, Author) commented Jun 8, 2020

After playing around a little with the various ways String.map can be optimized, I'm pasting the summarized results here. I expect to be able to use the lessons learned here for the other functions, so that I can settle on fewer variants to test.

What I tried:

  • Removing StringBuilder, as we know the size of the string beforehand.
  • Inlining the loop (as opposed to using Array.map), plus loop unrolling. The latter proved not very effective, as can be seen below (x2 means 2 unrolls, x4 means 4).
  • while vs for, which should mostly be indistinguishable. On .NET Core that's true, but somehow on .NET Framework (not shown here) while was significantly slower.
  • several ways of allocating the new string
  • an approach with fixed and String.Copy. These are the fastest by far and use the least memory, but may not be "admissible", as it makes the functions unsafe.
  • for comparison, a few approaches with Span and String.Create. These cannot be used, as they require an extra dependency on .NET Framework. They are a good alternative to fixed: also little memory use, and faster than the rest, except on small strings.
  • optimizing for small strings, which is smallMap in the overview. This is the winner on the low range, but slightly slower for larger input (though this may be measurement deviation; the code is the same as fasterMap, except for a few leading ifs).

My own verdict:

  • If possible, I would go for fixed without loop-unrolling. It is simple enough, uses the least memory, and is 3x to even 4x faster than the current implementation. The larger the input, the bigger the difference.
  • Otherwise, the choice should be between fasterMap (simple and straightforward, and generally the fastest without "doping") and smallMap (if we want top performance on strings up to 8 chars; above that, they're mostly equal).

Note that every alternative I tried was faster than the current implementation for average input, and almost all of them have lower GC pressure and memory use.

I've also tested with very large inputs, up to 100MB; there the perf difference usually shifts further in favor of the new algorithms, because much less memory is used. With 100MB, fixed is almost 5x faster than current.

Benchmarks were run with BDN, using 0.01 max relative error. Originally each test was run with a variety of mapper functions, but that made little difference, so I left those variations out.

  • bold is the fastest of the likely candidates
  • italics is the fastest of the fixed versions
  • I didn't bother to tag Span and Alt, as I don't think they are possible candidates; they are here for comparison.
  • "current" means String.map as it is now in the FSharp.Core library; it is repeated for each block. Candidate means simplest to adopt. Alt means also simple, but slightly slower on .NET Framework than shown here, so I don't prefer them. Fixed is fixed, and Span is Span.
BenchmarkDotNet=v0.12.1, OS=Windows 7 SP1 (6.1.7601.0)
Intel Xeon CPU X5680 3.33GHz, 2 CPU, 12 logical and 12 physical cores
Frequency=3247119 Hz, Resolution=307.9653 ns, Timer=TSC
.NET Core SDK=3.1.300-preview-015135
  [Host]        : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT DEBUG
  .NET Core 3.1 : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT

Job=.NET Core 3.1  Runtime=.NET Core 3.1  

0 and 1 chars

Method Categories Value Mean Error StdDev Median Ratio Rank Gen 0 Allocated
current Candidate 0 12.86ns 0.101ns 0.089ns 12.89ns 1.00 1 0.0038 24 B
fastMap Candidate 0 13.07ns 0.078ns 0.069ns 13.06ns 1.02 2 0.0038 24 B
fasterMap Candidate 0 12.75ns 0.071ns 0.067ns 12.76ns 0.99 1 0.0038 24 B
smallMap Candidate 0 16.57ns 0.097ns 0.091ns 16.53ns 1.29 3 0.0038 24 B
 
current Alt 0 12.76ns 0.173ns 0.154ns 12.76ns 1.00 1 0.0038 24 B
whileMap Alt 0 12.88ns 0.073ns 0.065ns 12.89ns 1.01 1 0.0038 24 B
allocateMap Alt 0 12.86ns 0.161ns 0.151ns 12.89ns 1.01 1 0.0038 24 B
unrollMapX2 Alt 0 15.88ns 0.213ns 0.199ns 15.78ns 1.25 3 0.0038 24 B
unrollMapX4 Alt 0 15.43ns 0.082ns 0.076ns 15.43ns 1.21 2 0.0038 24 B
 
current Fixed 0 12.59ns 0.172ns 0.161ns 12.51ns 1.00 1 0.0038 24 B
fixedCopy Fixed 0 28.35ns 0.361ns 0.338ns 28.19ns 2.25 2 0.0076 48 B
fixedCopyx2 Fixed 0 28.36ns 0.212ns 0.177ns 28.42ns 2.25 2 0.0076 48 B
fixedCopyx4 Fixed 0 28.51ns 0.300ns 0.281ns 28.57ns 2.26 2 0.0076 48 B
 
current Span 0 12.90ns 0.126ns 0.118ns 12.87ns 1.00 1 0.0038 24 B
spanCopy Span 0 28.60ns 0.169ns 0.141ns 28.65ns 2.22 2 0.0076 48 B
spanStrCreate Span 0 35.44ns 0.395ns 0.350ns 35.37ns 2.75 3 0.0178 112 B
spanStrCreateAlt Span 0 48.54ns 0.936ns 0.919ns 48.79ns 3.76 4 0.0229 144 B
spanStrCreateUnx4 Span 0 47.61ns 0.984ns 1.171ns 47.65ns 3.70 4 0.0229 144 B
 
current Candidate 1 69.09ns 0.852ns 0.756ns 69.10ns 1.00 4 0.0254 160 B
fastMap Candidate 1 52.83ns 0.654ns 0.612ns 52.60ns 0.76 3 0.0178 112 B
fasterMap Candidate 1 45.80ns 0.482ns 0.451ns 45.85ns 0.66 2 0.0127 80 B
smallMap Candidate 1 24.11ns 0.419ns 0.372ns 24.11ns 0.35 1 0.0076 48 B
 
current Alt 1 67.27ns 0.505ns 0.473ns 67.23ns 1.00 4 0.0254 160 B
whileMap Alt 1 46.82ns 0.499ns 0.467ns 46.75ns 0.70 3 0.0127 80 B
allocateMap Alt 1 43.11ns 0.635ns 0.563ns 42.95ns 0.64 2 0.0127 80 B
unrollMapX2 Alt 1 23.80ns 0.247ns 0.231ns 23.87ns 0.35 1 0.0076 48 B
unrollMapX4 Alt 1 23.59ns 0.148ns 0.138ns 23.64ns 0.35 1 0.0076 48 B
 
current Fixed 1 69.15ns 1.132ns 1.059ns 69.30ns 1.00 3 0.0254 160 B
fixedCopy Fixed 1 30.94ns 0.520ns 0.486ns 31.12ns 0.45 2 0.0076 48 B
fixedCopyx2 Fixed 1 30.36ns 0.396ns 0.371ns 30.20ns 0.44 1 0.0076 48 B
fixedCopyx4 Fixed 1 31.12ns 0.271ns 0.226ns 31.07ns 0.45 2 0.0076 48 B
 
current Span 1 68.61ns 0.356ns 0.333ns 68.62ns 1.00 5 0.0254 160 B
spanCopy Span 1 31.36ns 0.529ns 0.494ns 31.22ns 0.46 1 0.0076 48 B
spanStrCreate Span 1 51.92ns 1.075ns 1.056ns 51.75ns 0.76 2 0.0216 136 B
spanStrCreateAlt Span 1 62.80ns 0.841ns 0.787ns 63.05ns 0.92 3 0.0267 168 B
spanStrCreateUnx4 Span 1 65.55ns 0.716ns 0.670ns 65.73ns 0.96 4 0.0267 168 B

5 and 10 chars

Method Categories Value Mean Error StdDev Median Ratio Rank Gen 0 Allocated
current Candidate 5 98.39ns 0.599ns 0.531ns 98.15ns 1.00 4 0.0280 176 B
fastMap Candidate 5 64.59ns 1.170ns 1.094ns 64.16ns 0.66 3 0.0216 136 B
fasterMap Candidate 5 54.99ns 0.511ns 0.478ns 54.95ns 0.56 1 0.0153 96 B
smallMap Candidate 5 61.75ns 1.051ns 0.983ns 61.69ns 0.63 2 0.0153 96 B
 
current Alt 5 104.15ns 2.107ns 2.342ns 105.23ns 1.00 4 0.0280 176 B
whileMap Alt 5 55.53ns 0.439ns 0.367ns 55.42ns 0.53 2 0.0153 96 B
allocateMap Alt 5 52.50ns 0.750ns 0.701ns 52.49ns 0.50 1 0.0153 96 B
unrollMapX2 Alt 5 55.48ns 0.780ns 0.729ns 55.24ns 0.53 2 0.0153 96 B
unrollMapX4 Alt 5 59.18ns 0.919ns 0.860ns 59.39ns 0.57 3 0.0153 96 B
current Fixed 5 103.54ns 1.900ns 1.866ns 103.56ns 1.00 3 0.0280 176 B
fixedCopy Fixed 5 37.75ns 0.698ns 0.619ns 37.71ns 0.36 1 0.0089 56 B
fixedCopyx2 Fixed 5 37.71ns 0.408ns 0.382ns 37.75ns 0.36 1 0.0089 56 B
fixedCopyx4 Fixed 5 39.36ns 0.488ns 0.407ns 39.38ns 0.38 2 0.0089 56 B
current Span 5 102.48ns 2.076ns 2.977ns 101.57ns 1.00 4 0.0280 176 B
spanCopy Span 5 40.11ns 0.787ns 0.966ns 39.84ns 0.39 1 0.0089 56 B
spanStrCreate Span 5 63.09ns 1.292ns 2.086ns 62.85ns 0.62 2 0.0229 144 B
spanStrCreateAlt Span 5 74.00ns 1.426ns 1.334ns 73.76ns 0.71 3 0.0280 176 B
spanStrCreateUnx4 Span 5 74.88ns 0.924ns 0.864ns 74.88ns 0.72 3 0.0280 176 B
 
 10 chars
current Candidate 10 137.76ns 1.850ns 1.640ns 137.51ns 1.00 4 0.0317 200 B
fastMap Candidate 10 81.41ns 0.725ns 0.678ns 81.29ns 0.59 3 0.0267 168 B
fasterMap Candidate 10 69.10ns 0.469ns 0.438ns 69.18ns 0.50 1 0.0191 120 B
smallMap Candidate 10 75.97ns 0.810ns 0.718ns 76.21ns 0.55 2 0.0191 120 B
 
current Alt 10 135.96ns 1.211ns 1.011ns 135.73ns 1.00 3 0.0317 200 B
whileMap Alt 10 70.25ns 1.253ns 1.110ns 70.04ns 0.52 2 0.0191 120 B
allocateMap Alt 10 67.91ns 0.863ns 0.721ns 67.95ns 0.50 1 0.0191 120 B
unrollMapX2 Alt 10 70.09ns 0.635ns 0.594ns 69.92ns 0.52 2 0.0191 120 B
unrollMapX4 Alt 10 68.89ns 0.695ns 0.650ns 68.92ns 0.51 1 0.0191 120 B
 
current Fixed 10 137.42ns 2.783ns 3.809ns 135.49ns 1.00 3 0.0317 200 B
fixedCopy Fixed 10 47.57ns 0.471ns 0.394ns 47.66ns 0.34 1 0.0114 72 B
fixedCopyx2 Fixed 10 48.22ns 0.301ns 0.282ns 48.20ns 0.35 1 0.0114 72 B
fixedCopyx4 Fixed 10 49.94ns 0.730ns 0.609ns 49.83ns 0.36 2 0.0114 72 B
 
current Span 10 136.76ns 1.311ns 1.226ns 136.86ns 1.00 5 0.0317 200 B
spanCopy Span 10 51.29ns 0.400ns 0.355ns 51.43ns 0.37 1 0.0114 72 B
spanStrCreate Span 10 75.43ns 0.913ns 0.763ns 75.27ns 0.55 2 0.0254 160 B
spanStrCreateAlt Span 10 87.43ns 0.599ns 0.531ns 87.50ns 0.64 4 0.0305 192 B
spanStrCreateUnx4 Span 10 85.80ns 1.017ns 0.901ns 85.81ns 0.63 3 0.0305 192 B

31 and 100 chars

Method Categories Value Mean Error StdDev Median Ratio Rank Gen 0 Allocated
current Candidate 31 296.99ns 5.675ns 12.217ns 294.12ns 1.00 4 0.0443 280 B
fastMap Candidate 31 139.43ns 1.113ns 1.041ns 139.34ns 0.46 3 0.0458 288 B
fasterMap Candidate 31 122.91ns 0.920ns 0.816ns 123.21ns 0.41 1 0.0317 200 B
smallMap Candidate 31 135.52ns 1.314ns 1.165ns 135.52ns 0.45 2 0.0317 200 B
 
current Alt 31 299.02ns 5.659ns 5.558ns 297.41ns 1.00 3 0.0443 280 B
whileMap Alt 31 120.16ns 2.376ns 3.005ns 117.99ns 0.41 1 0.0317 200 B
allocateMap Alt 31 129.14ns 0.867ns 0.811ns 129.41ns 0.43 2 0.0317 200 B
unrollMapX2 Alt 31 118.54ns 1.121ns 0.994ns 118.43ns 0.40 1 0.0317 200 B
unrollMapX4 Alt 31 119.67ns 1.613ns 1.430ns 119.46ns 0.40 1 0.0317 200 B
 
current Fixed 31 292.63ns 3.647ns 3.412ns 291.12ns 1.00 4 0.0443 280 B
fixedCopy Fixed 31 90.22ns 0.522ns 0.488ns 90.32ns 0.31 1 0.0178 112 B
fixedCopyx2 Fixed 31 92.90ns 0.598ns 0.559ns 92.94ns 0.32 2 0.0178 112 B
fixedCopyx4 Fixed 31 100.58ns 1.755ns 1.642ns 100.22ns 0.34 3 0.0178 112 B
 
current Span 31 295.64ns 1.518ns 1.267ns 295.66ns 1.00 5 0.0443 280 B
spanCopy Span 31 101.58ns 1.860ns 1.826ns 101.79ns 0.34 1 0.0178 112 B
spanStrCreate Span 31 132.14ns 2.560ns 2.395ns 132.83ns 0.45 2 0.0317 200 B
spanStrCreateAlt Span 31 146.46ns 2.965ns 3.530ns 147.03ns 0.49 4 0.0370 232 B
spanStrCreateUnx4 Span 31 140.53ns 2.264ns 2.007ns 141.06ns 0.47 3 0.0370 232 B
 
 100 chars
current Candidate 100 796.34ns 15.934ns 14.125ns 788.09ns 1.00 3 0.0877 552 B
fastMap Candidate 100 347.12ns 5.982ns 7.986ns 345.38ns 0.44 2 0.1106 696 B
fasterMap Candidate 100 311.80ns 1.806ns 1.601ns 312.26ns 0.39 1 0.0749 472 B
smallMap Candidate 100 344.51ns 2.428ns 2.027ns 345.05ns 0.43 2 0.0749 472 B
 
current Alt 100 802.38ns 14.461ns 13.527ns 803.31ns 1.00 5 0.0877 552 B
whileMap Alt 100 310.82ns 3.155ns 2.951ns 310.92ns 0.39 3 0.0749 472 B
allocateMap Alt 100 334.36ns 1.520ns 1.421ns 334.56ns 0.42 4 0.0749 472 B
unrollMapX2 Alt 100 283.68ns 1.994ns 1.865ns 284.27ns 0.35 2 0.0749 472 B
unrollMapX4 Alt 100 276.16ns 3.283ns 2.910ns 276.29ns 0.34 1 0.0749 472 B
 
current Fixed 100 804.66ns 5.603ns 4.679ns 805.33ns 1.00 4 0.0877 552 B
fixedCopy Fixed 100 226.07ns 1.051ns 0.877ns 226.04ns 0.28 1 0.0393 248 B
fixedCopyx2 Fixed 100 234.08ns 2.609ns 2.313ns 234.07ns 0.29 2 0.0391 248 B
fixedCopyx4 Fixed 100 245.54ns 2.454ns 2.175ns 245.17ns 0.31 3 0.0391 248 B
 
current Span 100 804.95ns 8.385ns 7.844ns 805.59ns 1.00 5 0.0877 552 B
spanCopy Span 100 257.32ns 1.536ns 1.283ns 257.22ns 0.32 1 0.0391 248 B
spanStrCreate Span 100 315.95ns 1.891ns 1.677ns 315.88ns 0.39 3 0.0534 336 B
spanStrCreateAlt Span 100 344.50ns 6.737ns 6.302ns 342.70ns 0.43 4 0.0587 368 B
spanStrCreateUnx4 Span 100 294.39ns 2.441ns 2.283ns 293.91ns 0.37 2 0.0587 368 B

256 and more chars

Method Categories Value Mean Error StdDev Median Ratio Rank Gen 0 Gen 1 Allocated
current Candidate 256 1,952.41ns 11.159ns 9.892ns 1,951.66ns 1.00 3 0.1869 - 1176 B
fastMap Candidate 256 800.65ns 7.703ns 6.829ns 802.18ns 0.41 2 0.2594 - 1632 B
fasterMap Candidate 256 739.04ns 9.457ns 8.847ns 735.24ns 0.38 1 0.1745 - 1096 B
smallMap Candidate 256 811.24ns 3.172ns 2.967ns 811.42ns 0.42 2 0.1745 - 1096 B
 
current Alt 256 1,950.68ns 16.943ns 15.849ns 1,948.46ns 1.00 5 0.1869 - 1176 B
whileMap Alt 256 730.78ns 7.339ns 6.506ns 731.13ns 0.37 3 0.1745 - 1096 B
allocateMap Alt 256 791.70ns 7.619ns 5.949ns 790.93ns 0.41 4 0.1745 - 1096 B
unrollMapX2 Alt 256 663.65ns 3.941ns 3.687ns 663.96ns 0.34 2 0.1745 - 1096 B
unrollMapX4 Alt 256 641.95ns 6.201ns 5.497ns 640.22ns 0.33 1 0.1745 - 1096 B
 
current Fixed 256 1,975.77ns 24.444ns 22.865ns 1,977.10ns 1.00 4 0.1869 - 1176 B
fixedCopy Fixed 256 551.65ns 7.620ns 7.127ns 549.79ns 0.28 1 0.0887 - 560 B
fixedCopyx2 Fixed 256 569.96ns 5.247ns 4.908ns 568.99ns 0.29 2 0.0887 - 560 B
fixedCopyx4 Fixed 256 589.82ns 4.786ns 3.997ns 590.19ns 0.30 3 0.0887 - 560 B
 
current Span 256 1,969.66ns 19.996ns 18.705ns 1,970.26ns 1.00 5 0.1869 - 1176 B
spanCopy Span 256 632.78ns 6.331ns 5.922ns 634.22ns 0.32 1 0.0887 - 560 B
spanStrCreate Span 256 732.16ns 5.446ns 5.094ns 733.44ns 0.37 3 0.1030 - 648 B
spanStrCreateAlt Span 256 772.54ns 5.266ns 4.668ns 773.87ns 0.39 4 0.1078 - 680 B
spanStrCreateUnx4 Span 256 665.30ns 10.455ns 9.779ns 665.44ns 0.34 2 0.1078 - 680 B
 
 1024 chars
current Candidate 1024 7,629.39ns 60.810ns 53.907ns 7,625.62ns 1.00 3 0.6714 - 4248 B
fastMap Candidate 1024 3,082.48ns 26.745ns 25.017ns 3,070.76ns 0.40 2 0.9918 0.0038 6240 B
fasterMap Candidate 1024 2,815.90ns 11.005ns 8.592ns 2,815.05ns 0.37 1 0.6638 - 4168 B
smallMap Candidate 1024 2,804.66ns 35.226ns 31.227ns 2,801.53ns 0.37 1 0.6638 - 4168 B
 
current Alt 1024 7,653.67ns 66.666ns 62.360ns 7,651.03ns 1.00 5 0.6714 - 4248 B
whileMap Alt 1024 2,820.49ns 16.317ns 15.263ns 2,820.79ns 0.37 3 0.6638 - 4168 B
allocateMap Alt 1024 3,046.43ns 17.272ns 14.423ns 3,047.76ns 0.40 4 0.6638 - 4168 B
unrollMapX2 Alt 1024 2,690.30ns 26.892ns 22.456ns 2,687.61ns 0.35 2 0.6638 - 4168 B
unrollMapX4 Alt 1024 2,530.67ns 14.308ns 13.383ns 2,531.89ns 0.33 1 0.6638 - 4168 B
 
current Fixed 1024 7,647.49ns 51.204ns 47.897ns 7,635.90ns 1.00 4 0.6714 - 4248 B
fixedCopy Fixed 1024 2,121.33ns 29.708ns 27.789ns 2,118.47ns 0.28 1 0.3319 - 2096 B
fixedCopyx2 Fixed 1024 2,164.05ns 11.677ns 9.117ns 2,164.54ns 0.28 2 0.3319 - 2096 B
fixedCopyx4 Fixed 1024 2,263.88ns 23.129ns 20.504ns 2,255.20ns 0.30 3 0.3319 - 2096 B
 
current Span 1024 7,669.55ns 67.241ns 62.898ns 7,679.77ns 1.00 4 0.6714 - 4248 B
spanCopy Span 1024 2,397.11ns 16.097ns 14.269ns 2,398.97ns 0.31 1 0.3319 - 2096 B
spanStrCreate Span 1024 2,814.25ns 43.848ns 41.016ns 2,815.72ns 0.37 3 0.3471 - 2184 B
spanStrCreateAlt Span 1024 2,865.30ns 22.238ns 20.801ns 2,866.39ns 0.37 3 0.3510 - 2216 B
spanStrCreateUnx4 Span 1024 2,448.91ns 35.363ns 31.348ns 2,452.51ns 0.32 2 0.3510 - 2216 B
 
 8192 chars
current Candidate 8192 61,304.72ns 350.006ns 327.396ns 61,285.96ns 1.00 3 5.1270 0.2441 32920 B
fastMap Candidate 8192 25,785.01ns 305.116ns 285.405ns 25,831.80ns 0.42 2 7.8125 0.3052 49248 B
fasterMap Candidate 8192 23,305.23ns 121.598ns 113.743ns 23,323.38ns 0.38 1 5.2185 0.1526 32840 B
smallMap Candidate 8192 23,291.14ns 125.680ns 117.561ns 23,337.55ns 0.38 1 5.2185 0.1526 32840 B
 
current Alt 8192 61,967.77ns 536.539ns 501.879ns 61,816.82ns 1.00 5 5.1270 0.2441 32920 B
whileMap Alt 8192 23,592.00ns 328.376ns 307.163ns 23,607.09ns 0.38 3 5.2185 0.1526 32840 B
allocateMap Alt 8192 24,950.07ns 201.715ns 178.815ns 24,959.22ns 0.40 4 5.2185 0.1526 32840 B
unrollMapX2 Alt 8192 22,003.28ns 116.477ns 97.264ns 22,005.69ns 0.36 2 5.2185 0.1526 32840 B
unrollMapX4 Alt 8192 21,074.46ns 333.249ns 295.416ns 21,083.36ns 0.34 1 5.2185 0.1526 32840 B
 
current Fixed 8192 61,401.46ns 508.529ns 450.797ns 61,337.03ns 1.00 4 5.1270 0.2441 32920 B
fixedCopy Fixed 8192 16,758.64ns 118.942ns 111.258ns 16,766.96ns 0.27 1 2.5940 - 16432 B
fixedCopyx2 Fixed 8192 17,569.22ns 151.468ns 141.683ns 17,581.75ns 0.29 2 2.5940 - 16432 B
fixedCopyx4 Fixed 8192 18,269.65ns 119.090ns 105.570ns 18,253.54ns 0.30 3 2.5940 - 16432 B
 
current Span 8192 62,410.59ns 785.822ns 735.058ns 62,375.61ns 1.00 4 5.1270 0.2441 32920 B
spanCopy Span 8192 19,228.95ns 113.409ns 94.702ns 19,240.35ns 0.31 1 2.5940 - 16432 B
spanStrCreate Span 8192 22,603.78ns 168.023ns 157.169ns 22,597.73ns 0.36 3 2.6245 - 16520 B
spanStrCreateAlt Span 8192 21,961.11ns 321.794ns 301.006ns 21,942.30ns 0.35 2 2.6245 - 16552 B
spanStrCreateUnx4 Span 8192 18,994.96ns 241.445ns 214.034ns 18,928.00ns 0.30 1 2.6245 - 16552 B

The code:

// Note: the snippets below assume these opens; #nowarn "9" is needed for the NativePtr/fixed code in StringMapUnsafe.
open System
open System.Runtime.InteropServices
open Microsoft.FSharp.NativeInterop

#nowarn "9"

module StringMap =

    /// Basic String.map replacement, Array.map over CharArray
    /// dotnet mem: 6x(!) string size
    let fastMap mapper str =
        if String.IsNullOrEmpty str then String.Empty
        else
            str.ToCharArray()
            |> Array.map mapper
            |> String

    /// A faster String.map replacement, inlining the loop instead of Array.map. Uses CharArray.
    /// NOTES: used as baseline for StringMapUnsafe
    /// dotnet mem: 4x string size
    let fasterMap mapper str =
        if String.IsNullOrEmpty str then String.Empty
        else
            let result = str.ToCharArray()
            let mutable i = 0
            for c in result do
                result.[i] <- mapper c
                i <- i + 1

            String result

    /// Instead of for..in uses while. Otherwise same as fasterMap
    /// dotnet mem: 4x string size
    let whileMap mapper str =
        if String.IsNullOrEmpty str then String.Empty
        else
            let result = str.ToCharArray()
            let mutable i = 0
            while i < result.Length do
                result.[i] <- mapper result.[i]
                i <- i + 1

            String result

    /// Like fasterMap, but uses Array.zeroCreate with string-length
    /// dotnet mem: 4x string data
    let allocateMap mapper str =
        if String.IsNullOrEmpty str then String.Empty
        else
            let result = Array.zeroCreate str.Length
            let mutable i = 0
            for c in str do
                result.[i] <- mapper c
                i <- i + 1

            String result

    /// String.map equiv. optimized for very small maps. The break-even point is around 4 chars.
    let smallMap (mapper: char -> char) (str: string) =
        let len = String.length str
        if len > 4 then
            let result = str.ToCharArray()
            let mutable i = 0
            for c in result do
                result.[i] <- mapper c
                i <- i + 1

            String result

        elif len > 2 then
            if len = 3 
            then String [|mapper str.[0]; mapper str.[1]; mapper str.[2]|]
            else String [|mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3]|]

        elif len = 2 then
            String [|mapper str.[0]; mapper str.[1]|]

        elif len = 1 then
            (mapper str.[0]).ToString()

        else
            String.Empty


    /// String.map equiv. that uses safe unrolling. This is only 2-5% faster than String.fasterMap above. It suffers from non-elimination of range-checks.
    let unrollMapx2 (mapper: char -> char) (str: string) =
        let len = String.length str

        if len > 4 then
            let result = str.ToCharArray()
            let mutable i = 0
            while i < len - len % 2 do
                result.[i] <- mapper result.[i]
                result.[i + 1] <- mapper result.[i + 1]
                i <- i + 2

            // the remainder
            if len % 2 = 1 then
                result.[i] <- mapper result.[i]
                //i <- i + 1

            String result

        elif len > 2 then
            if len = 3 
            then String [|mapper str.[0]; mapper str.[1]; mapper str.[2]|]
            else String [|mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3]|]

        elif len = 2 then
            String [|mapper str.[0]; mapper str.[1]|]

        elif len = 1 then
            (mapper str.[0]).ToString()

        else
            String.Empty


    /// String.map equiv. that uses safe unrolling. This is only 2-5% faster than String.fasterMap above. 
    /// It suffers from non-elimination of range-checks.
    let unrollMapx4 (mapper: char -> char) (str: string) =
        let len = String.length str

        if len > 4 then
            let result = str.ToCharArray()
            let mutable i = 0
            while i < len - len % 4 do
                result.[i] <- mapper result.[i]
                result.[i + 1] <- mapper result.[i + 1]
                result.[i + 2] <- mapper result.[i + 2]
                result.[i + 3] <- mapper result.[i + 3]
                i <- i + 4

            // the remainder
            while i < len do
                result.[i] <- mapper result.[i]
                i <- i + 1

            String result

        elif len > 2 then
            if len = 3 
            then String [|mapper str.[0]; mapper str.[1]; mapper str.[2]|]
            else String [|mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3]|]

        elif len = 2 then
            String [|mapper str.[0]; mapper str.[1]|]

        elif len = 1 then
            (mapper str.[0]).ToString()

        else
            String.Empty


module StringMapUnsafe =
    /// Unrollx2. Read from the source as str.[i], use NativePtr for the result, which is created with String.Copy + fixed; return the modified string
    let fixedCopyx2 (mapper: char -> char) (str: string) =

        let len = String.length str
        let result = String.Copy str
        use shadow = fixed result
        let mutable i = 0
        while i < len - len % 2 do
            NativePtr.set shadow i (mapper str.[i])
            NativePtr.set shadow (i + 1) (mapper str.[i + 1])
            i <- i + 2

        while i < len do
            NativePtr.set shadow i (mapper str.[i])
            i <- i + 1

        result

    /// Unrollx4. Read from the source as str.[i], use NativePtr for the result, which is created with String.Copy + fixed; return the modified string
    let fixedCopyx4 (mapper: char -> char) (str: string) =
        let len = String.length str
        
        let result = String.Copy str
        use shadow = fixed result
        let mutable i = 0
        while i < len - len % 4 do
            NativePtr.set shadow i (mapper str.[i])
            NativePtr.set shadow (i + 1) (mapper str.[i + 1])
            NativePtr.set shadow (i + 2) (mapper str.[i + 2])
            NativePtr.set shadow (i + 3) (mapper str.[i + 3])
            i <- i + 4

        while i < len do
            NativePtr.set shadow i (mapper str.[i])
            i <- i + 1

        result

    /// Unrollx4. Create copy with String.Copy + fixed and use NativePtr to read and write from it, return modified string.
    let fixedCopy (mapper: char -> char) (str: string) =
        let len = String.length str

        /// a copy to use as the result (we don't want to overwrite the input string!)
        let result = String.Copy str

        /// pointer to the result-string
        use shadow = fixed result

        // loop-unroll
        let mutable i = 0
        while i < len - len % 4 do
            // NOTES:
            // - reading/writing the same memory address appears to be somewhat faster than reading from the string by index, as with str.[i]
            // - although fewer IL instructions are generated by calculating the address ourselves and then using NativePtr.read/write, the approach used here appears to be faster (and safer).
            NativePtr.set shadow i (mapper (NativePtr.get shadow i))
            NativePtr.set shadow (i + 1) (mapper (NativePtr.get shadow (i + 1)))
            NativePtr.set shadow (i + 2) (mapper (NativePtr.get shadow (i + 2)))
            NativePtr.set shadow (i + 3) (mapper (NativePtr.get shadow (i + 3)))
            i <- i + 4

        // process the remainder
        while i < len do
            NativePtr.set shadow i (mapper (NativePtr.get shadow i))
            i <- i + 1

        result

    /// Unrollx4. Create copy with String.Copy, ReadOnlySpan over it, marshal-as read/write Span<char> over same copy. Return modified underlying string.
    /// Performance: ~20-30% faster than fasterMap.
    let spanCopy (mapper: char -> char) (str: string) =
        let len = String.length str

        let resultStr = String.Copy str
        let roSpan = resultStr.AsSpan()   // span over the copy; writing via the marshalled span below must not touch the original str
        let writeSpan = MemoryMarshal.CreateSpan(&MemoryMarshal.GetReference roSpan, len)
        
        // loop-unroll
        let mutable i = 0
        while i < len - len % 4 do
            writeSpan.[i] <- mapper roSpan.[i]
            writeSpan.[i + 1] <- mapper roSpan.[i + 1]
            writeSpan.[i + 2] <- mapper roSpan.[i + 2]
            writeSpan.[i + 3] <- mapper roSpan.[i + 3]
            i <- i + 4

        // process the remainder
        while i < len do
            writeSpan.[i] <- mapper roSpan.[i]
            i <- i + 1

        resultStr

    /// No unroll, uses String.Create with Span.
    /// Performance: 0-20% faster than fasterMap, but fast only on larger (64kb) input
    let spanCreate1 (mapper: char -> char) (str: string) =

        let len = String.length str
        String.Create(len, str, fun buf v -> 
            let mutable i = 0
            for c in v do
                buf.[i] <- mapper c
                i <- i + 1)

    /// No unroll, uses String.Create with Span, no closure
    /// Performance: 0-22% faster than fasterMap; the larger the input, the bigger the gain (sometimes slower!)
    let spanCreate2 (mapper: char -> char) (str: string) =

        let len = String.length str
        String.Create(len, (str, mapper), fun buf (str, f) -> 
            let mutable i = 0
            for c in str do
                buf.[i] <- f c
                i <- i + 1)

    /// Unrollx4, uses String.Create with Span. All indexed operations on spans.
    /// Performance: 15-32% (!) faster than fasterMap, the larger the input, the bigger the gain
    let spanCreateUnrollx4 (mapper: char -> char) (str: string) =

        let len = String.length str
        String.Create(len, (str, mapper), fun target (str, f) -> 
            let mutable i = 0
            let source = str.AsSpan()
            while i < len - len % 4 do
                target.[i] <- f source.[i]
                target.[i + 1] <- f source.[i + 1]
                target.[i + 2] <- f source.[i + 2]
                target.[i + 3] <- f source.[i + 3]
                i <- i + 4

            // process the remainder
            while i < len do
                target.[i] <- f source.[i]
                i <- i + 1)

abelbraaksma (Contributor, Author) commented:

@dsyme or @cartermp, can I get a confirmation that using fixed and pointer arithmetic is off limits (or not)? If it's allowed, I'd gladly use it, as it gives a dramatic reduction in memory usage and increases performance, but obviously it requires more careful coding and it will automatically mark certain functions as unsafe, which I think is a "no go". Also, it may be considered fine on arrays but not on strings; I'd totally understand that.

There are already dramatic perf improvements using other methods, but fixed trumps all. And before I get to the array functions, I'd like to limit my scope (consider the above an exercise in what's possible).

I know Span cannot be used, otherwise that would be a good alternative and gets close to fixed in terms of memory and speed.

cartermp (Contributor) commented Jun 9, 2020

Yes, spans are off-limits (unless we decide to move FSharp.Core to also bring in a dependency on System.Memory). I'd actually love to do that but I know that @dsyme is quite allergic to them.

I think we should strive for consistency in implementation here. Array and list internals tend not to use pointers to accomplish things. I'd like for us to eke out as much perf and memory as possible if we can, but I think that ultimately the best way forward is with Span, when it's appropriate to add it to FSharp.Core. It strikes the right balance of perf, readability, and safety. So the second-best implementation is probably sufficient, and a clear upgrade anyway.

abelbraaksma (Contributor, Author) commented:

> unless we decide to move FSharp.Core to also bring in a dependency on System.Memory

If, for ImmutableArray, we decide to add that dependency, then I believe that includes System.Memory:

[screenshot: package dependency list]

(which in turn gives us also System.Buffers and System.Numerics.Vectors, each of which open new ways of potentially improving certain key performance areas)

> Array and list internals tend not to use pointers to accomplish things.
> <snip/> but I think that ultimately the best way forward is with Span, when it's appropriate to add it to FSharp.Core. It strikes the right balance of perf, readability, and safety.

Agreed. Yeah, Span would give us wings in some areas :).

Ok, I'll move forward without the "extreme" optimizations then.

abelbraaksma (Contributor, Author) commented Jun 9, 2020

Next up: String.concat. This one does nothing more than call into String.Join, but since it takes a sequence for the concatenation and doesn't special-case the empty separator string, I figured: let's see what happens if I special-case for array and list. (If anyone knows of other types that should be special-cased, please let me know. For instance, I wonder if there's a safe way for library functions to get at the underlying array or list when the sequence is created with Seq.ofArray or Seq.ofList, so that people won't get punished for using that instead of casting.)

TLDR conclusions:

  • Code changes for this one are simple and straightforward
  • Speedup of up to ~50% for arrays and up to ~20% for lists. Seq stays the same.
  • Mem decrease: almost 60% less mem use for arrays, and ~40% for lists
  • GC pressure drops by up to 50%, except for very small lists and arrays

I think these numbers are to be expected, considering that String.Join uses an optimized, mutable string under the hood with a pre-calculated size for Concat and when an array is passed, whereas String.Join over an IEnumerable uses a more resource-hungry StringBuilder.

Below: bold is fastest. I didn't select a champion if the results were within the margin of error.

BenchmarkDotNet=v0.12.1, OS=Windows 7 SP1 (6.1.7601.0)
Intel Xeon CPU X5680 3.33GHz, 2 CPU, 12 logical and 12 physical cores
Frequency=3247119 Hz, Resolution=307.9653 ns, Timer=TSC
.NET Core SDK=3.1.300-preview-015135
  [Host]     : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT DEBUG
  Job-FRSYQM : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT

MaxRelativeError=0.02  Runtime=.NET Core 3.1  MaxIterationCount=10  
MaxWarmupIterationCount=5  MinIterationCount=5  MinWarmupIterationCount=2  
Method Sep CollSize CollType Mean Error StdDev Ratio Rank Gen 0 Gen 1 Allocated
current Small Array 484.4ns 7.61ns 1.98ns 1.00 2 0.0467 - 296 B
concatOpt Small Array 256.3ns 4.80ns 1.25ns 0.53 1 0.0420 - 264 B
 
current Small List 554.0ns 8.99ns 3.21ns 1.00 2 0.0477 - 304 B
concatOpt Small List 391.7ns 7.23ns 2.58ns 0.71 1 0.0710 - 448 B
 
current Small Seq 541.7ns 10.36ns 7.49ns 1.00 1 0.0467 - 296 B
concatOpt Small Seq 556.4ns 9.02ns 2.34ns 1.02 2 0.0467 - 296 B
 
current Medium Array 10,715.8ns 148.95ns 38.68ns 1.00 2 2.1667 0.0916 13656 B
concatOpt Medium Array 4,674.7ns 92.60ns 72.29ns 0.43 1 0.7553 0.0076 4784 B
 
current Medium List 12,422.7ns 188.90ns 29.23ns 1.00 2 2.1667 0.0916 13664 B
concatOpt Medium List 7,951.7ns 152.82ns 175.98ns 0.64 1 1.2665 0.0153 8008 B
 
current Medium Seq 12,905.6ns 435.52ns 651.86ns 1.00 2 2.1667 0.0916 13656 B
concatOpt Medium Seq 11,461.7ns 212.56ns 94.38ns 0.88 1 2.1667 0.0916 13656 B
 
current &|& Small Array 594.5ns 11.54ns 14.17ns 1.00 2 0.0648 - 408 B
concatOpt &|& Small Array 402.5ns 7.20ns 2.57ns 0.69 1 0.0596 - 376 B
 
current &|& Small List 654.4ns 12.02ns 8.69ns 1.00 2 0.0658 - 416 B
concatOpt &|& Small List 524.0ns 9.51ns 4.22ns 0.80 1 0.0887 - 560 B
 
current &|& Small Seq 624.0ns 12.35ns 12.13ns 1.00 1 0.0648 - 408 B
concatOpt &|& Small Seq 626.8ns 11.61ns 9.70ns 1.01 1 0.0648 - 408 B
 
current &|& Medium Array 13,245.0ns 255.18ns 133.47ns 1.00 2 2.5482 0.1373 16048 B
concatOpt &|& Medium Array 7,656.2ns 127.40ns 66.63ns 0.58 1 1.1368 0.0229 7176 B
 
current &|& Medium List 14,149.6ns 220.37ns 78.59ns 1.00 2 2.5482 0.1373 16056 B
concatOpt &|& Medium List 10,824.5ns 183.83ns 81.62ns 0.77 1 1.6479 0.0458 10400 B
 
current &|& Medium Seq 13,809.2ns 258.95ns 154.10ns 1.00 1 2.5482 0.1373 16048 B
concatOpt &|& Medium Seq 15,213.7ns 565.78ns 846.83ns 1.07 2 2.5330 0.1221 16048 B

Big inputs, 819200 items, 4874240 chars total, ~9.3MB

Method Sep CollType Mean Error StdDev Ratio Rank Gen 0 Gen 1 Gen 2 Allocated
current Array 27.00ms 0.351ms 0.091ms 1.00 2 1656.2500 718.7500 125.0000 18.65 MB
concatOpt Array 10.27ms 0.155ms 0.069ms 0.38 1 - - - 9.3 MB
 
current List 32.12ms 0.201ms 0.052ms 1.00 2 1656.2500 718.7500 125.0000 18.65 MB
concatOpt List 22.57ms 0.682ms 0.357ms 0.71 1 - - - 15.55 MB
 
current Seq 29.01ms 0.140ms 0.036ms 1.00 1 1656.2500 718.7500 125.0000 18.65 MB
concatOpt Seq 30.62ms 0.600ms 0.314ms 1.06 2 62.5000 62.5000 62.5000 31.55 MB
 
current &|& Array 32.07ms 0.707ms 0.109ms 1.00 2 2250.0000 875.0000 187.5000 24.92 MB
concatOpt &|& Array 16.51ms 0.259ms 0.115ms 0.51 1 - - - 12.42 MB
 
current &|& List 36.85ms 0.505ms 0.131ms 1.00 2 2285.7143 928.5714 214.2857 24.92 MB
concatOpt &|& List 28.60ms 0.564ms 0.250ms 0.78 1 - - - 18.67 MB
 
current &|& Seq 33.47ms 0.580ms 0.151ms 1.00 1 2266.6667 933.3333 200.0000 24.92 MB
concatOpt &|& Seq 33.34ms 0.453ms 0.118ms 1.00 1 2266.6667 933.3333 200.0000 24.92 MB

The source code:

let private concatArray sep (strings: string []) =
    match String.length sep with
    | 0 -> String.Concat strings
    | 1 -> String.Join(sep.[0], strings, 0, strings.Length)
    | _ -> String.Join(sep, strings, 0, strings.Length)

let concatOpt sep (strings: seq<string>) =
    match strings with
    | :? array<string> as arr -> concatArray sep arr
    | :? list<string> as lst -> lst |> List.toArray |> concatArray sep
    | _ ->
        // special-casing Seq for sep = String.Empty is *not* faster due to the involved 
        // overhead of Seq.toArray and the already well-optimized String.Join
        String.Join(sep, strings)

abelbraaksma (Contributor, Author) commented Jun 11, 2020

I think using screenshots is a little bit easier to read; those tables take up too much space horizontally and vertically, which makes them hard to read.

String.iter, unrolled

There was little that could be improved for String.iter, but unrolling showed a ~25% perf gain:
[screenshot: benchmark results]
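
The unrolled variant itself isn't shown in the screenshot; following the same pattern as unrollMapx4 above, it could look roughly like this (a sketch, not necessarily the exact code that was measured):

/// Sketch of a 4x-unrolled String.iter (same pattern as unrollMapx4 above)
let iterUnrolled (action: char -> unit) (str: string) =
    let len = String.length str
    let mutable i = 0
    while i < len - len % 4 do
        action str.[i]
        action str.[i + 1]
        action str.[i + 2]
        action str.[i + 3]
        i <- i + 4

    // the remainder
    while i < len do
        action str.[i]
        i <- i + 1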

String.iteri, unrolled

Basically same story:

[screenshot: benchmark results]

String.mapi

Performance characteristics were, unsurprisingly, much the same as for String.map above. An easy gain of 2x performance, and a reduction in mem and GC pressure.

Real advancement can be made here with fixed, or with Span when the time comes. Just like with String.map, unrolling had limited effect, though it was more significant than for String.map. I suggest taking the unrollx2 variant (x4 and x8 differed by only ~2%, not worth the effort). This would lead to a 2.5x performance gain.

[screenshot: benchmark results]
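
For reference, the unrollx2 variant of String.mapi suggested above could look roughly like this (a sketch following the unrollMapx2 pattern from the String.map experiments, not necessarily the exact benchmarked code):

/// Sketch of a 2x-unrolled String.mapi over a char array
let mapiUnrollx2 (mapper: int -> char -> char) (str: string) =
    if String.IsNullOrEmpty str then String.Empty
    else
        let result = str.ToCharArray()
        let len = result.Length
        let mutable i = 0
        while i < len - len % 2 do
            result.[i] <- mapper i result.[i]
            result.[i + 1] <- mapper (i + 1) result.[i + 1]
            i <- i + 2

        // the remainder
        if len % 2 = 1 then
            result.[i] <- mapper i result.[i]

        String result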

String.filter

Since Array.filter is highly optimized with a bit-mask method, I decided to try an approach where I simply use Array.filter on the string's characters. This improved perf dramatically, by some 1.7-2.5x, depending on whether the predicate returns true for "all" characters, "some", or "none" (false for all).

The buildAndCut method pre-allocates the target array with Array.zeroCreate and uses a String constructor overload to create a new string from only the filtered characters. I thought this would cause more mem pressure, but the result was better in both mem and speed.

let filterByArray predicate (str: string) =
    if String.IsNullOrEmpty str then String.Empty
    else
        str.ToCharArray()
        |> Array.filter predicate
        |> String

let filterBuildAndCut predicate (str: string) =
    let len = String.length str

    if len = 0 then String.Empty
    else
        let target = Array.zeroCreate len
        let mutable i = 0
        for c in str do
            if predicate c then 
                target.[i] <- c
                i <- i + 1

        String(target, 0, i)

[screenshot: benchmark results]

Still TODO:

  • String.collect
  • String.init
  • String.replicate
  • String.forall
  • String.exists
  • String.length

abelbraaksma (Contributor, Author) commented Jun 14, 2020

Next set:

String.length

At first it seems not much can be done here, but on closer inspection it calls String.get_Length twice on the happy path. Changing String.IsNullOrEmpty to isNull gives a 33% perf improvement. Since this is such a small function, it'll only show up when used in tight loops. Anyway, it's as simple as changing the current impl. to this:

let nullOnly (str:string) =
    if isNull str then 0
    else str.Length
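
For comparison, the existing implementation has roughly the following shape (a sketch of the pattern described above, not a verbatim copy of the FSharp.Core source): String.IsNullOrEmpty already reads Length for the emptiness test, and then Length is read again for the result, which is where the two get_Length calls come from.

// roughly the current shape (sketch, not verbatim FSharp.Core source)
let current (str: string) =
    if String.IsNullOrEmpty str then 0
    else str.Length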

Run in a loop to make it measurable, adjusted with OperationsPerInvoke:
[screenshot: benchmark results]

(Note that this is not a pure measurement, as there's auxiliary code present, but the only difference between the two is the body of the String.length function.)

It's interesting to take a peek at the disassembly here. If we zoom in on the body of the loop, which is inlined by the JIT, this is what the original String.length gives us. As can be seen, it does two pointer dereferences to calculate the length:

; FSharp.Perf.BenchLength.current()
       sub       rsp,28
       mov       ecx,[rcx+8]
       call      FSharp.Perf.Data.get(Int32)
       xor       edx,edx
M00_L00:  // loop body, i.e., String.length
       test      rax,rax              // rax & rax, in other words: null-check, sets flags
       je        short M00_L01        // if zero (null), skip the length read
       cmp       dword ptr [rax+8],0  // compare string-length to 0
       jbe       short M00_L01        // if zero, skip the length read
       mov       ecx,[rax+8]          // otherwise, read the string length and fall through
M00_L01:
       inc       edx
       cmp       edx,2711
       jl        short M00_L00
       add       rsp,28
       ret
; Total bytes of code 43

When we compare that to the disassembly where we only use isNull, the extra pointer dereference is gone completely and we have only three short instructions left in the body:

; FSharp.Perf.BenchLength.nullOnly()
       sub       rsp,28
       mov       ecx,[rcx+8]
       call      FSharp.Perf.Data.get(Int32)
       xor       edx,edx
M00_L00:  // loop body, i.e., nullOnly method
       test      rax,rax        // rax & rax, for null-check, sets flags
       je        short M00_L01  // if zero (null), skip the length read
       mov       ecx,[rax+8]    // read the string-length from the string and fall through
M00_L01:
       inc       edx
       cmp       edx,2711
       jl        short M00_L00
       add       rsp,28
       ret
; Total bytes of code 37

As you can see, we are down from five assembly instructions (incl two pointer dereferences) to only three assembly instructions.

Maybe a bit much of an analysis for such a small function, but it was fun to check what goes on under the hood. Note that the assembly can look different in other scenarios because of JIT optimizations.

It's also good to find out that the JIT does not do an extra null-check for the callvirt IL instruction, as it already appears to know that the reference cannot be null at that point. In fact, the whole call to get_Length disappears and essentially boils down to a single mov ecx, [rax+8].

String.exists && String.forall

This function takes a predicate that operates on a char. There's no direct equivalent in the .NET stdlibs. I've tried several alternative ways of coding this, esp. since at first sight it seems the original is not tail-recursive, but the recursion is recognized by the JIT anyway and it becomes an efficient loop.

The same is true for String.forall.

The only change to make here is removing the redundant call to get_Length, which only has an impact when searching very short strings.
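
As an illustration, reading the length once outside the loop instead of on every iteration could look like this (a sketch of the change described, not the exact FSharp.Core code):

/// Sketch: String.exists with get_Length called only once
let existsOpt (predicate: char -> bool) (str: string) =
    if isNull str then false
    else
        let len = str.Length  // read the length once, outside the loop
        let rec check i =
            i < len && (predicate str.[i] || check (i + 1))
        check 0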

tbc...

abelbraaksma (Contributor, Author) commented Jun 15, 2020

Next up:

String.replicate

There are some serious optimizations possible here, esp. if we were to open up the use of cpblk or initblk (they've been optimized, see dotnet/coreclr#7198; we'll maybe get there when we start on the array functions), but for this round I'll stick to optimizing with a simple Array.Copy, which internally calls an optimized native method already.

This gives us between 1.2x and 8x (!) performance improvements. Occasionally, the Array.Copy method can be a few percent slower, but this seems to be influenced by the LOH when doing this on really large strings and counts. Since the general case is likely small strings, this greatly improves the general case.

The number of iterations and memcopies is now O(log n) instead of O(n), and there's no redundant memory allocated (other than for the required copy of the string at the end).

The optimization in code:

    let replicateOpt count (str:string) =
        if isNull str then
            String.Empty
        else
            let len = str.Length
            let target = Array.zeroCreate (len * count)
            let source = str.ToCharArray()
            Array.Copy(source, 0, target, 0, len)
            let mutable i = len
            while i * 2 < target.Length do
                Array.Copy(target, 0, target, i, i)
                i <- i * 2

            Array.Copy(target, 0, target, i, target.Length - i)
            String target

Speed:
[screenshot: speed results]

Memory:
[screenshot: memory results]

String.init

This one proves to be tricky to optimize, as the performance distribution of the new array-based version and of the existing StringBuilder-based implementation is not uniform, for either memory or speed. This makes comparison hard. When count (the number of iterations) is small, the new version is the clear winner, but at larger counts, esp. if the returned strings are small (range 1-2 chars), the original is much faster and uses less memory.

When it comes to a random distribution of strings in the range 0-200 chars, the difference is negligible*:

[screenshot: benchmark results]

* EDIT: the "empty" category wasn't actually empty, but by accident the same as "5or16". And "random" used Random::Next, which caused too much noise. You can disregard those rows. Oh, and replicateOpt really is initOpt; I just missed it while renaming.

Unless I can think of something better (unrolling the original had no effect either), this one should probably remain as-is, since we won't be able to predict the average result of the initializer function.

The code:

let initOpt count (initializer: int -> string) =
    let target = Array.zeroCreate count
    for i=0 to count - 1 do 
        target.[i] <- initializer i

    String.Concat target

tbc (only String.collect remains)

abelbraaksma (Contributor, Author) commented Jun 16, 2020

String.init revisited

I couldn't let go. From the prev. measurements it was clear that using the highly optimized String.Concat was significantly faster on small counts, so I leveraged that, and now anything below count = 8 is between 12x (!!) and 1.1x faster, most of it over 2x faster, depending on the size of count and the size of the input (larger strings mean greater improvement).

Memory use and GC pressure also drop by 2x-8x.

[screenshot: benchmark results]

The code (above count = 4, String.Concat takes a params array; also note that the count > 8 case will be put up top by the compiler):

/// Using ParamArray as an optimized version of Array gives much better performance on small counts (up to 10x faster).
let initOpt (count:int) (initializer: int-> string) =
    match count with
    | 0 -> String.Empty
    | 1 -> initializer 0
    | 2 -> String.Concat(initializer 0, initializer 1)
    | 3 -> String.Concat(initializer 0, initializer 1, initializer 2)
    | 4 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3)
    | 5 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4)
    | 6 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4, initializer 5)
    | 7 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4, initializer 5, initializer 6)
    | 8 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4, initializer 5, initializer 6, initializer 7)
    | _ when count > 8 ->
        let res = StringBuilder count
        for i = 0 to count - 1 do 
            res.Append(initializer i) |> ignore
        res.ToString()
    | _ ->
        failwith "error"  // will become invalidArgInputMustBeNonNegative 

abelbraaksma (Contributor, Author) commented Jun 16, 2020

String.init revisited (2) and final

It turned out that, through copy and pasting, I made mistakes in the first measurements, wrongly believing that the array-based approach was that bad. I stabilized "random" between runs, so that all measurements can be compared, and I fixed the code where the wrong functions were called.

  • The fix now special-cases very small counts and unrolls the loop a little
  • Special-casing was slower than the array-based approach above count = 4, so I dropped it there
  • If the initializer function only returns very small strings of 1-3 chars, there's a performance degradation
  • If the output of the initializer is 4 or 5 chars, it evens out; if it is random, larger than that, or zero chars, the array-based version wins in all cases
  • Memory pressure is, on average, 60% less than before; GC pressure is oftentimes even 4x better.

Let's get the numbers:

[screenshot: benchmark results]

The code is now slightly more complex, but still not by much:

/// 2-10x faster on small counts, about 20% faster for very large count, for all cases, less memory pressure.
let initWithArray count (initializer: int -> string) =
    if count < 5 then
            match count with
            | 0 -> String.Empty
            | 1 -> initializer 0
            | 2 -> String.Concat(initializer 0, initializer 1)
            | 3 -> String.Concat(initializer 0, initializer 1, initializer 2)
            | 4 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3)
            | _ -> failwith "error" // arg error
    else
        let target = Array.zeroCreate count
        let mutable i = 0
        while i < count - count % 4 do 
            target.[i] <- initializer i
            target.[i + 1] <- initializer(i + 1)
            target.[i + 2] <- initializer(i + 2)
            target.[i + 3] <- initializer(i + 3)
            i <- i + 4

        while i < count do
            target.[i] <- initializer i
            i <- i + 1

        String.Concat target

If we zoom in on the break-even point, where the old version was faster than the new version, it is here. And as you can see, there's still less GC pressure:

[screenshot: benchmark results]

forki (Contributor) commented Jun 16, 2020

Are you planning to send fixes?

abelbraaksma (Contributor, Author) commented Jun 16, 2020

@forki, absolutely, as soon as the string functions are done (one left), I'll send in a PR. I figured it made more sense to first do the exercise, timings etc., so that the reasoning behind the fixes, which isn't always obvious, is at least logged.

forki (Contributor) commented Jun 16, 2020 via email

abelbraaksma (Contributor, Author) commented Jun 16, 2020

> Please do it as separate PRs so that it's easier to get those accepted

@forki I was planning to do one PR for the string stuff (and others for array or whatever else comes my way), which is really just a few lines in a single file. But if you think it is better to do it function-by-function, I can do that; I just thought that would make it harder instead of easier...

@cartermp, @TIHan, @dsyme, any preference here, so that I don't have to do the PRs twice?

forki (Contributor) commented Jun 16, 2020 via email

cartermp (Contributor) commented:

Yep, the smaller/more contained the better

abelbraaksma (Contributor, Author) commented:

Ok, got it, I'll make a separate PR for each function then, and I'll link each to the relevant comment.

abelbraaksma (Contributor, Author) commented Jun 17, 2020

@KevinRansom, an example that uses the buffered StringBuilder, as we discussed, is the String.concat function. On sequences it performs better than the other methods. It uses System.String::Join, which internally uses the buffered SB (but even the BCL does not use that if the input is an array). See the timings here: #9390 (comment)

KevinRansom (Member) commented:

@abelbraaksma, thanks, I will look at it and try to understand why.

cartermp (Contributor) commented:

@abelbraaksma how far along do you think this one is in terms of getting to closure?

abelbraaksma (Contributor, Author) commented Dec 22, 2020

@cartermp, I think there are quite a few improvements already, but I've yet to check in the array function improvements. I might take another stab at it during the holidays (won't be going anywhere anyway, it's still 2020 :))

dsyme (Contributor) commented Mar 30, 2022

I'm going to close this out. @abelbraaksma, this was an amazing analysis - can you add new PRs or issues for any other specific small improvements you still see as possible?

dsyme closed this as completed Mar 30, 2022
abelbraaksma (Contributor, Author) commented:

@dsyme, sure! I forgot about this one. The thread is still useful for other areas of optimization, but I agree that they should go into smaller, more contained issues and PRs.
