Performance of certain built-in functions, esp. the ones related to Array manipulation #9390
@abelbraaksma great observation. Would you be interested in submitting a PR to adjust these functions?
@cartermp, sure thing, I just wasn't certain how much "performance sugar" is allowed. I'll submit a PR (will probably mark it WIP, as this may need some iterations) and have you comment on whether certain performance doping is allowed or not.
@cartermp, one other question: for perf comparison, would it be enough to just run RyuJIT with .NET 4.8 on 64-bit, plus one LegacyJit run on x86, or should I test more system configurations (.NET Core, Standard; I don't have Mono installed...)? It should be enough to focus on the 80% users group, right, and not focus too much on niche configurations?
I think .NET Core (latest) and .NET Framework (any version) would be good enough to run some benchmarks against. Regarding performance sugar: yes, I think a PR would be fine.
After playing around a little with the various ways to implement `String.map`, here's what I tried:
My own verdict:
Note that every alternative I tried was faster than the current implementation for average input, and almost all of them have less GC pressure and memory use. I've also tested with very large inputs, up to 100MB; there the perf difference usually shifts further in favor of the new algorithms, because much less memory is used. Benchmarks were run with BDN, using 0.01 max relative error; originally each test was run with a variety of input sizes.
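As a quick sanity check outside BDN (the BDN numbers above are the authoritative ones; this is only a smoke test), a minimal Stopwatch harness for comparing `String.map` against the `fasterMap` candidate from the listing below might look like this. The iteration count and input length are arbitrary choices for illustration:

```fsharp
open System
open System.Diagnostics

// fasterMap as in the listing in this thread: map over a char array in place.
let fasterMap (mapper: char -> char) (str: string) =
    if String.IsNullOrEmpty str then String.Empty
    else
        let result = str.ToCharArray()
        for i in 0 .. result.Length - 1 do
            result.[i] <- mapper result.[i]
        String result

// Crude wall-clock timing; keep the last result alive so the work isn't optimized away.
let time name (f: unit -> string) =
    let sw = Stopwatch.StartNew()
    let mutable last = ""
    for _ in 1 .. 100000 do
        last <- f ()
    printfn "%s: %d ms (len %d)" name sw.ElapsedMilliseconds last.Length

let input = String.replicate 20 "abcde"   // 100-char input
time "String.map" (fun () -> String.map Char.ToUpper input)
time "fasterMap"  (fun () -> fasterMap Char.ToUpper input)
```

Stopwatch loops like this are noisy (no warmup, no statistical treatment), which is exactly why the real comparisons below use BenchmarkDotNet.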
BenchmarkDotNet=v0.12.1, OS=Windows 7 SP1 (6.1.7601.0)
Intel Xeon CPU X5680 3.33GHz, 2 CPU, 12 logical and 12 physical cores
Frequency=3247119 Hz, Resolution=307.9653 ns, Timer=TSC
.NET Core SDK=3.1.300-preview-015135
[Host] : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT DEBUG
.NET Core 3.1 : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT
Job=.NET Core 3.1 Runtime=.NET Core 3.1
Benchmark result tables (grouped by input size): 0 and 1 chars; 5 and 10 chars; 31 and 100 chars; 256 and more chars.
The code:

```fsharp
module StringMap =

    /// Basic String.map replacement: Array.map over a char array.
    /// dotnet mem: 6x(!) string size
    let fastMap mapper str =
        if String.IsNullOrEmpty str then String.Empty
        else
            str.ToCharArray()
            |> Array.map mapper
            |> String

    /// A faster String.map replacement, inlining the loop instead of calling Array.map. Uses a char array.
    /// NOTES: used as baseline for StringMapUnsafe.
    /// dotnet mem: 4x string size
    let fasterMap mapper str =
        if String.IsNullOrEmpty str then String.Empty
        else
            let result = str.ToCharArray()
            let mutable i = 0
            for c in result do
                result.[i] <- mapper c
                i <- i + 1
            String result

    /// Uses while instead of for..in; otherwise the same as fasterMap.
    /// dotnet mem: 4x string size
    let whileMap mapper str =
        if String.IsNullOrEmpty str then String.Empty
        else
            let result = str.ToCharArray()
            let mutable i = 0
            while i < result.Length do
                result.[i] <- mapper result.[i]
                i <- i + 1
            String result

    /// Like fasterMap, but uses Array.zeroCreate with the string length.
    /// dotnet mem: 4x string data
    let allocateMap mapper (str: string) =
        if String.IsNullOrEmpty str then String.Empty
        else
            let result = Array.zeroCreate str.Length
            let mutable i = 0
            for c in str do
                result.[i] <- mapper c
                i <- i + 1
            String result

    /// String.map equivalent optimized for very small maps. The break-even point is around 4 chars.
    let smallMap (mapper: char -> char) (str: string) =
        let len = String.length str
        if len > 4 then
            let result = str.ToCharArray()
            let mutable i = 0
            for c in result do
                result.[i] <- mapper c
                i <- i + 1
            String result
        elif len > 2 then
            if len = 3
            then String [| mapper str.[0]; mapper str.[1]; mapper str.[2] |]
            else String [| mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3] |]
        elif len = 2 then
            String [| mapper str.[0]; mapper str.[1] |]
        elif len = 1 then
            (mapper str.[0]).ToString()
        else
            String.Empty

    /// String.map equivalent that uses safe x2 unrolling. Only 2-5% faster than fasterMap above;
    /// it suffers from non-elimination of range checks.
    let unrollMapx2 (mapper: char -> char) (str: string) =
        let len = String.length str
        if len > 4 then
            let result = str.ToCharArray()
            let mutable i = 0
            while i < len - len % 2 do
                result.[i] <- mapper result.[i]
                result.[i + 1] <- mapper result.[i + 1]
                i <- i + 2
            // the remainder
            if len % 2 = 1 then
                result.[i] <- mapper result.[i]
            String result
        elif len > 2 then
            if len = 3
            then String [| mapper str.[0]; mapper str.[1]; mapper str.[2] |]
            else String [| mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3] |]
        elif len = 2 then
            String [| mapper str.[0]; mapper str.[1] |]
        elif len = 1 then
            (mapper str.[0]).ToString()
        else
            String.Empty

    /// String.map equivalent that uses safe x4 unrolling. Only 2-5% faster than fasterMap above;
    /// it suffers from non-elimination of range checks.
    let unrollMapx4 (mapper: char -> char) (str: string) =
        let len = String.length str
        if len > 4 then
            let result = str.ToCharArray()
            let mutable i = 0
            while i < len - len % 4 do
                result.[i] <- mapper result.[i]
                result.[i + 1] <- mapper result.[i + 1]
                result.[i + 2] <- mapper result.[i + 2]
                result.[i + 3] <- mapper result.[i + 3]
                i <- i + 4
            // the remainder
            while i < len do
                result.[i] <- mapper result.[i]
                i <- i + 1
            String result
        elif len > 2 then
            if len = 3
            then String [| mapper str.[0]; mapper str.[1]; mapper str.[2] |]
            else String [| mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3] |]
        elif len = 2 then
            String [| mapper str.[0]; mapper str.[1] |]
        elif len = 1 then
            (mapper str.[0]).ToString()
        else
            String.Empty
```
```fsharp
#nowarn "9" // required for the fixed / NativePtr code below

open System
open System.Runtime.InteropServices
open FSharp.NativeInterop

module StringMapUnsafe =

    /// Unroll x2. Reads from the source as str.[i]; writes via NativePtr into a copy
    /// created with String.Copy + fixed; returns the modified copy.
    let fixedCopyx2 (mapper: char -> char) (str: string) =
        let len = String.length str
        let result = String.Copy str
        use shadow = fixed result
        let mutable i = 0
        while i < len - len % 2 do
            NativePtr.set shadow i (mapper str.[i])
            NativePtr.set shadow (i + 1) (mapper str.[i + 1])
            i <- i + 2
        while i < len do
            NativePtr.set shadow i (mapper str.[i])
            i <- i + 1
        result

    /// Unroll x4. Reads from the source as str.[i]; writes via NativePtr into a copy
    /// created with String.Copy + fixed; returns the modified copy.
    let fixedCopyx4 (mapper: char -> char) (str: string) =
        let len = String.length str
        let result = String.Copy str
        use shadow = fixed result
        let mutable i = 0
        while i < len - len % 4 do
            NativePtr.set shadow i (mapper str.[i])
            NativePtr.set shadow (i + 1) (mapper str.[i + 1])
            NativePtr.set shadow (i + 2) (mapper str.[i + 2])
            NativePtr.set shadow (i + 3) (mapper str.[i + 3])
            i <- i + 4
        while i < len do
            NativePtr.set shadow i (mapper str.[i])
            i <- i + 1
        result

    /// Unroll x4. Creates a copy with String.Copy + fixed and uses NativePtr for both
    /// reads and writes; returns the modified copy.
    let fixedCopy (mapper: char -> char) (str: string) =
        let len = String.length str
        // a copy to use as result (we don't want to overwrite the input string!)
        let result = String.Copy str
        // pointer to the result string
        use shadow = fixed result
        // loop unroll
        let mutable i = 0
        while i < len - len % 4 do
            // NOTES:
            // - read/write to the same memory address appears to be somewhat faster than
            //   reading from the string by index, as with str.[i]
            // - while fewer IL instructions are generated when calculating the address
            //   ourselves and then using NativePtr.read/write, this appears to be faster (and safer)
            NativePtr.set shadow i (mapper (NativePtr.get shadow i))
            NativePtr.set shadow (i + 1) (mapper (NativePtr.get shadow (i + 1)))
            NativePtr.set shadow (i + 2) (mapper (NativePtr.get shadow (i + 2)))
            NativePtr.set shadow (i + 3) (mapper (NativePtr.get shadow (i + 3)))
            i <- i + 4
        // process the remainder
        while i < len do
            NativePtr.set shadow i (mapper (NativePtr.get shadow i))
            i <- i + 1
        result

    /// Unroll x4. Creates a copy with String.Copy, reads through a ReadOnlySpan over that copy
    /// and writes through a marshaled Span<char> over the same copy; returns the modified copy.
    /// Performance: ~20-30% faster than fasterMap.
    let spanCopy (mapper: char -> char) (str: string) =
        let len = String.length str
        let resultStr = String.Copy str
        // both spans must cover the copy, not the input, so the input string is never mutated
        let roSpan = resultStr.AsSpan()
        let writeSpan = MemoryMarshal.CreateSpan(&MemoryMarshal.GetReference roSpan, len)
        // loop unroll
        let mutable i = 0
        while i < len - len % 4 do
            writeSpan.[i] <- mapper roSpan.[i]
            writeSpan.[i + 1] <- mapper roSpan.[i + 1]
            writeSpan.[i + 2] <- mapper roSpan.[i + 2]
            writeSpan.[i + 3] <- mapper roSpan.[i + 3]
            i <- i + 4
        // process the remainder
        while i < len do
            writeSpan.[i] <- mapper roSpan.[i]
            i <- i + 1
        resultStr

    /// No unroll, uses String.Create with a Span.
    /// Performance: 0-20% faster than fasterMap, but fast only on larger (64kb) input.
    let spanCreate1 (mapper: char -> char) (str: string) =
        let len = String.length str
        String.Create(len, str, fun buf v ->
            let mutable i = 0
            for c in v do
                buf.[i] <- mapper c
                i <- i + 1)

    /// No unroll, uses String.Create with a Span; no closure (state passed as a tuple).
    /// Performance: 0-22% faster than fasterMap; the larger the input, the bigger the gain (sometimes slower!).
    let spanCreate2 (mapper: char -> char) (str: string) =
        let len = String.length str
        String.Create(len, (str, mapper), fun buf (str, f) ->
            let mutable i = 0
            for c in str do
                buf.[i] <- f c
                i <- i + 1)

    /// Unroll x4, uses String.Create with a Span. All indexed operations on spans.
    /// Performance: 15-32% (!) faster than fasterMap; the larger the input, the bigger the gain.
    let spanCreateUnrollx4 (mapper: char -> char) (str: string) =
        let len = String.length str
        String.Create(len, (str, mapper), fun target (str, f) ->
            let mutable i = 0
            let source = str.AsSpan()
            while i < len - len % 4 do
                target.[i] <- f source.[i]
                target.[i + 1] <- f source.[i + 1]
                target.[i + 2] <- f source.[i + 2]
                target.[i + 3] <- f source.[i + 3]
                i <- i + 4
            // process the remainder
            while i < len do
                target.[i] <- f source.[i]
                i <- i + 1)
```
@dsyme or @cartermp, can I get a confirmation on whether pointer- or span-based approaches are acceptable here? There are already dramatic perf improvements using the other methods, but I suspect those techniques may be off-limits.
Yes, spans are off-limits (unless we decide to move FSharp.Core to also bring in a dependency on System.Memory). I'd actually love to do that, but I know that @dsyme is quite allergic to them. I think we should strive for consistency in implementation here; array and list internals tend not to use pointers to accomplish things. I'd like for us to eke out as much perf and memory as possible if we can, but I think that ultimately the best way forward is with Span, when it's appropriate to add it to FSharp.Core. It strikes the right balance of perf, readability, and safety. So the second-best implementation is probably sufficient, and a clear upgrade anyways.
If, for (which in turn gives us also
Agreed. OK, I'll move forward without the "extreme" optimizations then.
Next up: the concat functions. TL;DR conclusions: I think these numbers can be expected. Below, bold is fastest; I didn't select a champion if it was within the margin of error.
BenchmarkDotNet=v0.12.1, OS=Windows 7 SP1 (6.1.7601.0)
Intel Xeon CPU X5680 3.33GHz, 2 CPU, 12 logical and 12 physical cores
Frequency=3247119 Hz, Resolution=307.9653 ns, Timer=TSC
.NET Core SDK=3.1.300-preview-015135
[Host] : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT DEBUG
Job-FRSYQM : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT
MaxRelativeError=0.02 Runtime=.NET Core 3.1 MaxIterationCount=10
MaxWarmupIterationCount=5 MinIterationCount=5 MinWarmupIterationCount=2
Big inputs, 819200 items, 4874240 chars total, ~9.3MB
The source code:

```fsharp
let private concatArray sep (strings: string []) =
    match String.length sep with
    | 0 -> String.Concat strings
    | 1 -> String.Join(sep.[0], strings, 0, strings.Length)
    | _ -> String.Join(sep, strings, 0, strings.Length)

let concatOpt sep (strings: seq<string>) =
    match strings with
    | :? array<string> as arr -> concatArray sep arr
    | :? list<string> as lst -> lst |> List.toArray |> concatArray sep
    | _ ->
        // special-casing Seq for sep = String.Empty is *not* faster, due to the
        // overhead of Seq.toArray and the already well-optimized String.Join
        String.Join(sep, strings)
```
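To make the type-test dispatch concrete, here is a small, self-contained usage sketch of the two functions above (with `private` dropped so it runs as a script; the separators and inputs are arbitrary illustration values). Each path produces the same observable result; only the allocation profile differs:

```fsharp
open System

let concatArray sep (strings: string []) =
    match String.length sep with
    | 0 -> String.Concat strings
    | 1 -> String.Join(sep.[0], strings, 0, strings.Length)
    | _ -> String.Join(sep, strings, 0, strings.Length)

let concatOpt sep (strings: seq<string>) =
    match strings with
    | :? array<string> as arr -> concatArray sep arr                // array fast path, no copy
    | :? list<string> as lst -> lst |> List.toArray |> concatArray sep  // one copy, then array path
    | _ -> String.Join(sep, strings)                                // generic seq fallback

printfn "%s" (concatOpt ", " [| "a"; "b"; "c" |])            // array path: "a, b, c"
printfn "%s" (concatOpt "" [ "a"; "b" ])                     // list path, empty sep -> String.Concat: "ab"
printfn "%s" (concatOpt "-" (Seq.map string [ 1 .. 3 ]))     // seq fallback: "1-2-3"
```

The point of the sketch is that arrays and lists skip `String.Join`'s enumerator-based overload entirely, which is where the measured gains come from.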
Next up:
Are you planning to send fixes?
@forki, absolutely: as soon as the string functions are done (one left), I'll send in a PR. I figured it made more sense to first do the exercise, timings etc., so that the reasoning behind the fixes, which isn't always obvious, is at least logged.
Please do it as separate PRs so that it's easier to get those accepted
@forki I was figuring to do one PR for the string stuff (others for arrays or whatever else comes my way), which is really just a few lines in a single file. If you think it is better to do it function-by-function, I can do that, but I thought that would make it harder instead of easier... @cartermp, @TIHan, @dsyme, any preference here, so that I don't have to do the PRs twice?
From experience in this repo I can say that bigger performance PRs will sometimes have a hard time to get in.
Yep, the smaller/more contained the better.
Ok, got it. I'll make a separate PR for each function then, and I'll link each to the relevant comment.
@KevinRansom, an example that uses buffered
@abelbraaksma, thanks, I will look at it and try to understand why.
@abelbraaksma how far along do you think this one is in terms of getting to closure?
@cartermp, I think there are quite a few improvements already, but I've yet to check in the array function improvements. I might get another stab at it during the holidays (won't be going anywhere anyway, it's still 2020 :))
I'm going to close this out. @abelbraaksma, this was an amazing analysis; can you add new PRs or issues for any other specific small improvements you still see as possible?
@dsyme, sure! I forgot about this one; the thread is still useful for other areas of optimization, but I agree that they should go into smaller, more containable issues and PRs.
I've noticed for some time now that quite a few functions that are called in many places are suboptimal in terms of performance. For instance, `Array.create`, `Array.init`, `Array.concat` and `Array.filter` can be improved upon, sometimes with about 1.2-1.5x the speed, sometimes even manyfold (I got `Array.replicate times source |> Array.concat` 25x faster with a dedicated `Array.repeat`, so that was a bit of cheating). Sometimes such improvements relate specifically to large arrays, sometimes specifically to small ones. In line with that, `String.xxx` functions that use `StringBuilder` improve by about 20-30% when it is replaced with `ToCharArray()` instead.

Before I start making a PR, which would certainly include showing the timings with BenchmarkDotNet for a variety of inputs, would such a PR be welcome? While nothing I tried is really complex, and what I tried is provably correct, as always with optimizations, they will increase the complexity of the code.

To get some idea, here's `String.map` vs a very simple optimization, `str.ToCharArray() |> Array.map mapper |> String`, which in many cases wins 40% (needs more input-length comparisons, though; this was with a 12-char string):
, which in many cases wins 40% (needs more input lengths comparisons, though, this was with 12-char string):The text was updated successfully, but these errors were encountered: