-
Notifications
You must be signed in to change notification settings - Fork 790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CompilerPerf] [WIP] Optimizations on Map and Set (post #5307) #5360
Conversation
Gist: https://gist.github.com/manofstick/275fe8ed62091aec52cd382548719f2a Time comparison between post-#5307 and this PR (this is the affect of removal of MapOne)
(edit: updated times after inlining) |
Gist: https://gist.github.com/manofstick/e97dc9775bf01fd22b2f238cac9f1c27 This gist was to demonstrate performance improvement in creation of small-to-mid sized Map via
|
I’m not sure MS is gonna merge such PRs anytime soon (in 2 years at least). |
According to @cartermp "perf" is no. 1 theme for now. I think they will
merge it.
That said IIRC there are 2 map implementations here and 2 sets. I guess all
4 need these changes
Vasily Kirichenko <notifications@github.com> schrieb am Sa., 21. Juli 2018,
08:41:
… I’m not sure MS is gonna merge such PRs anytime soon (in 2 years at least).
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#5360 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AADgNJx9NscMKAFd6L0i_9PZys055POVks5uIs0EgaJpZM4VZOOl>
.
|
Meh... they do, they don't. Not really any skin off my nose. I just so this stuff because I find sukuku boring... |
Everything that improves perf on map, will improve compiler perf (given that it also applies to that second internal map implementation). It's great to see it happen. |
Then OK :) |
src/fsharp/FSharp.Core/map.fs
Outdated
if t2h > t1h + 2 then (* right is heavier than left *) | ||
let t1h = size t1 | ||
let t2h = size t2 | ||
if (t2h >>> 1) > t1h then (* right is heavier than left *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You changed from -2 to /2 is this intended? Is it because we now control size instead of height? If so then the identifier should be renamed to reflect that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do removing of MapOne and move from controlling of height to size in a separate pull request.
src/fsharp/FSharp.Core/map.fs
Outdated
let size x = | ||
match x with | ||
| MapEmpty -> 0 | ||
| MapNode (_,_,_,_,h) -> h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should rename h here to size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also make it inline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently all this is hanging off the end of #5307. Maybe I should rebranch off current master?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@manofstick if possible, this would be best, yes. We can evaluate this separately which makes it easier to review, test, and incorporate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cartemp
Happy to do so, if you are happy to trade memory for performance (which is what the removal of MapOne means, as per PRs text). I mean I'm happy to spend time doing this as long as it's not completely my wasted time...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, we can think about it. From the standpoint of making it easier to merge, having this be entirely separate is best, but if doing so is too difficult (or at least wouldn't make much sense without the others) then I think it's fine to keep working as-is.
Since #5307 is marked as approved for F# vNext, I've tagged it with the dev16.0 milestone, this could just as well stay as it is today and be considered in the context of #5307. It's just that our evaluation of it will also depend on that, and it is generally less likely for something to come in if it takes a dependency on something else coming in. But I think that risk is smaller given that the dependent branch is on the path towards approval.
Just to note I've already marked the three PRs that this one depends on as "Candidate for F# vNext". I'll do the same with this one. The first of the PRs is basically certain to go in, on last review it was exactly where I want it to be. Obviously each one is progressively less likely due to the compounding nature of likelihood, but the perf results in this one are excellent and consistent. |
Impressive work and good performance testing. |
...and let's now tests against the competition! Round 1: Creation of treesTested vs This test is building the associative container 1 integer at a time (and sanity checks that one of the elements exists in the container). I provide data in "Stripes" or Randomly. "Striped" data is chunks of ordered data. Both random and Striped are run under 10 different seed values (same seed set used for both containers). Each run is from unique from a cold start where the size and seed are passed as parameters on the command line. I also provide two types of objects where the CompareTo Functionality is either fast (with an The gist, including driver and raw results, is here. We see that with the "Slow" Comparison Type we are slightly slower that ImmutableSortedDictionary, but we are stable across Sizes. With a "Fast" Compariosn Type we are markable faster - especially on partially ordered data (i.e. in "Stripes").
|
Round 2: Finding elementsSlightly modified from Round 1, this version just creates the tree once, and then continually queries all the elements. The gist is here. Start to get worse behaviour under 64-bit for large collections.
|
Argh, stuffed up these tests. Will fix soon... ish... |
Set comparison test Gist: https://gist.github.com/manofstick/20ac739756c2063a0a2910e388622866
|
@buybackoff are you going to send a Pull request? |
what's this |
Only if it will be merged. This PR is open so I do not understand how it will work if I send another one. I believe updating only the layout and keeping height instead of size will be a quite straightforward single file change with near 2x perf boost for one of the most important collection, so it's very tempting to do that myself. If it will take many days or weeks of discussing anything else than low hanging major performance win I will rather pass. I only need ISeries members, most important of which are not a part of F# Map. I will likely rewrite the whole thing in highly optimized C# if there is no clear path to merge this new layout in this repo. If I change the code back to using
|
Absolutely zero-cost cast. Allows to work with an instance as if it is from a different type. Slow At IL level it's just |
I was boxing KeyComparer on every operation, so non-zero GC/memory. Actually that case is 40% faster and GC/mem is zero.
The case with default comparer just becomes
The difference between 46 and 58 MB is expected since the number of leaves and nodes is approximately equal for balanced tree. |
Haven't looked at this in detail, but from what I can see at a cursory glance is that you're not using the LanguagePrimitives comparer. This was the whole point of #5307 and it's predecessors, which this PR is really just finalizing. Also in the code as it is in this PR, the single value discriminated union means and some null checks also should be (basically) same performance as the change to the class. I had done some testing at the time and found negligible differences. Let me know if I'm misreading your contribution. Thanks, |
Memory usage. It's the worst thing about trees and you proposed to make it worse for no particular reason. In my original comment I quoted you, but not fully:
Everything else is similar to S.C.I.ISD and single-case wide DU from this PR, as I wrote in the last comment above. Another point is that your changes are so big that merging it will take time and is blocked on 5307, while there are three independent changes:
They could be in any order and released incrementally. Changing layout is the simplest of them with quick performance gain. |
% wise for the complete objects it's small. If you are worried about memory then you're using the wrong data structure.
Well just go and create a different PR then. I don't think any comments here do anything but muddy the already murky water of this PR chain? |
Actually I need this data structure for structural sharing to save memory and void copying large data when I need maps at different timestamps with little changes. There is no real-world reason to use trees as an |
OK, I hadn't looked properly. Leaves are smaller type, and using inheritance for other nodes. OK, cool, sounds good. |
The overhead could be smaller if StructLayout attribute is enabled / applied to DU (Please note, classes are also allowed to have StructLayout attribute and field layout attributes): |
Why? Additional overhead to leaves is |
If you pack the fields (Pack=1 attribute) there will be no padding. Less memory consumption in exchange for (a bit) ~less efficient data access. Also unless the layout is fixed and pack=1 is specified the JIT free to reorder the fields and change the padding, so using the Unsafe.As with may give some surprising result. |
A smaller let rec find (comparer: KeyComparer<'K>) k (m:MapTree<'K,'b>) =
if isEmpty m then raise (System.Collections.Generic.KeyNotFoundException())
else
let c = comparer.Compare(k,m.Key)
if c = 0 then m.Value
else
match m with
| :? MapTreeNode<'K,'b> as mn ->
find comparer k (if c < 0 then mn.Left else mn.Right)
| _ -> raise (System.Collections.Generic.KeyNotFoundException())
(The benchmark is adding/getting 1M The last two lines in the table show only the effect of layout change, the first line is with inlined |
.NET guarantees field alignment so that access to them is atomic (when < word size). When possible to pack without breaking pointer alignment .NET does just right thing. Using https://github.com/SergeyTeplyakov/ObjectLayoutInspector for Type layout for 'MapTreeNode`2'
Size: 24 bytes. Paddings: 0 bytes (%0 of empty space)
|===================================|
| Object Header (8 bytes) |
|-----------------------------------|
| Method Table Ptr (8 bytes) |
|===================================|
| 0-1: Int16 Key@ (2 bytes) |
|-----------------------------------|
| 2-3: Int16 Value@ (2 bytes) |
|-----------------------------------|
| 4-7: Int32 Height@ (4 bytes) |
|-----------------------------------|
| 8-15: MapTree`2 Left@ (8 bytes) |
|-----------------------------------|
| 16-23: MapTree`2 Right@ (8 bytes) |
|===================================| Note that Height us moved above pointers. Same is true for Type layout for 'MapTreeNode`2'
Size: 32 bytes. Paddings: 0 bytes (%0 of empty space)
|===================================|
| Object Header (8 bytes) |
|-----------------------------------|
| Method Table Ptr (8 bytes) |
|===================================|
| 0-7: Int64 Key@ (8 bytes) |
|-----------------------------------|
| 8-11: Int32 Value@ (4 bytes) |
|-----------------------------------|
| 12-15: Int32 Height@ (4 bytes) |
|-----------------------------------|
| 16-23: MapTree`2 Left@ (8 bytes) |
|-----------------------------------|
| 24-31: MapTree`2 Right@ (8 bytes) |
|===================================|
Tight packing for structs only works when there are no reference type fields. |
@manofstick , @dsyme -- please don't think I am picking on you :-) do we want to pursue this PR? the last commit was nearly two years ago. I am doing my best to get the number of PR's open on the repo down to a reasonable number of "owned -- active PRs" Thanks Kevin |
@KevinRansom - Meh, I think there is zero interest, so killing it. |
@manofstick, certainly not zero. These and other perf improvements looked really promising. I think recently, some of the ideas about optimizing generic comparisons for structs like DateTimes were restarted, though that seemed more like cherry picking as opposed to your general solution here. It would be nice if we could resurrect this, and related PRs, somehow. |
I think setting a goal and driving towards that one thing may be the way forward. These PR's were very broad and difficult to get ones head around. I know @TIHan is similarly frustrated, by the fact that we let expediency drive our development rather than a big rethink. Anyway, I am sorry to see these PR's go, but the repo has got very messy over the last couple of years with orphaned PR's. Usually the first page turns over pretty rapidly but if a PR migrates to the second or third page, it is pretty much toast for attention span. |
@KevinRansom Yes. Perhaps we need some way to champion certain PRs that have a broad benefit to the community, and/or are nearly complete and just need that extra push. I think about like what @cartermp was doing with the three-weekly road map updates in a pinned issue. Or somewhat like C# is doing with the championing, but then for PRs. Otherwise we risk losing some of those excellent community participations. |
Following discussion and POC code from dotnet#5360 (comment) Changes are very straightforward and do not touch public API: * Performance improves by a huge margin * Code size is smaller or same * Memory is same * No low level tricks, just simple code (see `asNode` comments for potential micro-optimizations, which are not visible after all; these comments are to be deleted) Benchmarks code is here: https://github.com/buybackoff/fsharp-benchmarks | Method | Job | BuildConfiguration | Size | Mean | Error | StdDev | Rank | Gen 0 | Gen 1 | Gen 2 | Allocated | Code Size | |------------ |------- |------------------- |--------- |------------------:|-----------------:|-----------------:|-----:|-----------:|---------:|--------:|------------:|----------:| | getItem | After | LocalBuild | 100 | 36.21 ns | 0.199 ns | 0.167 ns | 1 | - | - | - | - | 126 B | | getItem | Before | Default | 100 | 62.51 ns | 0.143 ns | 0.127 ns | 2 | - | - | - | - | 126 B | | getItem | After | LocalBuild | 10000 | 76.57 ns | 0.140 ns | 0.124 ns | 3 | - | - | - | - | 126 B | | getItem | Before | Default | 10000 | 120.02 ns | 0.182 ns | 0.170 ns | 4 | - | - | - | - | 126 B | | getItem | After | LocalBuild | 10000000 | 129.45 ns | 0.126 ns | 0.118 ns | 5 | - | - | - | - | 126 B | | getItem | Before | Default | 10000000 | 209.35 ns | 0.496 ns | 0.464 ns | 6 | - | - | - | - | 126 B | | | | | | | | | | | | | | | | containsKey | After | LocalBuild | 100 | 35.63 ns | 0.201 ns | 0.188 ns | 1 | - | - | - | - | 177 B | | containsKey | Before | Default | 100 | 64.01 ns | 0.351 ns | 0.328 ns | 2 | - | - | - | - | 276 B | | containsKey | After | LocalBuild | 10000 | 65.63 ns | 0.150 ns | 0.125 ns | 3 | - | - | - | - | 177 B | | containsKey | Before | Default | 10000 | 123.82 ns | 0.149 ns | 0.139 ns | 5 | - | - | - | - | 276 B | | containsKey | After | LocalBuild | 10000000 | 95.05 ns | 0.082 ns | 0.072 ns | 4 | - | - | - | - | 177 B | | containsKey | Before | Default | 10000000 | 204.39 ns | 0.338 ns | 0.282 ns | 6 | - | - | - | - | 276 B | | | | | | | | | | | | | | | | itemCount | After | LocalBuild | 100 | 231.39 ns | 0.406 ns | 0.360 ns | 1 | - | - | - | - | 96 B | | itemCount | Before | Default | 100 | 539.74 ns | 1.923 ns | 1.798 ns | 2 | - | - | - | - | 151 B | | itemCount | After | LocalBuild | 10000 | 33,160.50 ns | 194.709 ns | 182.131 ns | 3 | - | - | - | - | 96 B | | itemCount | Before | Default | 10000 | 63,074.34 ns | 138.682 ns | 129.724 ns | 4 | - | - | - | - | 151 B | | itemCount | After | LocalBuild | 10000000 | 62,332,911.90 ns | 252,973.481 ns | 224,254.402 ns | 5 | - | - | - | 148 B | 96 B | | itemCount | Before | Default | 10000000 | 94,745,625.56 ns | 205,640.690 ns | 192,356.429 ns | 6 | - | - | - | - | 151 B | | | | | | | | | | | | | | | | iterForeach | After | LocalBuild | 100 | 3,355.75 ns | 9.540 ns | 7.448 ns | 1 | 0.9727 | - | - | 6120 B | 291 B | | iterForeach | Before | Default | 100 | 3,866.56 ns | 10.148 ns | 8.996 ns | 2 | 0.9689 | - | - | 6120 B | 291 B | | iterForeach | After | LocalBuild | 10000 | 348,359.43 ns | 1,148.753 ns | 959.260 ns | 3 | 95.2148 | - | - | 600120 B | 291 B | | iterForeach | Before | Default | 10000 | 398,419.61 ns | 513.959 ns | 480.758 ns | 4 | 95.2148 | - | - | 600120 B | 291 B | | iterForeach | After | LocalBuild | 10000000 | 391,889,200.00 ns | 1,604,306.946 ns | 1,500,669.712 ns | 5 | 95000.0000 | - | - | 600000120 B | 321 B | | iterForeach | Before | Default | 10000000 | 445,099,028.57 ns | 1,380,498.715 ns | 1,223,776.153 ns | 6 | 95000.0000 | - | - | 600000120 B | 321 B | | | | | | | | | | | | | | | | addItem | After | LocalBuild | 100 | 181.25 ns | 0.961 ns | 0.899 ns | 1 | 0.0586 | 0.0003 | - | 369 B | 621 B | | addItem | Before | Default | 100 | 311.85 ns | 0.601 ns | 0.562 ns | 2 | 0.0586 | - | - | 369 B | 697 B | | addItem | After | LocalBuild | 10000 | 40,893.49 ns | 174.683 ns | 163.398 ns | 3 | 11.0156 | 3.2813 | - | 69324 B | 621 B | | addItem | Before | Default | 10000 | 71,746.33 ns | 130.309 ns | 121.891 ns | 4 | 11.0156 | 3.3594 | - | 69324 B | 697 B | | addItem | After | LocalBuild | 10000000 | 87,178,251.47 ns | 250,148.324 ns | 233,988.898 ns | 5 | 18680.0000 | 960.0000 | 10.0000 | 117146915 B | 621 B | | addItem | Before | Default | 10000000 | 146,799,424.80 ns | 286,531.195 ns | 268,021.458 ns | 6 | 18680.0000 | 960.0000 | 10.0000 | 117146915 B | 697 B | | | | | | | | | | | | | | | | removeItem | After | LocalBuild | 100 | 13.64 ns | 0.112 ns | 0.105 ns | 1 | 0.0064 | - | - | 40 B | 469 B | | removeItem | Before | Default | 100 | 16.38 ns | 0.071 ns | 0.067 ns | 2 | 0.0064 | - | - | 40 B | 519 B | | removeItem | After | LocalBuild | 10000 | 1,329.24 ns | 9.087 ns | 8.056 ns | 3 | 0.6372 | - | - | 4000 B | 469 B | | removeItem | Before | Default | 10000 | 1,607.21 ns | 5.566 ns | 5.206 ns | 4 | 0.6372 | - | - | 4000 B | 519 B | | removeItem | After | LocalBuild | 10000000 | 1,232,230.00 ns | 6,303.414 ns | 5,896.218 ns | 5 | 630.0000 | - | - | 4000000 B | 469 B | | removeItem | Before | Default | 10000000 | 1,801,088.33 ns | 8,945.674 ns | 8,367.789 ns | 6 | 630.0000 | - | - | 4000000 B | 519 B |
* FSharp.Core: Map: optimize tree layout Following discussion and POC code from #5360 (comment) Changes are very straightforward and do not touch public API: * Performance improves by a huge margin * Code size is smaller or same * Memory is same * No low level tricks, just simple code (see `asNode` comments for potential micro-optimizations, which are not visible after all; these comments are to be deleted) Benchmarks code is here: https://github.com/buybackoff/fsharp-benchmarks | Method | Job | BuildConfiguration | Size | Mean | Error | StdDev | Rank | Gen 0 | Gen 1 | Gen 2 | Allocated | Code Size | |------------ |------- |------------------- |--------- |------------------:|-----------------:|-----------------:|-----:|-----------:|---------:|--------:|------------:|----------:| | getItem | After | LocalBuild | 100 | 36.21 ns | 0.199 ns | 0.167 ns | 1 | - | - | - | - | 126 B | | getItem | Before | Default | 100 | 62.51 ns | 0.143 ns | 0.127 ns | 2 | - | - | - | - | 126 B | | getItem | After | LocalBuild | 10000 | 76.57 ns | 0.140 ns | 0.124 ns | 3 | - | - | - | - | 126 B | | getItem | Before | Default | 10000 | 120.02 ns | 0.182 ns | 0.170 ns | 4 | - | - | - | - | 126 B | | getItem | After | LocalBuild | 10000000 | 129.45 ns | 0.126 ns | 0.118 ns | 5 | - | - | - | - | 126 B | | getItem | Before | Default | 10000000 | 209.35 ns | 0.496 ns | 0.464 ns | 6 | - | - | - | - | 126 B | | | | | | | | | | | | | | | | containsKey | After | LocalBuild | 100 | 35.63 ns | 0.201 ns | 0.188 ns | 1 | - | - | - | - | 177 B | | containsKey | Before | Default | 100 | 64.01 ns | 0.351 ns | 0.328 ns | 2 | - | - | - | - | 276 B | | containsKey | After | LocalBuild | 10000 | 65.63 ns | 0.150 ns | 0.125 ns | 3 | - | - | - | - | 177 B | | containsKey | Before | Default | 10000 | 123.82 ns | 0.149 ns | 0.139 ns | 5 | - | - | - | - | 276 B | | containsKey | After | LocalBuild | 10000000 | 95.05 ns | 0.082 ns | 0.072 ns | 4 | - | - | - | - | 177 B | | containsKey | Before | Default | 10000000 | 204.39 ns | 0.338 ns | 0.282 ns | 6 | - | - | - | - | 276 B | | | | | | | | | | | | | | | | itemCount | After | LocalBuild | 100 | 231.39 ns | 0.406 ns | 0.360 ns | 1 | - | - | - | - | 96 B | | itemCount | Before | Default | 100 | 539.74 ns | 1.923 ns | 1.798 ns | 2 | - | - | - | - | 151 B | | itemCount | After | LocalBuild | 10000 | 33,160.50 ns | 194.709 ns | 182.131 ns | 3 | - | - | - | - | 96 B | | itemCount | Before | Default | 10000 | 63,074.34 ns | 138.682 ns | 129.724 ns | 4 | - | - | - | - | 151 B | | itemCount | After | LocalBuild | 10000000 | 62,332,911.90 ns | 252,973.481 ns | 224,254.402 ns | 5 | - | - | - | 148 B | 96 B | | itemCount | Before | Default | 10000000 | 94,745,625.56 ns | 205,640.690 ns | 192,356.429 ns | 6 | - | - | - | - | 151 B | | | | | | | | | | | | | | | | iterForeach | After | LocalBuild | 100 | 3,355.75 ns | 9.540 ns | 7.448 ns | 1 | 0.9727 | - | - | 6120 B | 291 B | | iterForeach | Before | Default | 100 | 3,866.56 ns | 10.148 ns | 8.996 ns | 2 | 0.9689 | - | - | 6120 B | 291 B | | iterForeach | After | LocalBuild | 10000 | 348,359.43 ns | 1,148.753 ns | 959.260 ns | 3 | 95.2148 | - | - | 600120 B | 291 B | | iterForeach | Before | Default | 10000 | 398,419.61 ns | 513.959 ns | 480.758 ns | 4 | 95.2148 | - | - | 600120 B | 291 B | | iterForeach | After | LocalBuild | 10000000 | 391,889,200.00 ns | 1,604,306.946 ns | 1,500,669.712 ns | 5 | 95000.0000 | - | - | 600000120 B | 321 B | | iterForeach | Before | Default | 10000000 | 445,099,028.57 ns | 1,380,498.715 ns | 1,223,776.153 ns | 6 | 95000.0000 | - | - | 600000120 B | 321 B | | | | | | | | | | | | | | | | addItem | After | LocalBuild | 100 | 181.25 ns | 0.961 ns | 0.899 ns | 1 | 0.0586 | 0.0003 | - | 369 B | 621 B | | addItem | Before | Default | 100 | 311.85 ns | 0.601 ns | 0.562 ns | 2 | 0.0586 | - | - | 369 B | 697 B | | addItem | After | LocalBuild | 10000 | 40,893.49 ns | 174.683 ns | 163.398 ns | 3 | 11.0156 | 3.2813 | - | 69324 B | 621 B | | addItem | Before | Default | 10000 | 71,746.33 ns | 130.309 ns | 121.891 ns | 4 | 11.0156 | 3.3594 | - | 69324 B | 697 B | | addItem | After | LocalBuild | 10000000 | 87,178,251.47 ns | 250,148.324 ns | 233,988.898 ns | 5 | 18680.0000 | 960.0000 | 10.0000 | 117146915 B | 621 B | | addItem | Before | Default | 10000000 | 146,799,424.80 ns | 286,531.195 ns | 268,021.458 ns | 6 | 18680.0000 | 960.0000 | 10.0000 | 117146915 B | 697 B | | | | | | | | | | | | | | | | removeItem | After | LocalBuild | 100 | 13.64 ns | 0.112 ns | 0.105 ns | 1 | 0.0064 | - | - | 40 B | 469 B | | removeItem | Before | Default | 100 | 16.38 ns | 0.071 ns | 0.067 ns | 2 | 0.0064 | - | - | 40 B | 519 B | | removeItem | After | LocalBuild | 10000 | 1,329.24 ns | 9.087 ns | 8.056 ns | 3 | 0.6372 | - | - | 4000 B | 469 B | | removeItem | Before | Default | 10000 | 1,607.21 ns | 5.566 ns | 5.206 ns | 4 | 0.6372 | - | - | 4000 B | 519 B | | removeItem | After | LocalBuild | 10000000 | 1,232,230.00 ns | 6,303.414 ns | 5,896.218 ns | 5 | 630.0000 | - | - | 4000000 B | 469 B | | removeItem | Before | Default | 10000000 | 1,801,088.33 ns | 8,945.674 ns | 8,367.789 ns | 6 | 630.0000 | - | - | 4000000 B | 519 B | * Simplify node ctors * FSharp.Core: Map: delete notes in asNode * FSharp.Core: Map: fix typo in spliceOutSuccessor * FSharp.Core: Map: remove unused open
* FSharp.Core: Map: optimize tree layout Following discussion and POC code from dotnet#5360 (comment) Changes are very straightforward and do not touch public API: * Performance improves by a huge margin * Code size is smaller or same * Memory is same * No low level tricks, just simple code (see `asNode` comments for potential micro-optimizations, which are not visible after all; these comments are to be deleted) Benchmarks code is here: https://github.com/buybackoff/fsharp-benchmarks | Method | Job | BuildConfiguration | Size | Mean | Error | StdDev | Rank | Gen 0 | Gen 1 | Gen 2 | Allocated | Code Size | |------------ |------- |------------------- |--------- |------------------:|-----------------:|-----------------:|-----:|-----------:|---------:|--------:|------------:|----------:| | getItem | After | LocalBuild | 100 | 36.21 ns | 0.199 ns | 0.167 ns | 1 | - | - | - | - | 126 B | | getItem | Before | Default | 100 | 62.51 ns | 0.143 ns | 0.127 ns | 2 | - | - | - | - | 126 B | | getItem | After | LocalBuild | 10000 | 76.57 ns | 0.140 ns | 0.124 ns | 3 | - | - | - | - | 126 B | | getItem | Before | Default | 10000 | 120.02 ns | 0.182 ns | 0.170 ns | 4 | - | - | - | - | 126 B | | getItem | After | LocalBuild | 10000000 | 129.45 ns | 0.126 ns | 0.118 ns | 5 | - | - | - | - | 126 B | | getItem | Before | Default | 10000000 | 209.35 ns | 0.496 ns | 0.464 ns | 6 | - | - | - | - | 126 B | | | | | | | | | | | | | | | | containsKey | After | LocalBuild | 100 | 35.63 ns | 0.201 ns | 0.188 ns | 1 | - | - | - | - | 177 B | | containsKey | Before | Default | 100 | 64.01 ns | 0.351 ns | 0.328 ns | 2 | - | - | - | - | 276 B | | containsKey | After | LocalBuild | 10000 | 65.63 ns | 0.150 ns | 0.125 ns | 3 | - | - | - | - | 177 B | | containsKey | Before | Default | 10000 | 123.82 ns | 0.149 ns | 0.139 ns | 5 | - | - | - | - | 276 B | | containsKey | After | LocalBuild | 10000000 | 95.05 ns | 0.082 ns | 0.072 ns | 4 | - | - | - | - | 177 B | | containsKey | Before | Default | 10000000 | 204.39 ns | 0.338 ns | 0.282 ns | 6 | - | - | - | - | 276 B | | | | | | | | | | | | | | | | itemCount | After | LocalBuild | 100 | 231.39 ns | 0.406 ns | 0.360 ns | 1 | - | - | - | - | 96 B | | itemCount | Before | Default | 100 | 539.74 ns | 1.923 ns | 1.798 ns | 2 | - | - | - | - | 151 B | | itemCount | After | LocalBuild | 10000 | 33,160.50 ns | 194.709 ns | 182.131 ns | 3 | - | - | - | - | 96 B | | itemCount | Before | Default | 10000 | 63,074.34 ns | 138.682 ns | 129.724 ns | 4 | - | - | - | - | 151 B | | itemCount | After | LocalBuild | 10000000 | 62,332,911.90 ns | 252,973.481 ns | 224,254.402 ns | 5 | - | - | - | 148 B | 96 B | | itemCount | Before | Default | 10000000 | 94,745,625.56 ns | 205,640.690 ns | 192,356.429 ns | 6 | - | - | - | - | 151 B | | | | | | | | | | | | | | | | iterForeach | After | LocalBuild | 100 | 3,355.75 ns | 9.540 ns | 7.448 ns | 1 | 0.9727 | - | - | 6120 B | 291 B | | iterForeach | Before | Default | 100 | 3,866.56 ns | 10.148 ns | 8.996 ns | 2 | 0.9689 | - | - | 6120 B | 291 B | | iterForeach | After | LocalBuild | 10000 | 348,359.43 ns | 1,148.753 ns | 959.260 ns | 3 | 95.2148 | - | - | 600120 B | 291 B | | iterForeach | Before | Default | 10000 | 398,419.61 ns | 513.959 ns | 480.758 ns | 4 | 95.2148 | - | - | 600120 B | 291 B | | iterForeach | After | LocalBuild | 10000000 | 391,889,200.00 ns | 1,604,306.946 ns | 1,500,669.712 ns | 5 | 95000.0000 | - | - | 600000120 B | 321 B | | iterForeach | Before | Default | 10000000 | 445,099,028.57 ns | 1,380,498.715 ns | 1,223,776.153 ns | 6 | 95000.0000 | - | - | 600000120 B | 321 B | | | | | | | | | | | | | | | | addItem | After | LocalBuild | 100 | 181.25 ns | 0.961 ns | 0.899 ns | 1 | 0.0586 | 0.0003 | - | 369 B | 621 B | | addItem | Before | Default | 100 | 311.85 ns | 0.601 ns | 0.562 ns | 2 | 0.0586 | - | - | 369 B | 697 B | | addItem | After | LocalBuild | 10000 | 40,893.49 ns | 174.683 ns | 163.398 ns | 3 | 11.0156 | 3.2813 | - | 69324 B | 621 B | | addItem | Before | Default | 10000 | 71,746.33 ns | 130.309 ns | 121.891 ns | 4 | 11.0156 | 3.3594 | - | 69324 B | 697 B | | addItem | After | LocalBuild | 10000000 | 87,178,251.47 ns | 250,148.324 ns | 233,988.898 ns | 5 | 18680.0000 | 960.0000 | 10.0000 | 117146915 B | 621 B | | addItem | Before | Default | 10000000 | 146,799,424.80 ns | 286,531.195 ns | 268,021.458 ns | 6 | 18680.0000 | 960.0000 | 10.0000 | 117146915 B | 697 B | | | | | | | | | | | | | | | | removeItem | After | LocalBuild | 100 | 13.64 ns | 0.112 ns | 0.105 ns | 1 | 0.0064 | - | - | 40 B | 469 B | | removeItem | Before | Default | 100 | 16.38 ns | 0.071 ns | 0.067 ns | 2 | 0.0064 | - | - | 40 B | 519 B | | removeItem | After | LocalBuild | 10000 | 1,329.24 ns | 9.087 ns | 8.056 ns | 3 | 0.6372 | - | - | 4000 B | 469 B | | removeItem | Before | Default | 10000 | 1,607.21 ns | 5.566 ns | 5.206 ns | 4 | 0.6372 | - | - | 4000 B | 519 B | | removeItem | After | LocalBuild | 10000000 | 1,232,230.00 ns | 6,303.414 ns | 5,896.218 ns | 5 | 630.0000 | - | - | 4000000 B | 469 B | | removeItem | Before | Default | 10000000 | 1,801,088.33 ns | 8,945.674 ns | 8,367.789 ns | 6 | 630.0000 | - | - | 4000000 B | 519 B | * Simplify node ctors * FSharp.Core: Map: delete notes in asNode * FSharp.Core: Map: fix typo in spliceOutSuccessor * FSharp.Core: Map: remove unused open
It's a shame all these PRs been abandoned. I will be resurrecting them post .NET8 release (in november). Compiler perf should be among top priorities. |
Working off the base of #5307, where I had some rudimentary
Map
performance tests in this comment, I started to have a play around with the code to see if I could improve it.Map
is a three part union, which is the worst kind (for performance)! (NB: Although using an attribute to handle the nullary case makes it somewhat better) From memory I think they used to doGetType()
calls and compare those (I could be mistaken...) but anyway, in the current state they do type checks (isinst
), which are pretty fast, but not as fast as avoiding the checks altogether.This was some of my original motivation with #1517, which would have still required a virtual call, but then just a
switch
- i.e. half-way house to the >= 4-part union'sTag
, but at the time I was put off due to my belief that it would affect serialization (#1517 is stillopen
but I think it should just be closed given that the underlying code base has moved on). Upon further inspection though, I don't think serialization is a concern though, as theMapTree
union is internal, and serialization is handled through the flattening of the map into an array.Anyway, as far as
Map
is concerned, it can be represented as a two part-union case - removingMapOne
- leaving one of the remaining cases asnull
, which translates down to no type-casts, just the null check. This also simplifies the code somewhat - not having to deal with the leaf node as a special case. The down side is that it does increase memory for leaf-nodes by2*sizeof<ptr>+sizeof<int>
. But given that it'ssizeof<key>+sizeof<value>+object overhead
whereobject overhead
= 8 for 32-bit or 16 for 64-bit JIT I don't think this is a showstopper, but let me know (sooner rather than later would be good...) And actually this is howzmap
in the compiler is represented.Other improvements:
Map.count
to change from O(n) to O(1)ofSeq
,ofList
,ofArray