[CompilerPerf] [WIP] Optimizations on Map and Set (post #5307) #5360

Closed

manofstick
Contributor

@manofstick manofstick commented Jul 20, 2018

Working off the base of #5307, where I had some rudimentary Map performance tests in this comment, I started to have a play around with the code to see if I could improve it.

Map is a three part union, which is the worst kind (for performance)! (NB: Although using an attribute to handle the nullary case makes it somewhat better) From memory I think they used to do GetType() calls and compare those (I could be mistaken...) but anyway, in the current state they do type checks (isinst), which are pretty fast, but not as fast as avoiding the checks altogether.

This was some of my original motivation with #1517, which would have still required a virtual call, but then just a switch, i.e. a half-way house to the >= 4-part union's Tag. At the time I was put off by my belief that it would affect serialization (#1517 is still open, but I think it should just be closed given that the underlying code base has moved on). Upon further inspection, I don't think serialization is a concern, as the MapTree union is internal and serialization is handled by flattening the map into an array.

Anyway, as far as Map is concerned, it can be represented as a two-case union: remove MapOne and represent one of the remaining cases as null, which translates down to no type checks, just the null check. This also simplifies the code somewhat, since the leaf node no longer has to be handled as a special case. The downside is that it increases memory for leaf nodes by 2*sizeof<ptr>+sizeof<int>. But given that a leaf is already sizeof<key>+sizeof<value>+object overhead, where object overhead = 8 for 32-bit or 16 for 64-bit JIT, I don't think this is a showstopper, but let me know (sooner rather than later would be good...). And actually this is how zmap in the compiler is represented.
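
For readers who want to see the shape being described, here is a minimal sketch of a two-case tree whose empty case is represented as null. It is not the PR's code: the type and function names, the field order, and the use of `IComparer` are assumptions made for illustration. The `UseNullAsTrueValue` attribute is the standard F# way to ask for a nullary case to be compiled as null.

```fsharp
open System.Collections.Generic

// Hypothetical two-case tree; names mirror the thread but this is not the PR's code.
[<CompilationRepresentation(CompilationRepresentationFlags.UseNullAsTrueValue)>]
[<NoEquality; NoComparison>]
type MapTree<'Key, 'Value> =
    | MapEmpty // compiled as null because of the attribute above
    | MapNode of 'Key * 'Value * MapTree<'Key, 'Value> * MapTree<'Key, 'Value> * int

// A leaf is simply a node whose two children are empty (this is where the
// extra 2 pointers + 1 int per leaf come from).
let leaf k v = MapNode (k, v, MapEmpty, MapEmpty, 1)

// Lookup only ever distinguishes "empty" from "node".
let rec tryFind (comparer: IComparer<'Key>) (k: 'Key) (t: MapTree<'Key, 'Value>) =
    match t with
    | MapEmpty -> None
    | MapNode (k2, v, l, r, _) ->
        let c = comparer.Compare (k, k2)
        if c = 0 then Some v
        elif c < 0 then tryFind comparer k l
        else tryFind comparer k r

let tree = MapNode (2, "two", leaf 1 "one", leaf 3 "three", 3)
let found = tryFind Comparer<int>.Default 3 tree
let missing = tryFind Comparer<int>.Default 4 tree
```

With only two cases, one of which is null, the match needs no `isinst` test to tell a leaf from an interior node, which is the saving the comment above is after.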

Other improvements:

  • Converted height to size, thereby allowing Map.count to change from O(n) to O(1)
  • Better construction from ofSeq, ofList, ofArray
  • To be announced
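
The first bullet can be illustrated with a small sketch. The `Tree` type, `mk`, and `countByWalking` are hypothetical names, and the assumption is only that each node caches the number of elements in its subtree in its last field.

```fsharp
// Hypothetical node shape: the last field caches the subtree's element count.
type Tree<'K, 'V> =
    | Empty
    | Node of 'K * 'V * Tree<'K, 'V> * Tree<'K, 'V> * int

// O(1): read the cached size from the root.
let count t =
    match t with
    | Empty -> 0
    | Node (_, _, _, _, size) -> size

// Smart constructor that keeps the cached size consistent.
let mk k v l r = Node (k, v, l, r, count l + count r + 1)

// O(n): what a node that only stores its height would have to do instead.
let rec countByWalking t =
    match t with
    | Empty -> 0
    | Node (_, _, l, r, _) -> countByWalking l + countByWalking r + 1

let t = mk 2 "b" (mk 1 "a" Empty Empty) (mk 3 "c" Empty Empty)
```

Both functions return the same number; the difference is that `count` never looks below the root.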

@manofstick manofstick changed the title Optimizations on Map (post #5307) [WIP] Optimizations on Map (post #5307) Jul 20, 2018
@manofstick
Contributor Author

manofstick commented Jul 21, 2018

Gist: https://gist.github.com/manofstick/275fe8ed62091aec52cd382548719f2a

Time comparison between post-#5307 and this PR (this is the effect of removing MapOne)

bittage key type creation access
32-bit KeyRecord 77% 92%
32-bit KeyGenericRecord`1 84% 94%
32-bit KeyStruct 81% 90%
32-bit KeyGenericStruct`1 85% 92%
32-bit Tuple`3 86% 95%
32-bit ValueTuple`3 83% 90%
32-bit Int32 64% 78%
32-bit Int64 62% 81%
64-bit KeyRecord 84% 96%
64-bit KeyGenericRecord`1 84% 90%
64-bit KeyStruct 87% 91%
64-bit KeyGenericStruct`1 87% 90%
64-bit Tuple`3 92% 86%
64-bit ValueTuple`3 85% 90%
64-bit Int32 60% 77%
64-bit Int64 63% 71%

(edit: updated times after inlining)

@manofstick
Contributor Author

Gist: https://gist.github.com/manofstick/e97dc9775bf01fd22b2f238cac9f1c27

This gist demonstrates the performance improvement when creating small-to-mid-sized Maps via ofSeq (and equally ofList, ofArray). The gist has a map of ~140 string names of colours. The tree created by the modified ofSeq also has a better distribution, which makes accesses to it faster as well.
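
The thread does not describe how the modified ofSeq works, so the following is only one plausible approach, shown to make "better distribution" concrete: sort the input once and build the tree by recursive halving, so that every subtree is as balanced as possible. All names here (`Tree`, `build`, `ofSeqBalanced`, `height`) are hypothetical.

```fsharp
// Hypothetical size-annotated tree, not the PR's MapTree.
type Tree<'K, 'V> =
    | Empty
    | Node of 'K * 'V * Tree<'K, 'V> * Tree<'K, 'V> * int

let size t =
    match t with
    | Empty -> 0
    | Node (_, _, _, _, s) -> s

// Build a balanced tree from the slice [lo, hi) of an array sorted by key.
let rec build (items: ('K * 'V)[]) lo hi : Tree<'K, 'V> =
    if lo >= hi then Empty
    else
        let mid = lo + (hi - lo) / 2
        let (k, v) = items.[mid]
        let l = build items lo mid
        let r = build items (mid + 1) hi
        Node (k, v, l, r, size l + size r + 1)

let ofSeqBalanced (source: seq<'K * 'V>) : Tree<'K, 'V> =
    let items =
        source
        |> Seq.toArray
        |> Array.rev              // so that the last binding of a duplicate key...
        |> Array.distinctBy fst   // ...is the one that survives, as with Map.ofSeq
        |> Array.sortBy fst
    build items 0 items.Length

let rec height t =
    match t with
    | Empty -> 0
    | Node (_, _, l, r, _) -> 1 + max (height l) (height r)

let sample = ofSeqBalanced [ for i in 1 .. 7 -> i, string i ]
```

Seven keys end up in a tree of height 3, the minimum possible, whereas inserting one key at a time relies on rebalancing to approach that shape.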

| test | bittage | remove OneNode | ofSeq |
|---|---|---|---|
| construct | 64-bit | 41% | 28% |
| construct | 32-bit | 67% | 40% |
| access | 64-bit | 47% | 32% |
| access | 32-bit | 72% | 43% |

@vasily-kirichenko
Copy link
Contributor

I’m not sure MS is gonna merge such PRs anytime soon (in 2 years at least).

@forki
Contributor

forki commented Jul 21, 2018 via email

@manofstick
Contributor Author

Meh... they do, they don't. Not really any skin off my nose. I just do this stuff because I find sudoku boring...

@forki
Contributor

forki commented Jul 21, 2018

Everything that improves perf on map, will improve compiler perf (given that it also applies to that second internal map implementation). It's great to see it happen.

@vasily-kirichenko
Contributor

Meh... they do, they don't. Not really any skin off my nose. I just do this stuff because I find sudoku boring...

Then OK :)

if t2h > t1h + 2 then (* right is heavier than left *)
let t1h = size t1
let t2h = size t2
if (t2h >>> 1) > t1h then (* right is heavier than left *)
Contributor

You changed from -2 to /2; is this intended? Is it because we now control size instead of height? If so, then the identifier should be renamed to reflect that.
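
To make the question concrete, the two conditions from the quoted diff can be put side by side. The function names below are hypothetical; the bodies are the two comparisons exactly as quoted, with the operands renamed to say what they measure.

```fsharp
// Old check: operands are subtree heights, so the slack is additive.
let rightTooTallByHeight (leftHeight: int) (rightHeight: int) =
    rightHeight > leftHeight + 2

// New check: operands are subtree sizes (element counts), so the slack is a ratio:
// the right side must hold more than roughly twice as many elements as the left.
let rightTooHeavyBySize (leftSize: int) (rightSize: int) =
    (rightSize >>> 1) > leftSize

let a = rightTooTallByHeight 1 4 // 4 > 1 + 2
let b = rightTooTallByHeight 1 3 // 3 > 3 is false
let c = rightTooHeavyBySize 3 8  // 8 >>> 1 = 4, and 4 > 3
let d = rightTooHeavyBySize 3 7  // 7 >>> 1 = 3, and 3 > 3 is false
```

Height grows logarithmically with the number of elements, so an additive tolerance on height corresponds to a multiplicative tolerance on size. That is consistent with the switch from `+ 2` to `>>> 1`, and it is why keeping the `t1h`/`t2h` names would be misleading.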

Contributor

@forki forki left a comment

I think the removal of MapOne and the move from controlling height to controlling size should be done in a separate pull request.

  let size x =
    match x with
    | MapEmpty -> 0
    | MapNode (_,_,_,_,h) -> h
Contributor

We should rename h here to size

Contributor

Also make it inline

Contributor Author

Currently all this is hanging off the end of #5307. Maybe I should rebranch off current master?

Contributor

@manofstick if possible, this would be best, yes. We can evaluate this separately which makes it easier to review, test, and incorporate.

Contributor Author

@cartemp

Happy to do so, if you are happy to trade memory for performance (which is what the removal of MapOne means, as per the PR's text). I mean, I'm happy to spend time doing this as long as it's not a complete waste of my time...

Contributor

Hmmm, we can think about it. From the standpoint of making it easier to merge, having this be entirely separate is best, but if doing so is too difficult (or at least wouldn't make much sense without the others) then I think it's fine to keep working as-is.

Since #5307 is marked as approved for F# vNext, I've tagged it with the dev16.0 milestone. This could just as well stay as it is today and be considered in the context of #5307. It's just that our evaluation of it will also depend on that, and it is generally less likely for something to come in if it takes a dependency on something else coming in. But I think that risk is smaller given that the dependent branch is on the path towards approval.

@dsyme
Contributor

dsyme commented Jul 23, 2018

Just to note I've already marked the three PRs that this one depends on as "Candidate for F# vNext". I'll do the same with this one.

The first of the PRs is basically certain to go in, on last review it was exactly where I want it to be. Obviously each one is progressively less likely due to the compounding nature of likelihood, but the perf results in this one are excellent and consistent.

@dsyme dsyme changed the title [WIP] Optimizations on Map (post #5307) [CompilerPerf] [WIP] Optimizations on Map (post #5307) Jul 23, 2018
@dsyme
Contributor

dsyme commented Jul 23, 2018

Impressive work and good performance testing.

@TIHan TIHan added Tenet-Performance Area-Library Issues for FSharp.Core not covered elsewhere labels Jul 23, 2018
@manofstick
Contributor Author

manofstick commented Jul 27, 2018

...and let's now test against the competition!

Round 1: Creation of trees

Tested vs System.Collections.Immutable's ImmutableSortedDictionary (Version 1.5.0). Note that underlying both is just a binary tree, so we should, at some size, expect a constant percentage difference (i.e. ultimately the O(log N) dominates). But because ImmutableSortedDictionary uses standard AVL logic and Map uses the modified size algorithm, at smaller tree sizes we could expect some differences. For garbage collection, check the gist's raw results, where I have recorded the number of collections. Differences begin with ImmutableSortedDictionary originally creating mutable nodes and only "freezing" them after rebalancing. Map's nodes are always immutable.

This test is building the associative container 1 integer at a time (and sanity checks that one of the elements exists in the container).

I provide data in "Stripes" or randomly. "Striped" data is chunks of ordered data. Both Random and Stripes are run under 10 different seed values (the same seed set is used for both containers). Each run is a separate cold start, where the size and seed are passed as parameters on the command line.

I also provide two types of objects where the CompareTo functionality is either fast (with an int) or slow (a custom type that just burns cycles on comparison).
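
The actual key types live in the gist; as a rough idea of what a "slow" key could look like, here is a hypothetical record whose CompareTo does some throwaway work before comparing. The type name, the loop length, and the way the busy work is folded into the result are all assumptions made for this sketch.

```fsharp
open System

// Hypothetical "slow" key: comparison burns cycles before comparing the payload.
[<CustomEquality; CustomComparison>]
type SlowKey =
    { Value: int }
    override this.Equals other =
        match other with
        | :? SlowKey as o -> this.Value = o.Value
        | _ -> false
    override this.GetHashCode () = this.Value
    interface IComparable with
        member this.CompareTo other =
            match other with
            | :? SlowKey as o ->
                // Busy work standing in for an expensive comparison.
                let mutable acc = 0
                for i in 1 .. 1000 do
                    acc <- acc + (i &&& 1)
                // acc is added to both sides, so it never changes the ordering;
                // it is only used so that the loop's result is consumed.
                compare (this.Value + acc) (o.Value + acc)
            | _ -> invalidArg "other" "expected a SlowKey"

// The "fast" case is simply an int key, which needs no custom type.
let slowMap = Map.ofList [ { Value = 2 }, "two"; { Value = 1 }, "one" ]
let fastMap = Map.ofList [ 2, "two"; 1, "one" ]
```

With a key like this the comparison dominates the run time, so differences in tree layout matter much less, which fits the roughly constant ratios in the "Slow" rows.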

The gist, including driver and raw results, is here.

We see that with the "Slow" comparison type we are slightly slower than ImmutableSortedDictionary, but we are stable across sizes. With a "Fast" comparison type we are markedly faster, especially on partially ordered data (i.e. in "Stripes").

Shape Comparison Type Size Immutable Dictionary Map Percent
Random Fast 1 2863 1434 50%
Random Fast 2 3386 1605 47%
Random Fast 3 4222 2005 47%
Random Fast 5 5191 2389 46%
Random Fast 7 5987 2717 45%
Random Fast 11 6967 3174 46%
Random Fast 17 8036 3615 45%
Random Fast 25 9301 4001 43%
Random Fast 38 10759 4425 41%
Random Fast 57 12165 4976 41%
Random Fast 86 13798 5662 41%
Random Fast 129 15590 6552 42%
Random Fast 194 17743 7576 43%
Random Fast 291 19772 8708 44%
Random Fast 437 21792 9818 45%
Random Fast 656 23583 10862 46%
Random Fast 985 25406 11847 47%
Random Fast 1477 27153 12943 48%
Random Fast 2216 29070 14144 49%
Random Fast 3325 31348 15695 50%
Random Fast 4987 34270 17156 50%
Random Fast 7481 37423 21330 57%
Random Fast 11222 40752 24078 59%
Random Fast 16834 44853 27693 62%
Random Fast 25251 48257 31445 65%
Random Fast 37876 55900 37681 67%
Random Fast 56815 63080 46147 73%
Random Fast 85222 71523 52722 74%
Random Fast 127834 80242 64918 81%
Random Fast 191751 96824 82062 85%
Random Fast 287626 100003 85404 85%
Random Fast 431439 105871 93223 88%
Random Fast 647159 110279 97758 89%
Random Slow 1 660 704 107%
Random Slow 2 692 710 103%
Random Slow 3 921 965 105%
Random Slow 5 1156 1198 104%
Random Slow 7 1393 1422 102%
Random Slow 11 1649 1739 105%
Random Slow 17 1901 1975 104%
Random Slow 25 2214 2230 101%
Random Slow 38 2493 2601 104%
Random Slow 57 2850 2956 104%
Random Slow 86 3137 3257 104%
Random Slow 129 3478 3578 103%
Random Slow 194 3763 3949 105%
Random Slow 291 4097 4262 104%
Random Slow 437 4474 4683 105%
Random Slow 656 4833 5027 104%
Random Slow 985 5142 5272 103%
Random Slow 1477 5261 5370 102%
Random Slow 2216 5773 5979 104%
Random Slow 3325 5864 6021 103%
Random Slow 4987 6606 6945 105%
Random Slow 7481 6368 6520 102%
Random Slow 11222 6627 6928 105%
Random Slow 16834 5290 5490 104%
Random Slow 25251 8289 8597 104%
Random Slow 37876 13117 13551 103%
Random Slow 56815 20500 21327 104%
Random Slow 85222 32135 33225 103%
Random Slow 127834 50642 52602 104%
Random Slow 191751 79568 82179 103%
Random Slow 287626 123849 128557 104%
Random Slow 431439 195671 201853 103%
Random Slow 647159 305343 316787 104%
Stripes Fast 1 2856 1514 53%
Stripes Fast 2 3431 1651 48%
Stripes Fast 3 4360 2140 49%
Stripes Fast 5 5548 2556 46%
Stripes Fast 7 6550 2870 44%
Stripes Fast 11 7707 3506 45%
Stripes Fast 17 9072 4011 44%
Stripes Fast 25 10281 4527 44%
Stripes Fast 38 11697 5081 43%
Stripes Fast 57 12953 5710 44%
Stripes Fast 86 14317 6374 45%
Stripes Fast 129 15583 6846 44%
Stripes Fast 194 16873 7570 45%
Stripes Fast 291 18131 8141 45%
Stripes Fast 437 19199 8737 46%
Stripes Fast 656 20384 9303 46%
Stripes Fast 985 21640 10025 46%
Stripes Fast 1477 22910 10747 47%
Stripes Fast 2216 24384 11686 48%
Stripes Fast 3325 25970 12906 50%
Stripes Fast 4987 28120 14377 51%
Stripes Fast 7481 30682 16354 53%
Stripes Fast 11222 33498 18861 56%
Stripes Fast 16834 36100 19899 55%
Stripes Fast 25251 37287 21505 58%
Stripes Fast 37876 39711 23095 58%
Stripes Fast 56815 41343 23891 58%
Stripes Fast 85222 43093 24781 58%
Stripes Fast 127834 44354 26347 59%
Stripes Fast 191751 48477 29410 61%
Stripes Fast 287626 46868 28105 60%
Stripes Fast 431439 43926 26631 61%
Stripes Fast 647159 41014 24992 61%
Stripes Slow 1 656 701 107%
Stripes Slow 2 682 691 101%
Stripes Slow 3 1062 1084 102%
Stripes Slow 5 1255 1306 104%
Stripes Slow 7 1479 1530 103%
Stripes Slow 11 1766 1820 103%
Stripes Slow 17 2080 2130 102%
Stripes Slow 25 2378 2476 104%
Stripes Slow 38 2731 2801 103%
Stripes Slow 57 3031 3096 102%
Stripes Slow 86 3406 3501 103%
Stripes Slow 129 3725 3835 103%
Stripes Slow 194 4061 4165 103%
Stripes Slow 291 4361 4493 103%
Stripes Slow 437 4738 4892 103%
Stripes Slow 656 5113 5228 102%
Stripes Slow 985 5395 5524 102%
Stripes Slow 1477 5507 5672 103%
Stripes Slow 2216 5970 6100 102%
Stripes Slow 3325 6084 6271 103%
Stripes Slow 4987 6869 7012 102%
Stripes Slow 7481 6504 6703 103%
Stripes Slow 11222 6869 6938 101%
Stripes Slow 16834 5379 5506 102%
Stripes Slow 25251 8415 8628 103%
Stripes Slow 37876 13081 13407 102%
Stripes Slow 56815 20326 20920 103%
Stripes Slow 85222 31584 32480 103%
Stripes Slow 127834 49174 50400 102%
Stripes Slow 191751 76471 78280 102%
Stripes Slow 287626 118653 122078 103%
Stripes Slow 431439 183274 190553 104%
Stripes Slow 647159 284078 294413 104%

@manofstick
Contributor Author

manofstick commented Jul 28, 2018

Round 2: Finding elements

Slightly modified from Round 1, this version just creates the tree once, and then continually queries all the elements. The gist is here.

We start to see worse behaviour under 64-bit for large collections.

Bittage Comparer Type Data Type Size ImmutableSortedDictionary Map %
32-bit Fast Random 1 2636 1556 59%
32-bit Fast Random 2 3270 2081 64%
32-bit Fast Random 3 3576 2255 63%
32-bit Fast Random 5 4336 3354 77%
32-bit Fast Random 7 4783 3094 65%
32-bit Fast Random 11 5473 3517 64%
32-bit Fast Random 17 6257 3912 63%
32-bit Fast Random 25 6959 4326 62%
32-bit Fast Random 38 7724 4842 63%
32-bit Fast Random 57 8630 5347 62%
32-bit Fast Random 86 9714 6164 63%
32-bit Fast Random 129 11578 8215 71%
32-bit Fast Random 194 14730 11574 79%
32-bit Fast Random 291 18826 15140 80%
32-bit Fast Random 437 22290 17823 80%
32-bit Fast Random 656 25147 20028 80%
32-bit Fast Random 985 27815 21809 78%
32-bit Fast Random 1477 30166 23462 78%
32-bit Fast Random 2216 32120 25021 78%
32-bit Fast Random 3325 34327 27025 79%
32-bit Fast Random 4987 36472 28977 79%
32-bit Fast Random 7481 37922 30773 81%
32-bit Fast Random 11222 41226 33373 81%
32-bit Fast Random 16834 43412 35632 82%
32-bit Fast Random 25251 45771 37758 82%
32-bit Fast Random 37876 48730 40649 83%
32-bit Fast Random 56815 51714 41784 81%
32-bit Fast Random 85222 55157 45392 82%
32-bit Fast Random 127834 62789 50060 80%
32-bit Fast Random 191751 67766 54130 80%
32-bit Fast Random 287626 72776 62671 86%
32-bit Fast Random 431439 83700 73360 88%
32-bit Fast Random 647159 92774 80676 87%
32-bit Fast Stripes 1 2709 2105 78%
32-bit Fast Stripes 2 3139 2128 68%
32-bit Fast Stripes 3 3576 2179 61%
32-bit Fast Stripes 5 4432 2695 61%
32-bit Fast Stripes 7 4851 3503 72%
32-bit Fast Stripes 11 5560 3370 61%
32-bit Fast Stripes 17 6248 3782 61%
32-bit Fast Stripes 25 6833 4227 62%
32-bit Fast Stripes 38 7555 4656 62%
32-bit Fast Stripes 57 8280 5113 62%
32-bit Fast Stripes 86 9341 6156 66%
32-bit Fast Stripes 129 11213 10133 90%
32-bit Fast Stripes 194 14099 12749 90%
32-bit Fast Stripes 291 16631 13912 84%
32-bit Fast Stripes 437 18315 14099 77%
32-bit Fast Stripes 656 19717 14793 75%
32-bit Fast Stripes 985 21058 15619 74%
32-bit Fast Stripes 1477 22594 16556 73%
32-bit Fast Stripes 2216 24025 17637 73%
32-bit Fast Stripes 3325 25079 18135 72%
32-bit Fast Stripes 4987 25829 18754 73%
32-bit Fast Stripes 7481 26408 18641 71%
32-bit Fast Stripes 11222 26459 19383 73%
32-bit Fast Stripes 16834 27559 20301 74%
32-bit Fast Stripes 25251 27871 20475 73%
32-bit Fast Stripes 37876 28710 21017 73%
32-bit Fast Stripes 56815 29344 21536 73%
32-bit Fast Stripes 85222 30235 21927 73%
32-bit Fast Stripes 127834 31001 22250 72%
32-bit Fast Stripes 191751 31973 23334 73%
32-bit Fast Stripes 287626 32705 22789 70%
32-bit Fast Stripes 431439 33400 22884 69%
32-bit Fast Stripes 647159 34255 23329 68%
32-bit Slow Random 1 2754 1569 57%
32-bit Slow Random 2 3252 2150 66%
32-bit Slow Random 3 3519 2441 69%
32-bit Slow Random 5 4278 2717 64%
32-bit Slow Random 7 4683 2938 63%
32-bit Slow Random 11 5349 3370 63%
32-bit Slow Random 17 6042 3776 62%
32-bit Slow Random 25 6732 4176 62%
32-bit Slow Random 38 7510 4681 62%
32-bit Slow Random 57 8299 5175 62%
32-bit Slow Random 86 9390 6026 64%
32-bit Slow Random 129 11249 7994 71%
32-bit Slow Random 194 14460 11316 78%
32-bit Slow Random 291 18418 15042 82%
32-bit Slow Random 437 21749 17669 81%
32-bit Slow Random 656 24516 19836 81%
32-bit Slow Random 985 27018 21645 80%
32-bit Slow Random 1477 29372 23270 79%
32-bit Slow Random 2216 31373 24988 80%
32-bit Slow Random 3325 33480 26968 81%
32-bit Slow Random 4987 35775 28859 81%
32-bit Slow Random 7481 37318 30556 82%
32-bit Slow Random 11222 40510 33298 82%
32-bit Slow Random 16834 42861 35511 83%
32-bit Slow Random 25251 45199 37817 84%
32-bit Slow Random 37876 48143 40468 84%
32-bit Slow Random 56815 51065 41886 82%
32-bit Slow Random 85222 54240 45527 84%
32-bit Slow Random 127834 61788 49391 80%
32-bit Slow Random 191751 66545 54098 81%
32-bit Slow Random 287626 71535 62515 87%
32-bit Slow Random 431439 82638 73199 89%
32-bit Slow Random 647159 91785 80820 88%
32-bit Slow Stripes 1 2566 1609 63%
32-bit Slow Stripes 2 3255 2158 66%
32-bit Slow Stripes 3 3658 2197 60%
32-bit Slow Stripes 5 4448 2575 58%
32-bit Slow Stripes 7 4828 2742 57%
32-bit Slow Stripes 11 5608 3293 59%
32-bit Slow Stripes 17 6308 3722 59%
32-bit Slow Stripes 25 6974 4163 60%
32-bit Slow Stripes 38 7737 4702 61%
32-bit Slow Stripes 57 8496 5363 63%
32-bit Slow Stripes 86 9547 6331 66%
32-bit Slow Stripes 129 11444 8518 74%
32-bit Slow Stripes 194 14536 12016 83%
32-bit Slow Stripes 291 17106 13837 81%
32-bit Slow Stripes 437 18848 14712 78%
32-bit Slow Stripes 656 20029 15426 77%
32-bit Slow Stripes 985 21394 16222 76%
32-bit Slow Stripes 1477 22882 17187 75%
32-bit Slow Stripes 2216 24368 18375 75%
32-bit Slow Stripes 3325 25401 19078 75%
32-bit Slow Stripes 4987 26200 19627 75%
32-bit Slow Stripes 7481 26911 19792 74%
32-bit Slow Stripes 11222 26804 20512 77%
32-bit Slow Stripes 16834 27898 21346 77%
32-bit Slow Stripes 25251 28293 21623 76%
32-bit Slow Stripes 37876 29103 22164 76%
32-bit Slow Stripes 56815 29826 22656 76%
32-bit Slow Stripes 85222 30640 23109 75%
32-bit Slow Stripes 127834 31342 23362 75%
32-bit Slow Stripes 191751 32218 23748 74%
32-bit Slow Stripes 287626 33011 24152 73%
32-bit Slow Stripes 431439 33798 24316 72%
32-bit Slow Stripes 647159 34558 25514 74%
64-bit Fast Random 1 3281 1540 47%
64-bit Fast Random 2 3480 2193 63%
64-bit Fast Random 3 3583 2340 65%
64-bit Fast Random 5 4305 2961 69%
64-bit Fast Random 7 4661 3254 70%
64-bit Fast Random 11 5145 3656 71%
64-bit Fast Random 17 5706 4166 73%
64-bit Fast Random 25 6162 4611 75%
64-bit Fast Random 38 6694 5166 77%
64-bit Fast Random 57 7473 5729 77%
64-bit Fast Random 86 8519 6442 76%
64-bit Fast Random 129 9854 7644 78%
64-bit Fast Random 194 11764 9380 80%
64-bit Fast Random 291 14416 11913 83%
64-bit Fast Random 437 17021 14658 86%
64-bit Fast Random 656 19180 16903 88%
64-bit Fast Random 985 21156 19103 90%
64-bit Fast Random 1477 23068 21029 91%
64-bit Fast Random 2216 25149 23119 92%
64-bit Fast Random 3325 27142 25268 93%
64-bit Fast Random 4987 28888 27372 95%
64-bit Fast Random 7481 31334 28983 92%
64-bit Fast Random 11222 33719 32051 95%
64-bit Fast Random 16834 35731 34515 97%
64-bit Fast Random 25251 37075 36623 99%
64-bit Fast Random 37876 40015 38539 96%
64-bit Fast Random 56815 43544 41451 95%
64-bit Fast Random 85222 49711 46549 94%
64-bit Fast Random 127834 55993 52586 94%
64-bit Fast Random 191751 59810 61414 103%
64-bit Fast Random 287626 67375 71425 106%
64-bit Fast Random 431439 78577 80692 103%
64-bit Fast Random 647159 87980 90743 103%
64-bit Fast Stripes 1 3279 1431 44%
64-bit Fast Stripes 2 3473 2091 60%
64-bit Fast Stripes 3 3550 2447 69%
64-bit Fast Stripes 5 4059 3115 77%
64-bit Fast Stripes 7 4295 3341 78%
64-bit Fast Stripes 11 4900 3979 81%
64-bit Fast Stripes 17 5401 4394 81%
64-bit Fast Stripes 25 5867 4843 83%
64-bit Fast Stripes 38 6495 5358 82%
64-bit Fast Stripes 57 7167 5968 83%
64-bit Fast Stripes 86 8043 6673 83%
64-bit Fast Stripes 129 9411 7977 85%
64-bit Fast Stripes 194 11083 9854 89%
64-bit Fast Stripes 291 12348 11298 91%
64-bit Fast Stripes 437 13753 12522 91%
64-bit Fast Stripes 656 15078 13633 90%
64-bit Fast Stripes 985 16248 14627 90%
64-bit Fast Stripes 1477 17710 16074 91%
64-bit Fast Stripes 2216 19041 17302 91%
64-bit Fast Stripes 3325 19798 18214 92%
64-bit Fast Stripes 4987 20514 17636 86%
64-bit Fast Stripes 7481 20356 18779 92%
64-bit Fast Stripes 11222 21430 19470 91%
64-bit Fast Stripes 16834 21538 20075 93%
64-bit Fast Stripes 25251 22023 20505 93%
64-bit Fast Stripes 37876 22423 21243 95%
64-bit Fast Stripes 56815 22690 21798 96%
64-bit Fast Stripes 85222 23113 22449 97%
64-bit Fast Stripes 127834 23621 22993 97%
64-bit Fast Stripes 191751 24549 23938 98%
64-bit Fast Stripes 287626 25065 24182 96%
64-bit Fast Stripes 431439 25294 24632 97%
64-bit Fast Stripes 647159 25970 25347 98%
64-bit Slow Random 1 3275 1406 43%
64-bit Slow Random 2 3466 2104 61%
64-bit Slow Random 3 3535 2213 63%
64-bit Slow Random 5 4085 2892 71%
64-bit Slow Random 7 4401 3268 74%
64-bit Slow Random 11 4848 3947 81%
64-bit Slow Random 17 5384 4547 84%
64-bit Slow Random 25 5836 5122 88%
64-bit Slow Random 38 6399 5783 90%
64-bit Slow Random 57 7228 6447 89%
64-bit Slow Random 86 8235 7274 88%
64-bit Slow Random 129 9634 8463 88%
64-bit Slow Random 194 11639 10253 88%
64-bit Slow Random 291 13688 12651 92%
64-bit Slow Random 437 15998 15098 94%
64-bit Slow Random 656 18175 17255 95%
64-bit Slow Random 985 20339 19425 96%
64-bit Slow Random 1477 22054 21375 97%
64-bit Slow Random 2216 24015 23644 98%
64-bit Slow Random 3325 26095 25874 99%
64-bit Slow Random 4987 27844 27937 100%
64-bit Slow Random 7481 30249 29707 98%
64-bit Slow Random 11222 32454 32832 101%
64-bit Slow Random 16834 34626 35323 102%
64-bit Slow Random 25251 36127 37509 104%
64-bit Slow Random 37876 39124 39568 101%
64-bit Slow Random 56815 42500 42506 100%
64-bit Slow Random 85222 48920 47964 98%
64-bit Slow Random 127834 55600 54124 97%
64-bit Slow Random 191751 59350 62839 106%
64-bit Slow Random 287626 66998 72918 109%
64-bit Slow Random 431439 78148 82267 105%
64-bit Slow Random 647159 87389 92557 106%
64-bit Slow Stripes 1 3276 1522 46%
64-bit Slow Stripes 2 3521 2284 65%
64-bit Slow Stripes 3 3555 2210 62%
64-bit Slow Stripes 5 4231 2877 68%
64-bit Slow Stripes 7 4391 3088 70%
64-bit Slow Stripes 11 4996 3696 74%
64-bit Slow Stripes 17 5423 4138 76%
64-bit Slow Stripes 25 5831 4602 79%
64-bit Slow Stripes 38 6365 5073 80%
64-bit Slow Stripes 57 6937 5605 81%
64-bit Slow Stripes 86 7755 6381 82%
64-bit Slow Stripes 129 9025 7556 84%
64-bit Slow Stripes 194 10670 9387 88%
64-bit Slow Stripes 291 12012 10983 91%
64-bit Slow Stripes 437 13252 12188 92%
64-bit Slow Stripes 656 14588 13283 91%
64-bit Slow Stripes 985 15800 14311 91%
64-bit Slow Stripes 1477 17229 15695 91%
64-bit Slow Stripes 2216 18491 16951 92%
64-bit Slow Stripes 3325 19295 17640 91%
64-bit Slow Stripes 4987 20063 17192 86%
64-bit Slow Stripes 7481 19797 18310 92%
64-bit Slow Stripes 11222 20821 18912 91%
64-bit Slow Stripes 16834 20926 19503 93%
64-bit Slow Stripes 25251 21411 19897 93%
64-bit Slow Stripes 37876 21724 20599 95%
64-bit Slow Stripes 56815 22096 21259 96%
64-bit Slow Stripes 85222 22456 21758 97%
64-bit Slow Stripes 127834 23013 22292 97%
64-bit Slow Stripes 191751 23893 23282 97%
64-bit Slow Stripes 287626 24366 23505 96%
64-bit Slow Stripes 431439 24783 23767 96%
64-bit Slow Stripes 647159 25399 24547 97%

@manofstick
Contributor Author

manofstick commented Jul 28, 2018

Argh, stuffed up these tests. Will fix soon... ish...

@manofstick
Contributor Author

Set comparison test

Gist: https://gist.github.com/manofstick/20ac739756c2063a0a2910e388622866

Bittage Current #5360 %
x86 2252 867 39%
x64 1805 889 49%

@manofstick manofstick changed the title [CompilerPerf] [WIP] Optimizations on Map (post #5307) [CompilerPerf] [WIP] Optimizations on Map and Set (post #5307) Aug 6, 2018
@forki
Contributor

forki commented May 29, 2019

@buybackoff are you going to send a Pull request?

@forki
Contributor

forki commented May 29, 2019

what's this Unsafe.As thingy?

@buybackoff
Contributor

@buybackoff are you going to send a Pull request?

Only if it will be merged. This PR is open so I do not understand how it will work if I send another one.

I believe updating only the layout and keeping height instead of size will be a quite straightforward single-file change with a near 2x perf boost for one of the most important collections, so it's very tempting to do that myself.

If it takes many days or weeks of discussing anything other than a low-hanging major performance win, I would rather pass. I only need ISeries members, the most important of which are not a part of F# Map. I will likely rewrite the whole thing in highly optimized C# if there is no clear path to merging this new layout in this repo.

If I change the code back to using IComparer<> the gains are still there. When using Comparer<_>.Default performance is close to KeyComparer, probably because the JIT is doing its work right (devirtualization, netcore3.0). When I cast KeyComparer to IComparer<> performance is "just" 30% above the current FSharpMap. So only changing the layout is a big win even with IComparer.

| Case | MOPS | Elapsed | GC0 | GC1 | GC2 | Memory |
|---|---|---|---|---|---|---|
| Get F# Core | 6.76 | 148 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
| Get IComparer (KeyComparer :> IComparer<'K>) | 9.01 | 111 ms | 3.0 | 0.0 | 0.0 | 4.961 MB |
| Get IComparer (Comparer<_>.Default) | 11.63 | 86 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
| Get KeyComparer | 12.82 | 78 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |

@buybackoff
Contributor

what's this Unsafe.As thingy?

An absolutely zero-cost cast. It lets you work with an instance as if it were of a different type. Slow Span<T> is implemented this way via the Pinnable<T> hack (no longer on GH, but here Vec has the same implementation).

At IL level it's just ret with a different type. It could be "safe unsafe" if you are 100% sure that the cast is valid like in this case with rebalance, or it could be very "unsafe unsafe".
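
A small sketch of the difference between a checked downcast and `Unsafe.As`, using hypothetical `Animal`/`Dog` classes rather than the tree types from the PR. It assumes a runtime where `System.Runtime.CompilerServices.Unsafe` is available (it ships with .NET Core 3.0 and later; older targets need the NuGet package of the same name).

```fsharp
open System.Runtime.CompilerServices

// Hypothetical types, only here to have a base/derived pair.
type Animal(name: string) =
    member this.Name = name

type Dog(name: string) =
    inherit Animal(name)
    member this.Bark () = name + " says woof"

let asAnimal : Animal = Dog("Rex") :> Animal

// Checked downcast: verified at run time, throws InvalidCastException on a mismatch.
let checkedDog = asAnimal :?> Dog

// Unchecked reinterpretation: no run-time check at all. This is only sound because
// we know the instance really is a Dog; with any other Animal the behaviour is undefined.
let uncheckedDog = Unsafe.As<Dog>(asAnimal)
```

Both values refer to the same object; the only difference is whether the runtime verifies the claim.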

@buybackoff
Contributor

buybackoff commented May 29, 2019

When I cast KeyComparer to IComparer<> performance is "just" 30% above current FSharpMap.

I was boxing KeyComparer on every operation, so non-zero GC/memory. Actually that case is 40% faster and GC/mem is zero.

| Case | MOPS | Elapsed | GC0 | GC1 | GC2 | Memory |
|---|---|---|---|---|---|---|
| Get F# Core | 6.76 | 148 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
| Get IComparer (KeyComparer :> IComparer<'K>) | 9.52 | 105 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
| Get IComparer (Comparer<_>.Default) | 11.63 | 86 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
| Get S.C.I.ISD | 11.63 | 86 ms | 0.0 | 0.0 | 0.0 | 0.001 MB |
| Get KeyComparer | 12.82 | 78 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |

The case with default comparer just becomes ImmutableSortedDictionary, which is implemented in the same way as MapTreeNode and single-case DU in this PR. But it takes more memory:

| Case | MOPS | Elapsed | GC0 | GC1 | GC2 | Memory |
|---|---|---|---|---|---|---|
| Add F# Core | 1.01 | 990 ms | 146.2 | 6.5 | 1.8 | 46.226 MB |
| Add S.C.I.ISD | 1.18 | 846 ms | 146.3 | 7.1 | 1.9 | 58.328 MB |
| Add KeyComparer | 1.85 | 541 ms | 150.0 | 4.0 | 2.0 | 45.356 MB |

The difference between 46 and 58 MB is expected, since the number of leaves and nodes is approximately equal for a balanced tree: (MapOne 32 + MapNode 56)/2 = 44 bytes on average vs. 56 bytes for every node in ImmutableSortedDictionary and the wide single-case DU.

@manofstick
Contributor Author

@buybackoff

I haven't looked at this in detail, but from a cursory glance it seems you're not using the LanguagePrimitives comparer. This was the whole point of #5307 and its predecessors, which this PR is really just finalizing.

Also, in the code as it is in this PR, the single-case discriminated union plus some null checks should (basically) have the same performance as the change to the class. I had done some testing at the time and found negligible differences.

Let me know if I'm misreading your contribution.

Thanks,
Paul.

@buybackoff
Contributor

Let me know if I'm misreading your contribution.

Memory usage. It's the worst thing about trees and you proposed to make it worse for no particular reason.

In my original comment I quoted you, but not fully:

The down side is that it does increase memory for leaf-nodes by 2*sizeof<ptr>+sizeof<int>. But given that it's sizeof<key>+sizeof<value>+object overhead where object overhead = 8 for 32-bit or 16 for 64-bit JIT I don't think this is a showstopper, but let me know (sooner rather than later would be good...)

Everything else is similar to S.C.I.ISD and single-case wide DU from this PR, as I wrote in the last comment above.

Another point is that your changes are so big that merging it will take time and is blocked on 5307, while there are three independent changes:

  • Comparer
  • Layout
  • Size vs. height

They could be in any order and released incrementally. Changing layout is the simplest of them with quick performance gain.

@manofstick
Contributor Author

@buybackoff

Memory usage. It's the worst thing about trees and you proposed to make it worse for no particular reason.

% wise for the complete objects it's small. If you are worried about memory then you're using the wrong data structure.

They could be in any order and released incrementally. Changing layout is the simplest of them with quick performance gain.

Well just go and create a different PR then. I don't think any comments here do anything but muddy the already murky water of this PR chain?

@buybackoff
Contributor

buybackoff commented May 30, 2019

% wise for the complete objects it's small.

32/56 bytes combination is quite typical. Useful payload is 16, the waste is 44-16 = 28 for small leaves and 56 - 16 = 40 for single-case wide DU. So there is 12 bytes more waste per 16 bytes payload. Or in different terms, using 2.75x more memory than needed vs 3.5x.
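
The arithmetic in this comment can be checked directly. The byte figures are the ones quoted in the thread for x64 (they are not measured here), and the binding names are made up for the sketch.

```fsharp
// Figures quoted in the thread (x64, bytes); not measured here.
let leafBytes = 32.0    // MapOne
let nodeBytes = 56.0    // MapNode, and every node in the wide single-case layout
let payloadBytes = 16.0 // key + value

// In a balanced tree roughly half the nodes are leaves, so average the two sizes.
let mixedAverage = (leafBytes + nodeBytes) / 2.0 // 44.0
let mixedWaste = mixedAverage - payloadBytes     // 28.0
let wideWaste = nodeBytes - payloadBytes         // 40.0
let extraWaste = wideWaste - mixedWaste          // 12.0
let mixedRatio = mixedAverage / payloadBytes     // 2.75
let wideRatio = nodeBytes / payloadBytes         // 3.5
```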

If you are worried about memory then you're using the wrong data structure.

Actually I need this data structure for structural sharing, to save memory and avoid copying large data when I need maps at different timestamps with few changes between them. There is no real-world reason to use trees as an IReadOnlyDictionary replacement unless structural sharing is required. FP's immutability is a case of structural sharing, but with the emphasis on safety from shared mutable state rather than on memory.

@manofstick
Contributor Author

OK, I hadn't looked properly. Leaves are a smaller type, with inheritance used for the other nodes. OK, cool, sounds good.

@zpodlovics

The overhead could be smaller if StructLayout attribute is enabled / applied to DU (Please note, classes are also allowed to have StructLayout attribute and field layout attributes):

#5215

@buybackoff
Contributor

@zpodlovics

The overhead could be smaller if StructLayout attribute is enabled / applied to DU

Why? Additional overhead to leaves is pointer + pointer + height/size: int. The int field is always padded to 8 bytes on x64 in this case.

@zpodlovics

zpodlovics commented May 30, 2019

If you pack the fields (Pack=1 attribute) there will be no padding: less memory consumption in exchange for (a bit) less efficient data access. Also, unless the layout is fixed and Pack=1 is specified, the JIT is free to reorder the fields and change the padding, so using Unsafe.As may give some surprising results.

@buybackoff
Contributor

A smaller find method. It is (and was) on par with a C# loop and is translated to almost identical code, but the bad habit of adding [<MethodImpl(MethodImplOptions.AggressiveInlining)>] backfired: Compare was not inlined, probably due to the JIT's limit on how much it can inline even with the attribute. That did not affect the IComparer case.

  let rec find (comparer: KeyComparer<'K>) k (m:MapTree<'K,'b>) =
    if isEmpty m then raise (System.Collections.Generic.KeyNotFoundException())
    else
      let c = comparer.Compare(k,m.Key)
      if c = 0 then m.Value
      else
        match m with
        | :? MapTreeNode<'K,'b> as mn ->
          find comparer k (if c < 0 then mn.Left else mn.Right)
        | _ -> raise (System.Collections.Generic.KeyNotFoundException())
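
| Case                                    |  MOPS | Elapsed | GC0 | GC1 | GC2 |   Memory |
|---------------------------------------- |------:|--------:|----:|----:|----:|---------:|
| Get KeyComparer                         | 14.47 |   69 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
| Get S.C.I.ISD                           | 11.84 |   84 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
| Get IComparer (Comparer<_>.Default)     | 11.25 |   89 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |
| Get F# Core                             |  6.86 |  146 ms | 0.0 | 0.0 | 0.0 | 0.000 MB |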

(The benchmark is adding/getting 1M <long,long> in a for loop, repeated 100 times in a process with real-time priority)

The last two rows in the table show only the effect of the layout change; the first row is with an inlined Compare.
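
For readers who want to run the lookup in isolation, here is a minimal self-contained sketch. The type definitions are assumptions modeled on the names used in the snippet above (`MapTree`, `MapTreeNode`, `Key`, `Value`, `Left`, `Right`), with `null` standing for the empty tree. It uses the standard `IComparer<'K>` because `KeyComparer<'K>` is not defined in this thread, and it returns an option (`tryFind`) instead of raising.

```fsharp
open System.Collections.Generic

// Assumed layout: a leaf stores only key and value. null represents the empty tree.
[<AllowNullLiteral>]
type MapTree<'K, 'V>(k: 'K, v: 'V) =
    member this.Key = k
    member this.Value = v

// Inner nodes inherit from the leaf type and add children plus the height.
[<Sealed; AllowNullLiteral>]
type MapTreeNode<'K, 'V>(k: 'K, v: 'V, left: MapTree<'K, 'V>, right: MapTree<'K, 'V>, h: int) =
    inherit MapTree<'K, 'V>(k, v)
    member this.Left = left
    member this.Right = right
    member this.Height = h

let isEmpty (m: MapTree<'K, 'V>) = isNull m

// Option-returning variant of the find function shown above.
let rec tryFind (comparer: IComparer<'K>) (k: 'K) (m: MapTree<'K, 'V>) : 'V option =
    if isEmpty m then None
    else
        let c = comparer.Compare(k, m.Key)
        if c = 0 then Some m.Value
        else
            match m with
            | :? MapTreeNode<'K, 'V> as mn ->
                // One type test per level, only after the key comparison missed.
                tryFind comparer k (if c < 0 then mn.Left else mn.Right)
            | _ -> None // a leaf has no children to descend into

// Hand-built three-element tree: two leaves under one inner node.
let tree : MapTree<int64, string> =
    MapTreeNode(2L, "two", MapTree(1L, "one"), MapTree(3L, "three"), 2) :> MapTree<_, _>

let cmp = Comparer<int64>.Default :> IComparer<int64>
printfn "%A %A" (tryFind cmp 3L tree) (tryFind cmp 4L tree)
```

In this sketch the key comparison runs before any type test, so a hit on a leaf or a node never pays for the `:?` check; only descending a level does.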

@buybackoff
Contributor

@zpodlovics

If you pack the fields (Pack=1 attribute) there will be no padding: less memory consumption in exchange for (a bit) less efficient data access. Also, unless the layout is fixed and Pack=1 is specified, the JIT is free to reorder the fields and change the padding, so using Unsafe.As may give some surprising results.

.NET guarantees field alignment so that access to fields is atomic (when < word size). When it is possible to pack without breaking pointer alignment, .NET does just the right thing. Using https://github.com/SergeyTeplyakov/ObjectLayoutInspector for MapTreeNode<short,short>:

Type layout for 'MapTreeNode`2'
Size: 24 bytes. Paddings: 0 bytes (%0 of empty space)
|===================================|
| Object Header (8 bytes)           |
|-----------------------------------|
| Method Table Ptr (8 bytes)        |
|===================================|
|   0-1: Int16 Key@ (2 bytes)       |
|-----------------------------------|
|   2-3: Int16 Value@ (2 bytes)     |
|-----------------------------------|
|   4-7: Int32 Height@ (4 bytes)    |
|-----------------------------------|
|  8-15: MapTree`2 Left@ (8 bytes)  |
|-----------------------------------|
| 16-23: MapTree`2 Right@ (8 bytes) |
|===================================|

Note that Height is moved above the pointers. The same is true for MapTreeNode<long,int>:

Type layout for 'MapTreeNode`2'
Size: 32 bytes. Paddings: 0 bytes (%0 of empty space)
|===================================|
| Object Header (8 bytes)           |
|-----------------------------------|
| Method Table Ptr (8 bytes)        |
|===================================|
|   0-7: Int64 Key@ (8 bytes)       |
|-----------------------------------|
|  8-11: Int32 Value@ (4 bytes)     |
|-----------------------------------|
| 12-15: Int32 Height@ (4 bytes)    |
|-----------------------------------|
| 16-23: MapTree`2 Left@ (8 bytes)  |
|-----------------------------------|
| 24-31: MapTree`2 Right@ (8 bytes) |
|===================================|

Tight packing for structs only works when there are no reference type fields.
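
As a rough illustration of the `Pack = 1` point for a struct without reference-type fields, the following sketch compares two hypothetical structs (`DefaultPack` and `TightPack` are made-up names, not types from FSharp.Core). It uses `Marshal.SizeOf`, which reports the marshaled size, so treat it as an approximation of the effect rather than a statement about the managed layout of `MapTreeNode`.

```fsharp
open System.Runtime.InteropServices

// Hypothetical struct with default packing: the int32 is followed by padding
// so that the total size stays a multiple of the int64 alignment.
[<Struct; StructLayout(LayoutKind.Sequential)>]
type DefaultPack =
    val Key: int64
    val Height: int32

// Same fields with Pack = 1: no padding is added.
[<Struct; StructLayout(LayoutKind.Sequential, Pack = 1)>]
type TightPack =
    val Key: int64
    val Height: int32

let defaultSize = Marshal.SizeOf<DefaultPack>()   // typically 16 on x64
let tightSize = Marshal.SizeOf<TightPack>()       // 8 + 4 = 12

printfn "default: %d bytes, Pack = 1: %d bytes" defaultSize tightSize
```

Neither struct contains a reference, which is the case where tight packing applies according to the comment above; the `MapTreeNode` layouts shown earlier are classes with pointer fields, and their dumps already report zero padding.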

@KevinRansom
Member

@manofstick, @dsyme -- please don't think I am picking on you :-) Do we want to pursue this PR? The last commit was nearly two years ago. I am doing my best to get the number of PRs open on the repo down to a reasonable number of "owned -- active PRs".

Thanks

Kevin

@manofstick
Contributor Author

@KevinRansom - Meh, I think there is zero interest, so killing it.

@manofstick manofstick closed this May 28, 2020
@abelbraaksma
Contributor

@manofstick, certainly not zero. These and other perf improvements looked really promising. I think recently, some of the ideas about optimizing generic comparisons for structs like DateTimes were restarted, though that seemed more like cherry picking as opposed to your general solution here.

It would be nice if we could resurrect this, and related PRs, somehow.

@KevinRansom
Member

I think setting a goal and driving towards that one thing may be the way forward. These PRs were very broad and difficult to get one's head around. I know @TIHan is similarly frustrated by the fact that we let expediency drive our development rather than a big rethink. Anyway, I am sorry to see these PRs go, but the repo has got very messy over the last couple of years with orphaned PRs. Usually the first page turns over pretty rapidly, but if a PR migrates to the second or third page, it is pretty much toast for attention span.

@abelbraaksma
Contributor

abelbraaksma commented May 28, 2020

@KevinRansom Yes. Perhaps we need some way to champion certain PRs that have a broad benefit to the community, and/or are nearly complete and just need that extra push.

I'm thinking of something like what @cartermp was doing with the three-weekly road map updates in a pinned issue, or somewhat like what C# is doing with championing, but for PRs. Otherwise we risk losing some of those excellent community contributions.

buybackoff added a commit to buybackoff/fsharp that referenced this pull request Sep 25, 2020
Following discussion and POC code from dotnet#5360 (comment)

Changes are very straightforward and do not touch public API:

* Performance improves by a huge margin
* Code size is smaller or same
* Memory is same
* No low level tricks, just simple code (see `asNode` comments for potential micro-optimizations, which are not visible after all; these comments are to be deleted)

Benchmarks code is here: https://github.com/buybackoff/fsharp-benchmarks

|      Method |    Job | BuildConfiguration |     Size |              Mean |            Error |           StdDev | Rank |      Gen 0 |    Gen 1 |   Gen 2 |   Allocated | Code Size |
|------------ |------- |------------------- |--------- |------------------:|-----------------:|-----------------:|-----:|-----------:|---------:|--------:|------------:|----------:|
|     getItem |  After |         LocalBuild |      100 |          36.21 ns |         0.199 ns |         0.167 ns |    1 |          - |        - |       - |           - |     126 B |
|     getItem | Before |            Default |      100 |          62.51 ns |         0.143 ns |         0.127 ns |    2 |          - |        - |       - |           - |     126 B |
|     getItem |  After |         LocalBuild |    10000 |          76.57 ns |         0.140 ns |         0.124 ns |    3 |          - |        - |       - |           - |     126 B |
|     getItem | Before |            Default |    10000 |         120.02 ns |         0.182 ns |         0.170 ns |    4 |          - |        - |       - |           - |     126 B |
|     getItem |  After |         LocalBuild | 10000000 |         129.45 ns |         0.126 ns |         0.118 ns |    5 |          - |        - |       - |           - |     126 B |
|     getItem | Before |            Default | 10000000 |         209.35 ns |         0.496 ns |         0.464 ns |    6 |          - |        - |       - |           - |     126 B |
|             |        |                    |          |                   |                  |                  |      |            |          |         |             |           |
| containsKey |  After |         LocalBuild |      100 |          35.63 ns |         0.201 ns |         0.188 ns |    1 |          - |        - |       - |           - |     177 B |
| containsKey | Before |            Default |      100 |          64.01 ns |         0.351 ns |         0.328 ns |    2 |          - |        - |       - |           - |     276 B |
| containsKey |  After |         LocalBuild |    10000 |          65.63 ns |         0.150 ns |         0.125 ns |    3 |          - |        - |       - |           - |     177 B |
| containsKey | Before |            Default |    10000 |         123.82 ns |         0.149 ns |         0.139 ns |    5 |          - |        - |       - |           - |     276 B |
| containsKey |  After |         LocalBuild | 10000000 |          95.05 ns |         0.082 ns |         0.072 ns |    4 |          - |        - |       - |           - |     177 B |
| containsKey | Before |            Default | 10000000 |         204.39 ns |         0.338 ns |         0.282 ns |    6 |          - |        - |       - |           - |     276 B |
|             |        |                    |          |                   |                  |                  |      |            |          |         |             |           |
|   itemCount |  After |         LocalBuild |      100 |         231.39 ns |         0.406 ns |         0.360 ns |    1 |          - |        - |       - |           - |      96 B |
|   itemCount | Before |            Default |      100 |         539.74 ns |         1.923 ns |         1.798 ns |    2 |          - |        - |       - |           - |     151 B |
|   itemCount |  After |         LocalBuild |    10000 |      33,160.50 ns |       194.709 ns |       182.131 ns |    3 |          - |        - |       - |           - |      96 B |
|   itemCount | Before |            Default |    10000 |      63,074.34 ns |       138.682 ns |       129.724 ns |    4 |          - |        - |       - |           - |     151 B |
|   itemCount |  After |         LocalBuild | 10000000 |  62,332,911.90 ns |   252,973.481 ns |   224,254.402 ns |    5 |          - |        - |       - |       148 B |      96 B |
|   itemCount | Before |            Default | 10000000 |  94,745,625.56 ns |   205,640.690 ns |   192,356.429 ns |    6 |          - |        - |       - |           - |     151 B |
|             |        |                    |          |                   |                  |                  |      |            |          |         |             |           |
| iterForeach |  After |         LocalBuild |      100 |       3,355.75 ns |         9.540 ns |         7.448 ns |    1 |     0.9727 |        - |       - |      6120 B |     291 B |
| iterForeach | Before |            Default |      100 |       3,866.56 ns |        10.148 ns |         8.996 ns |    2 |     0.9689 |        - |       - |      6120 B |     291 B |
| iterForeach |  After |         LocalBuild |    10000 |     348,359.43 ns |     1,148.753 ns |       959.260 ns |    3 |    95.2148 |        - |       - |    600120 B |     291 B |
| iterForeach | Before |            Default |    10000 |     398,419.61 ns |       513.959 ns |       480.758 ns |    4 |    95.2148 |        - |       - |    600120 B |     291 B |
| iterForeach |  After |         LocalBuild | 10000000 | 391,889,200.00 ns | 1,604,306.946 ns | 1,500,669.712 ns |    5 | 95000.0000 |        - |       - | 600000120 B |     321 B |
| iterForeach | Before |            Default | 10000000 | 445,099,028.57 ns | 1,380,498.715 ns | 1,223,776.153 ns |    6 | 95000.0000 |        - |       - | 600000120 B |     321 B |
|             |        |                    |          |                   |                  |                  |      |            |          |         |             |           |
|     addItem |  After |         LocalBuild |      100 |         181.25 ns |         0.961 ns |         0.899 ns |    1 |     0.0586 |   0.0003 |       - |       369 B |     621 B |
|     addItem | Before |            Default |      100 |         311.85 ns |         0.601 ns |         0.562 ns |    2 |     0.0586 |        - |       - |       369 B |     697 B |
|     addItem |  After |         LocalBuild |    10000 |      40,893.49 ns |       174.683 ns |       163.398 ns |    3 |    11.0156 |   3.2813 |       - |     69324 B |     621 B |
|     addItem | Before |            Default |    10000 |      71,746.33 ns |       130.309 ns |       121.891 ns |    4 |    11.0156 |   3.3594 |       - |     69324 B |     697 B |
|     addItem |  After |         LocalBuild | 10000000 |  87,178,251.47 ns |   250,148.324 ns |   233,988.898 ns |    5 | 18680.0000 | 960.0000 | 10.0000 | 117146915 B |     621 B |
|     addItem | Before |            Default | 10000000 | 146,799,424.80 ns |   286,531.195 ns |   268,021.458 ns |    6 | 18680.0000 | 960.0000 | 10.0000 | 117146915 B |     697 B |
|             |        |                    |          |                   |                  |                  |      |            |          |         |             |           |
|  removeItem |  After |         LocalBuild |      100 |          13.64 ns |         0.112 ns |         0.105 ns |    1 |     0.0064 |        - |       - |        40 B |     469 B |
|  removeItem | Before |            Default |      100 |          16.38 ns |         0.071 ns |         0.067 ns |    2 |     0.0064 |        - |       - |        40 B |     519 B |
|  removeItem |  After |         LocalBuild |    10000 |       1,329.24 ns |         9.087 ns |         8.056 ns |    3 |     0.6372 |        - |       - |      4000 B |     469 B |
|  removeItem | Before |            Default |    10000 |       1,607.21 ns |         5.566 ns |         5.206 ns |    4 |     0.6372 |        - |       - |      4000 B |     519 B |
|  removeItem |  After |         LocalBuild | 10000000 |   1,232,230.00 ns |     6,303.414 ns |     5,896.218 ns |    5 |   630.0000 |        - |       - |   4000000 B |     469 B |
|  removeItem | Before |            Default | 10000000 |   1,801,088.33 ns |     8,945.674 ns |     8,367.789 ns |    6 |   630.0000 |        - |       - |   4000000 B |     519 B |
vzarytovskii pushed a commit that referenced this pull request Sep 26, 2020
* FSharp.Core: Map: optimize tree layout

* Simplify node ctors

* FSharp.Core: Map: delete notes in asNode

* FSharp.Core: Map: fix typo in spliceOutSuccessor

* FSharp.Core: Map: remove unused open
nosami pushed a commit to xamarin/visualfsharp that referenced this pull request Feb 23, 2021
* FSharp.Core: Map: optimize tree layout

* Simplify node ctors

* FSharp.Core: Map: delete notes in asNode

* FSharp.Core: Map: fix typo in spliceOutSuccessor

* FSharp.Core: Map: remove unused open
@vzarytovskii
Member

vzarytovskii commented Oct 16, 2023

It's a shame all these PRs have been abandoned. I will be resurrecting them after the .NET 8 release (in November). Compiler perf should be among the top priorities.

Labels
Area-Library Issues for FSharp.Core not covered elsewhere