Rewrite HashSet<T>'s implementation based on Dictionary<T>'s #37180

Merged
merged 9 commits into from
May 31, 2020

stephentoub

Fixes #37111
Contributes to #1989

This moves HashSet into corelib, then effectively deletes HashSet's data structure and replaces it with the one used by Dictionary, updated for the differences (e.g. just a value rather than a key and a value). HashSet used to have basically the same implementation, but Dictionary has evolved significantly and HashSet hasn't; this brings them to basic parity on implementation.
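For readers unfamiliar with the layout being shared, here is a minimal sketch of a set built on a buckets array plus an entries array, each entry holding a hash code, a chain link, and just a value. The `MiniSet<T>` type and its field names are hypothetical, and it omits removal, free lists, prime sizing, and the other optimizations in the real code; it illustrates the general shape and is not the implementation in this PR.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified illustration of a buckets + entries hash set.
public sealed class MiniSet<T>
{
    private struct Entry
    {
        public int HashCode; // cached hash code of Value
        public int Next;     // index of the next entry in the same bucket, or -1
        public T Value;      // a set stores only a value, no key/value pair
    }

    private int[] _buckets;   // 1-based index into _entries; 0 means "empty bucket"
    private Entry[] _entries;
    private int _count;
    private readonly IEqualityComparer<T> _comparer = EqualityComparer<T>.Default;

    public MiniSet(int capacity = 7)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException(nameof(capacity));
        _buckets = new int[capacity];
        _entries = new Entry[capacity];
    }

    public int Count => _count;

    private int Hash(T value) => value == null ? 0 : _comparer.GetHashCode(value);

    public bool Contains(T value)
    {
        int hashCode = Hash(value);
        // Walk the chain that starts at this value's bucket.
        int i = _buckets[(uint)hashCode % (uint)_buckets.Length] - 1;
        while (i >= 0)
        {
            if (_entries[i].HashCode == hashCode && _comparer.Equals(_entries[i].Value, value))
                return true;
            i = _entries[i].Next;
        }
        return false;
    }

    public bool Add(T value)
    {
        if (Contains(value)) return false;
        if (_count == _entries.Length) Resize();

        int hashCode = Hash(value);
        ref int bucket = ref _buckets[(uint)hashCode % (uint)_buckets.Length];
        // The new entry becomes the head of its bucket's chain.
        _entries[_count] = new Entry { HashCode = hashCode, Next = bucket - 1, Value = value };
        _count++;
        bucket = _count; // store as 1-based index
        return true;
    }

    private void Resize()
    {
        int newSize = _entries.Length * 2 + 1;
        var newBuckets = new int[newSize];
        Array.Resize(ref _entries, newSize);
        // Rebuild every chain against the new bucket count.
        for (int i = 0; i < _count; i++)
        {
            ref int bucket = ref newBuckets[(uint)_entries[i].HashCode % (uint)newSize];
            _entries[i].Next = bucket - 1;
            bucket = i + 1;
        }
        _buckets = newBuckets;
    }
}

public static class MiniSetDemo
{
    public static void Main()
    {
        var set = new MiniSet<string>(3);
        set.Add("a");
        set.Add("b");
        set.Add("a"); // duplicate, ignored
        Console.WriteLine($"{set.Count} {set.Contains("a")} {set.Contains("z")}");
    }
}
```

Keeping all entries in one contiguous array means a lookup touches the buckets array once and then follows integer links, with no per-element allocation.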

Based on perf tests, I veered away from Dictionary's implementation in a few places (e.g. a goto-based implementation in the core find method led to a significant regression for Int32-based Contains operations), and we should follow up to understand whether Dictionary should be changed as well, or why there's a difference between the two. @benaadams, if you have some time, it'd probably be worth looking at this again; maybe you'll get different numbers than I did.

Functionally, bringing over Dictionary's implementation yields a few notable changes, namely that Remove and Clear no longer invalidate enumerations. The tests have been updated accordingly.
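As an illustration of that behavioral change, the following sketch removes elements from a HashSet while enumerating it. The class and method names are hypothetical. On a runtime that includes this change the loop is expected to complete; on earlier runtimes the same loop would be expected to throw InvalidOperationException once Remove succeeds.

```csharp
using System;
using System.Collections.Generic;

public static class RemoveDuringEnumeration
{
    // Removes every even number from the set while enumerating it
    // and returns how many elements were removed.
    public static int RemoveEvens(HashSet<int> set)
    {
        int removed = 0;
        foreach (int item in set)
        {
            // Assumes a runtime where Remove does not invalidate the enumerator.
            if (item % 2 == 0 && set.Remove(item))
                removed++;
        }
        return removed;
    }

    public static void Main()
    {
        var set = new HashSet<int> { 1, 2, 3, 4, 5, 6 };
        int removed = RemoveEvens(set);
        Console.WriteLine($"removed {removed}, remaining {set.Count}");
    }
}
```

Note that Add during enumeration is a different matter; nothing in the description suggests that it stopped invalidating enumerators.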

With HashSet now in corelib, I also updated two Dictionary uses I found in corelib that were using Dictionary as a set and switched them to use HashSet.
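The PR text does not show those two call sites, so the following is only a hypothetical sketch of the general pattern: a Dictionary whose values are never read, used purely for membership, next to the equivalent HashSet version. All names here are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SetUsage
{
    // Before: a Dictionary used as a set; the bool value is a throwaway.
    public static List<string> DistinctWithDictionary(IEnumerable<string> items)
    {
        var seen = new Dictionary<string, bool>();
        var result = new List<string>();
        foreach (string item in items)
        {
            if (seen.TryAdd(item, true))
                result.Add(item);
        }
        return result;
    }

    // After: a HashSet expresses the intent directly and stores no dummy value.
    public static List<string> DistinctWithHashSet(IEnumerable<string> items)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        foreach (string item in items)
        {
            if (seen.Add(item))
                result.Add(item);
        }
        return result;
    }

    public static void Main()
    {
        var input = new[] { "a", "b", "a", "c", "b" };
        Console.WriteLine(string.Join(",", DistinctWithDictionary(input)));
        Console.WriteLine(string.Join(",", DistinctWithHashSet(input)));
    }
}
```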

Running the dotnet/performance perf tests:

dotnet run -c Release -f net5.0 --filter System.Collections.*.HashSet --corerun d:\coreclrtest\master\corerun.exe d:\coreclrtest\pr\corerun.exe --join
| Type | Toolchain | Size | Mean | Ratio | Allocated |
|---|---|---:|---:|---:|---:|
| CtorDefaultSize<Int32> | master | ? | 4.777 ns | 1.00 | 64 B |
| CtorDefaultSize<Int32> | pr | ? | 6.473 ns | 1.36 | 72 B |
| CtorDefaultSize<String> | master | ? | 13.706 ns | 1.00 | 64 B |
| CtorDefaultSize<String> | pr | ? | 13.156 ns | 0.96 | 72 B |
| ContainsTrueComparer<Int32> | master | 512 | 5,882.158 ns | 1.00 | - |
| ContainsTrueComparer<Int32> | pr | 512 | 5,474.605 ns | 0.93 | - |
| ContainsTrueComparer<String> | master | 512 | 22,759.196 ns | 1.00 | - |
| ContainsTrueComparer<String> | pr | 512 | 22,814.504 ns | 1.00 | - |
| AddGivenSize<Int32> | master | 512 | 5,034.736 ns | 1.00 | 8456 B |
| AddGivenSize<Int32> | pr | 512 | 4,118.598 ns | 0.82 | 8464 B |
| AddGivenSize<String> | master | 512 | 17,223.668 ns | 1.00 | 10536 B |
| AddGivenSize<String> | pr | 512 | 11,328.307 ns | 0.66 | 10544 B |
| CreateAddAndRemove<Int32> | master | 512 | 13,473.343 ns | 1.00 | 27712 B |
| CreateAddAndRemove<Int32> | pr | 512 | 11,269.323 ns | 0.84 | 27720 B |
| CreateAddAndRemove<String> | master | 512 | 43,108.066 ns | 1.00 | 34480 B |
| CreateAddAndRemove<String> | pr | 512 | 28,070.657 ns | 0.65 | 34488 B |
| CtorFromCollection<Int32> | master | 512 | 8,066.837 ns | 1.00 | 8488 B |
| CtorFromCollection<Int32> | pr | 512 | 6,705.302 ns | 0.83 | 8496 B |
| CtorFromCollection<String> | master | 512 | 20,284.380 ns | 1.00 | 10568 B |
| CtorFromCollection<String> | pr | 512 | 12,863.063 ns | 0.63 | 10576 B |
| CtorGivenSize<Int32> | master | 512 | 505.879 ns | 1.00 | 8456 B |
| CtorGivenSize<Int32> | pr | 512 | 493.664 ns | 0.97 | 8464 B |
| CtorGivenSize<String> | master | 512 | 650.880 ns | 1.00 | 10536 B |
| CtorGivenSize<String> | pr | 512 | 640.043 ns | 0.98 | 10544 B |
| ContainsFalse<Int32> | master | 512 | 2,427.766 ns | 1.00 | - |
| ContainsFalse<Int32> | pr | 512 | 2,360.871 ns | 0.97 | - |
| ContainsFalse<String> | master | 512 | 15,994.814 ns | 1.00 | - |
| ContainsFalse<String> | pr | 512 | 8,593.030 ns | 0.54 | - |
| ContainsTrue<Int32> | master | 512 | 2,945.063 ns | 1.00 | - |
| ContainsTrue<Int32> | pr | 512 | 2,794.102 ns | 0.95 | - |
| ContainsTrue<String> | master | 512 | 16,805.925 ns | 1.00 | - |
| ContainsTrue<String> | pr | 512 | 10,333.508 ns | 0.61 | - |
| CreateAddAndClear<Int32> | master | 512 | 8,900.466 ns | 1.00 | 27712 B |
| CreateAddAndClear<Int32> | pr | 512 | 7,554.481 ns | 0.85 | 27720 B |
| CreateAddAndClear<String> | master | 512 | 22,007.874 ns | 1.00 | 34480 B |
| CreateAddAndClear<String> | pr | 512 | 15,382.689 ns | 0.70 | 34488 B |
| IterateForEach<Int32> | master | 512 | 1,348.107 ns | 1.00 | - |
| IterateForEach<Int32> | pr | 512 | 977.800 ns | 0.73 | - |
| IterateForEach<String> | master | 512 | 2,301.860 ns | 1.00 | - |
| IterateForEach<String> | pr | 512 | 1,589.925 ns | 0.69 | - |

cc: @benaadams, @danmosemsft, @GrabYourPitchforks, @eiriktsarpalis, @layomia

ghost commented May 29, 2020

Tagging subscribers to this area: @eiriktsarpalis
Notify danmosemsft if you want to be subscribed.

@stephentoub stephentoub added the tenet-performance Performance related issue label May 29, 2020
@stephentoub stephentoub added this to the 5.0 milestone May 29, 2020
benaadams commented May 29, 2020

> I veered away from Dictionary's implementation in a few places (e.g. a goto-based implementation in the core find method led to a significant regression for Int32-based Contains operations), and we should follow up to understand whether Dictionary should be changed as well, or why there's a difference between the two. @benaadams, if you have some time, it'd probably be worth looking at this again; maybe you'll get different numbers than I did.

I was adjusting it based on the asm output from the JIT; however, that has improved considerably in 5.0, so the trade-off may have changed. Also, Dictionary's core find method is a ref return and its Entry struct is larger, which may be factors?

{
Debug.Assert(!comparer.Equals(_slots[i].value, value));
// If we hit the collision threshold we'll need to switch to the comparer which is using randomized string hashing

Do we have dictionary tests that go down the "secure string comparer" paths? I thought we did but can't immediately find them.

I can't find any either. It's probably worth adding some, but it'd also take a little work to find collisions. Maybe @GrabYourPitchforks has some in his back pocket :)

I think it should read "comparer that uses..." instead of "comparer which is using..." (I know this is a copy-paste).

We (@maryamariyan?) tested it and I recall it didn't take looping terribly long to find 100 collisions but I guess we didn't check them in...

@danmoseley

cc @eanova -- fyi we took another run at your #1989 😄


@danmoseley danmoseley left a comment

Read it all and seems good to me.

@danmoseley

I just noticed src\libraries\System.Linq\src\System\Linq\Set.cs, which seems to be a lightweight HashSet optimized for no-add-after-remove. I do not know how much we care about its perf, but it is missing some of these optimizations (e.g. fastmod) and does not use a prime size for its buckets array, which seems like a mistake (?)
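For context on the "fastmod" optimization mentioned here, the sketch below shows a Lemire-style fast modulo: a per-divisor multiplier is precomputed once (for example when the buckets array is resized), and each subsequent `value % divisor` is replaced by multiplications and shifts. The class and method names are hypothetical, the variant shown assumes the divisor is at most int.MaxValue, and whether it matches the runtime's internal helper exactly is an assumption.

```csharp
using System;

public static class FastModSketch
{
    // Precomputed once per divisor, e.g. whenever the bucket count changes.
    public static ulong GetMultiplier(uint divisor) => ulong.MaxValue / divisor + 1;

    // Computes value % divisor without a division instruction.
    // Assumes divisor <= int.MaxValue and that multiplier came from GetMultiplier(divisor).
    public static uint FastMod(uint value, uint divisor, ulong multiplier)
    {
        // The 64-bit multiplication is meant to wrap around, hence unchecked.
        return unchecked((uint)(((((multiplier * value) >> 32) + 1) * divisor) >> 32));
    }

    public static void Main()
    {
        uint bucketCount = 17; // a prime bucket count, as in the discussion above
        ulong multiplier = GetMultiplier(bucketCount);
        foreach (uint hash in new uint[] { 0, 16, 17, 12345, uint.MaxValue })
        {
            Console.WriteLine($"{hash}: fastmod={FastMod(hash, bucketCount, multiplier)}, %={hash % bucketCount}");
        }
    }
}
```

The trade-off is one extra 8-byte field per collection for the multiplier in exchange for cheaper bucket selection on every lookup.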

@stephentoub

> Read it all and seems good to me.

Thanks for reviewing.

@stephentoub

A few legitimate failures in the immutable collection tests; taking a look...

And factor out InsertionBehavior into its own file
@stephentoub stephentoub merged commit 262948a into dotnet:master May 31, 2020
@stephentoub stephentoub deleted the hashsetperf branch May 31, 2020 22:48
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Successfully merging this pull request may close these issues.

Use NonRandomizedStringEqualityComparer in HashSet
6 participants