possible perf optimizations for System.Linq.Set & System.Linq.Parallel.Set #47173

Closed
FyiurAmron opened this issue Jan 19, 2021 · 7 comments

@FyiurAmron
Contributor

FyiurAmron commented Jan 19, 2021

As in the title: I think it'd be worthwhile to at least investigate possible performance optimizations for System.Linq.Set (and possibly System.Linq.Parallel.Set as well).

Description

As mentioned by @danmosemsft in #37180 (comment), there's a possible performance improvement from aligning it with the improvements already present in HashSet/Dictionary.

Data

Things/differences that come to mind here as possible optimizations:

  • A default Comparer is created and used even when none is needed/provided. Branching on comparer == null instead would allow using the potentially faster value.GetHashCode() in place of InternalGetHashCode(value), which further delegates to _comparer.GetHashCode(value) (and, in this case, basically just calls value.GetHashCode() anyway). Also, please note that EqualityComparer<T>.Default.Equals doesn't devirtualize in a shared generic (#10050), which can currently hurt performance further.
  • The bucket index (hashCode % _buckets.Length) is calculated twice (with no caching) in Find/Add, even when no bucket resize has happened, and there's no provision for https://github.com/stephentoub/runtime/blob/master/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/HashSet.cs#L305
  • It doesn't use HashHelpers.ExpandPrime(_count) etc., so it may have a noticeably higher collision rate, impacting performance.
  • It doesn't take the size of the source collection into account, in cases like https://github.com/dotnet/runtime/blob/master/src/libraries/System.Linq/src/System/Linq/Distinct.cs#L43 , which I suppose was intended to save memory. OTOH, it causes tons of new & Array.Copy reallocs when iterating any collection of non-trivial size. The new size of _count * 2 + 1 is also a smaller number than the one you'd get from a prime series (which usually goes slightly above it), which further increases the number of allocs. While it's reasonable to assume that Distinct will usually be iterated for less than the size of the original collection, using a more reasonable initial threshold than simply 7 (e.g. Math.Clamp(NextPrime(initialCollectionSize / 4), 7, 131), or any sensible ballpark figure along this line) would decrease the number of early allocs significantly without allocating too much bucket memory in advance; see the sketch after this list.

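To make the above concrete, here's a minimal sketch of how a reworked Set<T> core could combine these points (null-comparer branching, a cached bucket index, prime-based resizing, and source-size-based initial sizing). This is illustrative only, not the actual runtime code: the PrimeHelpers helper, the 131 cap, and the entry layout are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for the prime-based sizing idea; the real runtime has
// HashHelpers, but this naive stand-in keeps the sketch self-contained.
internal static class PrimeHelpers
{
    public static int NextPrime(int min)
    {
        for (int candidate = Math.Max(min, 2); ; candidate++)
        {
            bool isPrime = true;
            for (int d = 2; d * d <= candidate; d++)
            {
                if (candidate % d == 0) { isPrime = false; break; }
            }
            if (isPrime) return candidate;
        }
    }
}

internal sealed class Set<T>
{
    // null means "use EqualityComparer<T>.Default", enabling the devirtualized paths.
    private readonly IEqualityComparer<T>? _comparer;
    private int[] _buckets;   // holds 1-based entry indices; 0 means "empty bucket"
    private Entry[] _entries;
    private int _count;

    private struct Entry
    {
        public int HashCode;
        public int Next; // 0-based index of the next entry in the chain, or -1
        public T Value;
    }

    // 4th item: size the initial tables from the source when its size is known,
    // instead of always starting at 7.
    public Set(IEqualityComparer<T>? comparer, int sourceSize = 0)
    {
        _comparer = ReferenceEquals(comparer, EqualityComparer<T>.Default) ? null : comparer;
        int size = Math.Clamp(PrimeHelpers.NextPrime(sourceSize / 4), 7, 131); // ballpark heuristic
        _buckets = new int[size];
        _entries = new Entry[size];
    }

    public bool Add(T value)
    {
        // 1st item: branch on the comparer instead of always calling _comparer.GetHashCode.
        int hashCode = value is null
            ? 0
            : (_comparer?.GetHashCode(value) ?? value.GetHashCode()) & 0x7FFFFFFF;

        // 2nd item: compute the bucket index once and reuse it below.
        int bucket = hashCode % _buckets.Length;
        for (int i = _buckets[bucket] - 1; i >= 0; i = _entries[i].Next)
        {
            if (_entries[i].HashCode == hashCode &&
                (_comparer is null
                    ? EqualityComparer<T>.Default.Equals(_entries[i].Value, value)
                    : _comparer.Equals(_entries[i].Value, value)))
            {
                return false; // already present
            }
        }

        if (_count == _entries.Length)
        {
            Resize();
            bucket = hashCode % _buckets.Length; // recompute only after an actual resize
        }

        int index = _count++;
        _entries[index].HashCode = hashCode;
        _entries[index].Value = value;
        _entries[index].Next = _buckets[bucket] - 1;
        _buckets[bucket] = index + 1;
        return true;
    }

    private void Resize()
    {
        // 3rd item: a full version would use HashHelpers.ExpandPrime(_count) here.
        int newSize = checked(_count * 2 + 1);
        var newEntries = new Entry[newSize];
        Array.Copy(_entries, newEntries, _count);
        var newBuckets = new int[newSize];
        for (int i = 0; i < _count; i++)
        {
            int b = newEntries[i].HashCode % newSize;
            newEntries[i].Next = newBuckets[b] - 1;
            newBuckets[b] = i + 1;
        }
        _buckets = newBuckets;
        _entries = newEntries;
    }
}
```
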
Analysis

Although the class is only used internally and not meant for general consumption, it's still possibly worthwhile to optimize it: it appears to have been intended as a heavily optimized type for quick behind-the-scenes data lifting, yet it may create minor performance bottlenecks in LINQ contexts (e.g. when using https://docs.microsoft.com/en-us/dotnet/api/system.linq.enumerable.distinct?view=net-5.0), especially in worst-case scenarios (a large number of consecutive longer iteration sequences, resulting in lots of reallocs/hash calculations).

@FyiurAmron FyiurAmron added the tenet-performance Performance related issue label Jan 19, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Linq untriaged New issue has not been triaged by the area owner labels Jan 19, 2021
@ghost

ghost commented Jan 19, 2021

Tagging subscribers to this area: @eiriktsarpalis
See info in area-owners.md if you want to be subscribed.

@eiriktsarpalis
Member

Hi @FyiurAmron, would you be interested in providing a prototype to evaluate potential improvements?

@FyiurAmron
Contributor Author

@eiriktsarpalis what benchmark/microbenchmark framework would you suggest to measure the results? Is https://github.com/dotnet/BenchmarkDotNet OK, or do you have any other internal/external tooling that would be of use here?

@eiriktsarpalis
Member

We use BenchmarkDotNet as well, cf. https://github.com/dotnet/performance
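
For completeness, a minimal sketch of what such a microbenchmark could look like with BenchmarkDotNet (the class, benchmark names, seed, and sizes are illustrative, not taken from dotnet/performance):

```csharp
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser] // produces the Gen 0/1/2 and Allocated columns seen in the tables below
public class DistinctBenchmarks
{
    private int[] _data = Array.Empty<int>();

    [Params(260, 1030, 200_000)]
    public int Size;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42); // fixed seed for reproducible runs
        _data = Enumerable.Range(0, Size).Select(_ => rng.Next()).ToArray();
    }

    [Benchmark]
    public int CoreLinqSet() => _data.Distinct().Count();
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<DistinctBenchmarks>();
}
```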

@FyiurAmron
Contributor Author

FyiurAmron commented Jan 22, 2021

@eiriktsarpalis I've created a quick project @ https://github.com/FyiurAmron/LinqSetPerf to show the range of possible perf improvements (currently from the 4th item only, i.e. taking the source size into account). I've refactored the code to be as isolated as possible and modified basically just two places: the System.Linq.Set c-tor, to accept an optional initial size, and DistinctIterator.FillSet(), to check whether _source has Count easily accessible (i.e. is an ICollection<TSource>) and, if so, use it for the initial alloc; see the sketch below. In the worst (degenerate) case, this could allocate up to n memory instead of the current minimum of 7, assuming the source is just a single element repeated (without any real speed loss anyway); in the "best" case (from about 50% up to all elements distinct), the final allocated memory is the same, with a big perf boost and far fewer allocs in the meantime. Setting some reasonable upper bound for the initial alloc size should help deal with the degenerate cases.
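
A rough sketch of the FillSet() change described above (the int-taking Set c-tor is the hypothetical modification, not the current stdlib API; _source and _comparer are the iterator's existing fields):

```csharp
private Set<TSource> FillSet()
{
    // Use the source's Count as a sizing hint when it's cheaply available.
    Set<TSource> set = _source is ICollection<TSource> collection
        ? new Set<TSource>(_comparer, collection.Count)
        : new Set<TSource>(_comparer);

    foreach (TSource element in _source)
    {
        set.Add(element);
    }
    return set;
}
```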

This of course only deals with Distinct; Except, Intersect & Union have the same deficiency in their new Set<TSource>() calls. Arguably, for small data sets a higher initial alloc is never a real problem, but the speed penalty is (ops on small data sets are expected to be run repeatedly). I'd say that for "big" data sets (> 1 MiB, i.e. ca. 100k elems?), being more conservative with the initial alloc would be sensible, since hitting a mem limit (or just wasting memory needlessly) is possible.

Also, FWIW, 7 is an extremely low initial alloc anyway. For a general-purpose hash map it may be sensible, since the first reallocs are quite fast and the expected growth is small. Here, not so much.

For a random int[] with different sizes (CoreLinqSet calls the stdlib directly and is provided only for rough reference; OldSet is the very same implementation hoisted out of the stdlib into user code; NewSet is the new implementation; the latter two are to be compared directly):

size = 260

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 7.070 us | 0.0848 us | 0.0793 us | 5.6458 | - | - | 17.33 KB |
| OldSet | 6.821 us | 0.0824 us | 0.0688 us | 5.6458 | - | - | 17.33 KB |
| NewSet | 4.805 us | 0.0321 us | 0.0300 us | 1.7242 | - | - | 5.29 KB |

size = 1030

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 30.41 us | 0.068 us | 0.063 us | 22.2168 | - | - | 68.41 KB |
| OldSet | 29.35 us | 0.060 us | 0.056 us | 22.2168 | - | - | 68.41 KB |
| NewSet | 20.33 us | 0.041 us | 0.036 us | 6.6223 | - | - | 20.33 KB |

size = 200000

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 7.670 ms | 0.0188 ms | 0.0176 ms | 1062.5000 | 1007.8125 | 992.1875 | 8.76 MB |
| OldSet | 7.410 ms | 0.0137 ms | 0.0121 ms | 1062.5000 | 1007.8125 | 992.1875 | 8.76 MB |
| NewSet | 6.426 ms | 0.1018 ms | 0.0953 ms | 656.2500 | 656.2500 | 656.2500 | 3.82 MB |

I'd say that the expected performance gains for those cases are high to very high, both in CPU & mem.

Now, for the not-so-distinct arrays: (forced 4x repetition of elements)

size = 260

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 3.920 us | 0.0060 us | 0.0056 us | 1.4648 | - | - | 4.49 KB |
| OldSet | 3.801 us | 0.0051 us | 0.0045 us | 1.4648 | - | - | 4.49 KB |
| NewSet | 3.560 us | 0.0027 us | 0.0024 us | 1.4763 | - | - | 4.53 KB |

size = 1030

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 16.42 us | 0.018 us | 0.017 us | 5.6458 | - | - | 17.32 KB |
| OldSet | 15.92 us | 0.028 us | 0.026 us | 5.6458 | - | - | 17.32 KB |
| NewSet | 13.79 us | 0.030 us | 0.028 us | 5.6458 | - | - | 17.31 KB |

size = 200000

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 3.572 ms | 0.0329 ms | 0.0307 ms | 570.3125 | 515.6250 | 500.0000 | 2.19 MB |
| OldSet | 3.371 ms | 0.0081 ms | 0.0072 ms | 570.3125 | 515.6250 | 500.0000 | 2.19 MB |
| NewSet | 3.113 ms | 0.0125 ms | 0.0104 ms | 203.1250 | 203.1250 | 203.1250 | 3.24 MB |

A slight CPU speed increase, with non-obvious mem differences depending on the exact dataset. Still, I'd say that a set where only 25% of 200000 elements are distinct is somewhat of a ballpark "edge" for non-degenerate data.

Finally, a completely degenerate case (a 0-filled array):

size = 260

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 3.094 us | 0.0052 us | 0.0048 us | 0.1068 | - | - | 344 B |
| OldSet | 3.160 us | 0.0048 us | 0.0042 us | 0.1068 | - | - | 344 B |
| NewSet | 3.225 us | 0.0051 us | 0.0048 us | 1.3962 | - | - | 4384 B |

size = 1030

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 11.92 us | 0.013 us | 0.012 us | 0.1068 | - | - | 344 B |
| OldSet | 12.26 us | 0.028 us | 0.023 us | 0.1068 | - | - | 344 B |
| NewSet | 12.52 us | 0.023 us | 0.020 us | 5.3101 | - | - | 16704 B |

size = 200000

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 2.293 ms | 0.0019 ms | 0.0018 ms | - | - | - | 344 B |
| OldSet | 2.385 ms | 0.0042 ms | 0.0039 ms | - | - | - | 344 B |
| NewSet | 2.392 ms | 0.0020 ms | 0.0019 ms | 996.0938 | 996.0938 | 996.0938 | 3200224 B |

As expected, there's no statistically significant difference in speed here, but the mem difference is visible and can become important if the size of the collection is high.

FWIW, I expect the first 3 items in the original perf improvement list to give only a small boost in speed, but without any real tradeoffs.

@FyiurAmron
Contributor Author

Also, for reference, this is the speed difference from just using an actual HashSet instead of Distinct + the internal Set (it can't be compared directly, of course, but it shows the range of possible improvements nevertheless):

random data, size = 260

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 30.54 us | 0.169 us | 0.158 us | 22.2168 | - | - | 68.41 KB |
| OldSet | 29.40 us | 0.518 us | 0.459 us | 22.2168 | - | - | 68.41 KB |
| NewSet | 19.90 us | 0.049 us | 0.041 us | 6.6223 | - | - | 20.33 KB |
| HashSet | 14.31 us | 0.040 us | 0.038 us | 6.9885 | - | - | 21.44 KB |

4 reps of data in the set, size = 200000

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 3.499 ms | 0.0178 ms | 0.0149 ms | 570.3125 | 515.6250 | 500.0000 | 2.19 MB |
| OldSet | 3.373 ms | 0.0165 ms | 0.0147 ms | 570.3125 | 515.6250 | 500.0000 | 2.19 MB |
| NewSet | 3.175 ms | 0.0371 ms | 0.0347 ms | 289.0625 | 289.0625 | 289.0625 | 3.24 MB |
| HashSet | 2.536 ms | 0.0524 ms | 0.1544 ms | 988.2813 | 988.2813 | 988.2813 | 4.43 MB |

@eiriktsarpalis eiriktsarpalis self-assigned this Jan 29, 2021
@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Mar 12, 2021
@stephentoub
Member

Closed by #49591

@ghost ghost locked as resolved and limited conversation to collaborators Apr 14, 2021