possible perf optimizations for System.Linq.Set & System.Linq.Parallel.Set #47173

Closed
FyiurAmron opened this issue Jan 19, 2021 · 7 comments

@FyiurAmron
Contributor

FyiurAmron commented Jan 19, 2021

As in the title: I think it'd be worthwhile to at least investigate possible performance optimizations for System.Linq.Set (and possibly System.Linq.Parallel.Set as well).

Description

As mentioned by @danmosemsft in #37180 (comment), there's a possible performance improvement from aligning it with the improvements already present in HashSet/Dictionary.

Data

Things/differences that come to mind here as possible optimizations:

  • A default Comparer is created and used even when none is needed/provided. Branching on comparer == null instead would allow using the potentially faster value.GetHashCode() in place of InternalGetHashCode(value), which further delegates to _comparer.GetHashCode(value) (and, in this case, basically just calls value.GetHashCode() anyway). Also, please note that EqualityComparer<T>.Default.Equals doesn't devirtualize in a shared generic (#10050), which can currently hurt performance further.
  • The bucket index (hashCode % _buckets.Length) is calculated twice (with no caching) in Find/Add, even when no bucket resize has happened, and there's no provision for https://github.com/stephentoub/runtime/blob/master/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/HashSet.cs#L305
  • It doesn't use HashHelpers.ExpandPrime(_count) etc., so it may have a noticeably higher collision rate, impacting performance.
  • It doesn't take the size of the source collection into account, in cases like https://github.com/dotnet/runtime/blob/master/src/libraries/System.Linq/src/System/Linq/Distinct.cs#L43 , which I suppose was intended to save memory. OTOH, it causes tons of new & Array.Copy reallocs when iterating any collection of non-trivial size. The new size of _count * 2 + 1 is also a smaller number than the one you'd get from a prime series (which usually goes slightly above it), which further increases the number of allocs. While it's reasonable to assume that Distinct will usually be iterated for less than the size of the original collection, using a more reasonable initial threshold than simply 7 (e.g. Math.Clamp(NextPrime(initialCollectionSize / 4), 7, 131), or any sensible ballpark figure along this line) would decrease the number of early allocs significantly without allocating too much bucket memory in advance; see the sketch after this list.

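To make the above concrete, here's a minimal sketch of how a reworked Set<T> core could combine these points (null-comparer branching, a cached bucket index, prime-based resizing, and source-size-based initial sizing). This is illustrative only, not the actual runtime code: the PrimeHelpers helper, the 131 cap, and the entry layout are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for the prime-based sizing idea; the real runtime has
// HashHelpers, but this naive stand-in keeps the sketch self-contained.
internal static class PrimeHelpers
{
    public static int NextPrime(int min)
    {
        for (int candidate = Math.Max(min, 2); ; candidate++)
        {
            bool isPrime = true;
            for (int d = 2; d * d <= candidate; d++)
            {
                if (candidate % d == 0) { isPrime = false; break; }
            }
            if (isPrime) return candidate;
        }
    }
}

internal sealed class Set<T>
{
    // null means "use EqualityComparer<T>.Default", enabling the devirtualized paths.
    private readonly IEqualityComparer<T>? _comparer;
    private int[] _buckets;   // holds 1-based entry indices; 0 means "empty bucket"
    private Entry[] _entries;
    private int _count;

    private struct Entry
    {
        public int HashCode;
        public int Next; // 0-based index of the next entry in the chain, or -1
        public T Value;
    }

    // 4th item: size the initial tables from the source when its size is known,
    // instead of always starting at 7.
    public Set(IEqualityComparer<T>? comparer, int sourceSize = 0)
    {
        _comparer = ReferenceEquals(comparer, EqualityComparer<T>.Default) ? null : comparer;
        int size = Math.Clamp(PrimeHelpers.NextPrime(sourceSize / 4), 7, 131); // ballpark heuristic
        _buckets = new int[size];
        _entries = new Entry[size];
    }

    public bool Add(T value)
    {
        // 1st item: branch on the comparer instead of always calling _comparer.GetHashCode.
        int hashCode = value is null
            ? 0
            : (_comparer?.GetHashCode(value) ?? value.GetHashCode()) & 0x7FFFFFFF;

        // 2nd item: compute the bucket index once and reuse it below.
        int bucket = hashCode % _buckets.Length;
        for (int i = _buckets[bucket] - 1; i >= 0; i = _entries[i].Next)
        {
            if (_entries[i].HashCode == hashCode &&
                (_comparer is null
                    ? EqualityComparer<T>.Default.Equals(_entries[i].Value, value)
                    : _comparer.Equals(_entries[i].Value, value)))
            {
                return false; // already present
            }
        }

        if (_count == _entries.Length)
        {
            Resize();
            bucket = hashCode % _buckets.Length; // recompute only after an actual resize
        }

        int index = _count++;
        _entries[index].HashCode = hashCode;
        _entries[index].Value = value;
        _entries[index].Next = _buckets[bucket] - 1;
        _buckets[bucket] = index + 1;
        return true;
    }

    private void Resize()
    {
        // 3rd item: a full version would use HashHelpers.ExpandPrime(_count) here.
        int newSize = checked(_count * 2 + 1);
        var newEntries = new Entry[newSize];
        Array.Copy(_entries, newEntries, _count);
        var newBuckets = new int[newSize];
        for (int i = 0; i < _count; i++)
        {
            int b = newEntries[i].HashCode % newSize;
            newEntries[i].Next = newBuckets[b] - 1;
            newBuckets[b] = i + 1;
        }
        _buckets = newBuckets;
        _entries = newEntries;
    }
}
```
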
Analysis

Although the class is only used internally and not meant for general consumption, it's still possibly worthwhile to optimize it: it appears to have been intended as a heavily optimized type for quick behind-the-scenes data lifting, yet it may create minor performance bottlenecks in LINQ contexts (e.g. when using https://docs.microsoft.com/en-us/dotnet/api/system.linq.enumerable.distinct?view=net-5.0), especially in worst-case scenarios (a large number of consecutive longer iteration sequences, resulting in lots of reallocs/hash calculations).

@FyiurAmron FyiurAmron added the tenet-performance Performance related issue label Jan 19, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Linq untriaged New issue has not been triaged by the area owner labels Jan 19, 2021
@ghost

ghost commented Jan 19, 2021

Tagging subscribers to this area: @eiriktsarpalis
See info in area-owners.md if you want to be subscribed.

@eiriktsarpalis
Member

Hi @FyiurAmron, would you be interested in providing a prototype to evaluate potential improvements?

@FyiurAmron
Contributor Author

@eiriktsarpalis what benchmark/microbenchmark framework would you suggest to measure the results? Is https://github.com/dotnet/BenchmarkDotNet OK, or do you have any other internal/external tooling that would be of use here?

@eiriktsarpalis
Member

We use BenchmarkDotNet as well, cf. https://github.com/dotnet/performance
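
For completeness, a minimal sketch of what such a microbenchmark could look like with BenchmarkDotNet (the class, benchmark names, seed, and sizes are illustrative, not taken from dotnet/performance):

```csharp
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser] // produces the Gen 0/1/2 and Allocated columns seen in the tables below
public class DistinctBenchmarks
{
    private int[] _data = Array.Empty<int>();

    [Params(260, 1030, 200_000)]
    public int Size;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42); // fixed seed for reproducible runs
        _data = Enumerable.Range(0, Size).Select(_ => rng.Next()).ToArray();
    }

    [Benchmark]
    public int CoreLinqSet() => _data.Distinct().Count();
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<DistinctBenchmarks>();
}
```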

@FyiurAmron
Contributor Author

FyiurAmron commented Jan 22, 2021

@eiriktsarpalis I've created a quick project @ https://github.com/FyiurAmron/LinqSetPerf to show the range of possible perf improvements (currently from the 4th item only, i.e. taking the source size into account). I've refactored the code to be as isolated as possible and modified basically just two places: the System.Linq.Set c-tor, to accept an optional initial size, and DistinctIterator.FillSet(), to check whether _source has Count easily accessible (i.e. is an ICollection<TSource>) and, if so, use it for the initial alloc; see the sketch below. In the worst (degenerate) case, this could allocate up to n memory instead of the current minimum of 7, assuming the source is just a single element repeated (without any real speed loss anyway); in the "best" case (from about 50% up to all elements distinct), the final allocated memory is the same, with a big perf boost and far fewer allocs in the meantime. Setting some reasonable upper bound for the initial alloc size should help deal with the degenerate cases.
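
A rough sketch of the FillSet() change described above (the int-taking Set c-tor is the hypothetical modification, not the current stdlib API; _source and _comparer are the iterator's existing fields):

```csharp
private Set<TSource> FillSet()
{
    // Use the source's Count as a sizing hint when it's cheaply available.
    Set<TSource> set = _source is ICollection<TSource> collection
        ? new Set<TSource>(_comparer, collection.Count)
        : new Set<TSource>(_comparer);

    foreach (TSource element in _source)
    {
        set.Add(element);
    }
    return set;
}
```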

This of course only deals with Distinct; Except, Intersect & Union have the same deficiency in their new Set<TSource>() calls. Arguably, for small data sets a higher initial alloc is never a real problem, but the speed penalty is (ops on small data sets are expected to be run repeatedly). I'd say that for "big" data sets (> 1 MiB, i.e. ca. 100k elems?), being more conservative with the initial alloc would be sensible, since hitting a mem limit (or just wasting memory needlessly) is possible.

Also, FWIW, 7 is an extremely low initial alloc anyway. For a general-purpose hash map it may be sensible, since the first reallocs are quite fast and the expected growth is small. Here, not so much.

For a random int[] with different sizes (CoreLinqSet calls the stdlib directly and is provided only for rough reference; OldSet is the very same implementation hoisted out of the stdlib into user code; NewSet is the new implementation; the latter two are to be compared directly):

size = 260

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 7.070 us | 0.0848 us | 0.0793 us | 5.6458 | - | - | 17.33 KB |
| OldSet | 6.821 us | 0.0824 us | 0.0688 us | 5.6458 | - | - | 17.33 KB |
| NewSet | 4.805 us | 0.0321 us | 0.0300 us | 1.7242 | - | - | 5.29 KB |

size = 1030

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 30.41 us | 0.068 us | 0.063 us | 22.2168 | - | - | 68.41 KB |
| OldSet | 29.35 us | 0.060 us | 0.056 us | 22.2168 | - | - | 68.41 KB |
| NewSet | 20.33 us | 0.041 us | 0.036 us | 6.6223 | - | - | 20.33 KB |

size = 200000

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 7.670 ms | 0.0188 ms | 0.0176 ms | 1062.5000 | 1007.8125 | 992.1875 | 8.76 MB |
| OldSet | 7.410 ms | 0.0137 ms | 0.0121 ms | 1062.5000 | 1007.8125 | 992.1875 | 8.76 MB |
| NewSet | 6.426 ms | 0.1018 ms | 0.0953 ms | 656.2500 | 656.2500 | 656.2500 | 3.82 MB |

I'd say that the expected performance gains for those cases are high to very high, both in CPU & mem.

Now, for the not-so-distinct arrays: (forced 4x repetition of elements)

size = 260

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 3.920 us | 0.0060 us | 0.0056 us | 1.4648 | - | - | 4.49 KB |
| OldSet | 3.801 us | 0.0051 us | 0.0045 us | 1.4648 | - | - | 4.49 KB |
| NewSet | 3.560 us | 0.0027 us | 0.0024 us | 1.4763 | - | - | 4.53 KB |

size = 1030

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 16.42 us | 0.018 us | 0.017 us | 5.6458 | - | - | 17.32 KB |
| OldSet | 15.92 us | 0.028 us | 0.026 us | 5.6458 | - | - | 17.32 KB |
| NewSet | 13.79 us | 0.030 us | 0.028 us | 5.6458 | - | - | 17.31 KB |

size = 200000

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 3.572 ms | 0.0329 ms | 0.0307 ms | 570.3125 | 515.6250 | 500.0000 | 2.19 MB |
| OldSet | 3.371 ms | 0.0081 ms | 0.0072 ms | 570.3125 | 515.6250 | 500.0000 | 2.19 MB |
| NewSet | 3.113 ms | 0.0125 ms | 0.0104 ms | 203.1250 | 203.1250 | 203.1250 | 3.24 MB |

A slight CPU speed increase, with non-obvious mem differences depending on the exact dataset. Still, I'd say that a set where only 25% of 200000 elements are distinct is somewhat of a ballpark "edge" for non-degenerate data.

Finally, a completely degenerate case (a 0-filled array):

size = 260

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 3.094 us | 0.0052 us | 0.0048 us | 0.1068 | - | - | 344 B |
| OldSet | 3.160 us | 0.0048 us | 0.0042 us | 0.1068 | - | - | 344 B |
| NewSet | 3.225 us | 0.0051 us | 0.0048 us | 1.3962 | - | - | 4384 B |

size = 1030

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 11.92 us | 0.013 us | 0.012 us | 0.1068 | - | - | 344 B |
| OldSet | 12.26 us | 0.028 us | 0.023 us | 0.1068 | - | - | 344 B |
| NewSet | 12.52 us | 0.023 us | 0.020 us | 5.3101 | - | - | 16704 B |

size = 200000

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 2.293 ms | 0.0019 ms | 0.0018 ms | - | - | - | 344 B |
| OldSet | 2.385 ms | 0.0042 ms | 0.0039 ms | - | - | - | 344 B |
| NewSet | 2.392 ms | 0.0020 ms | 0.0019 ms | 996.0938 | 996.0938 | 996.0938 | 3200224 B |

As expected, there's no statistically significant difference in speed here, but the mem difference is visible and can become important if the size of the collection is high.

FWIW, I expect the first 3 items in the original perf improvement list to give only a small boost in speed, but without any real tradeoffs.

@FyiurAmron
Contributor Author

Also, for reference, this is the speed difference from just using an actual HashSet instead of Distinct + the internal Set (it can't be compared directly, of course, but it shows the range of possible improvements nevertheless):

random data, size = 260

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 30.54 us | 0.169 us | 0.158 us | 22.2168 | - | - | 68.41 KB |
| OldSet | 29.40 us | 0.518 us | 0.459 us | 22.2168 | - | - | 68.41 KB |
| NewSet | 19.90 us | 0.049 us | 0.041 us | 6.6223 | - | - | 20.33 KB |
| HashSet | 14.31 us | 0.040 us | 0.038 us | 6.9885 | - | - | 21.44 KB |

4 reps of data in the set, size = 200000

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| CoreLinqSet | 3.499 ms | 0.0178 ms | 0.0149 ms | 570.3125 | 515.6250 | 500.0000 | 2.19 MB |
| OldSet | 3.373 ms | 0.0165 ms | 0.0147 ms | 570.3125 | 515.6250 | 500.0000 | 2.19 MB |
| NewSet | 3.175 ms | 0.0371 ms | 0.0347 ms | 289.0625 | 289.0625 | 289.0625 | 3.24 MB |
| HashSet | 2.536 ms | 0.0524 ms | 0.1544 ms | 988.2813 | 988.2813 | 988.2813 | 4.43 MB |

@eiriktsarpalis eiriktsarpalis self-assigned this Jan 29, 2021
@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Mar 12, 2021
@stephentoub
Member

Closed by #49591

@ghost ghost locked as resolved and limited conversation to collaborators Apr 14, 2021