Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Enumerable.Distinct<T> performance like HashSet<T> #42760

Closed
NetMage opened this issue Sep 25, 2020 · 6 comments
Closed

Improve Enumerable.Distinct<T> performance like HashSet<T> #42760

NetMage opened this issue Sep 25, 2020 · 6 comments
Assignees
Labels
area-System.Linq tenet-performance Performance related issue
Milestone

Comments

@NetMage
Copy link

NetMage commented Sep 25, 2020

Description

The improvements to HashSet means that the light weight Set internal class used by Distinct is slower than implementing your own Distinct with HashSet. I suggest that Set be improved similarly, or that it be replaced with HashSet.

Configuration

.Net Core 5.0.0-rc.1.20451.14
Windows 10 v1909
x64

Data

Here are some rough benchmarks comparing the performance of Enumerable.Distinct with MyDistinct which basically uses a copy of DistinctIterator with Set replaced with HashSet and MyDistinct2 which is

public static IEnumerable<T> MyDistinct2<T>(this IEnumerable<T> items, IEqualityComparer<T> comparer = null) {
    foreach (var item in items.ToHashSet(comparer))
        yield return item;
}

Here are the benchmark results. They were run over millions of random ints from 1 to 105 for Lots of collisions, 1 to int.MaxValue for Few collisions, using ToList then Select(x => x). The ToList was to pre-compute the ToString conversions for the string and object tests (which was done by casting the string to object). The Select was to prevent any IList optimizations.

As can be seen, the library Distinct method is always slowest, with MyDistinct2 either fastest or close to MyDistinct for fastest. I am not sure why foreach is so much faster than DistinctIterator for int.

Timings (ms) Type | Collisions
int object (string) string
Method Few Ratio Lots Ratio Few Ratio Lots Ratio Few Ratio Lots Ratio
Distinct 2,680 1.29 1,891 1.32 3,289 1.07 3,985 1.12 3,291 1.05 5,118 1.54
MyDistinct 2,549 1.23 1,453 1.02 3,082 1.00 3,686 1.04 3,147 1.00 3,334 1.00
MyDistinct2 2,071 1.00 1,429 1.00 3,189 1.03 3,556 1.00 3,331 1.06 3,525 1.06

Analysis

I believe the Set class Add method should be modified to be similar to the HashSet.AddIfNotPresent().

@NetMage NetMage added the tenet-performance Performance related issue label Sep 25, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Collections untriaged New issue has not been triaged by the area owner labels Sep 25, 2020
@ghost
Copy link

ghost commented Sep 25, 2020

Tagging subscribers to this area: @eiriktsarpalis, @jeffhandley
See info in area-owners.md if you want to be subscribed.

@En3Tho
Copy link
Contributor

En3Tho commented Sep 27, 2020

The code of your benchmarks will prove that you're doing it right much more than a thousand words :) Also, can you please convert it to use BenchMarkDotNet?

@ghost
Copy link

ghost commented Sep 28, 2020

Tagging subscribers to this area: @eiriktsarpalis, @jeffhandley
See info in area-owners.md if you want to be subscribed.

@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Sep 29, 2020
@stephentoub stephentoub added this to the Future milestone Sep 29, 2020
@adamsitnik
Copy link
Member

Hi @NetMage

Thank you for your proposal. I've given it a try and measured using the following benchmarks:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Generic;
using System.Linq;

namespace SetOrHashSet
{
    class Program
    {
        static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }

    [GenericTypeArguments(typeof(int))]
    [GenericTypeArguments(typeof(string))]
    public class BenchmarksRefactor<T>
    {
        [Params(10, 1000)]
        public int Size { get; set; }

        [Params(true, false)]
        public bool Unique { get; set; }

        private T[] _array;

        [GlobalSetup]
        public void Setup()
        {
            if (typeof(T) == typeof(int) && Unique)
                _array = (T[])(object)(Enumerable.Range(0, Size).ToArray());
            else if (typeof(T) == typeof(int) && !Unique)
                _array = (T[])(object)Enumerable.Range(0, Size / 2).Concat(Enumerable.Range(0, Size / 2)).ToArray();
            else if (typeof(T) == typeof(string) && Unique)
                _array = (T[])(object)Enumerable.Range(0, Size).Select(x => x.ToString()).ToArray();
            else if (typeof(T) == typeof(string) && !Unique)
                _array = (T[])(object)Enumerable.Range(0, Size / 2).Concat(Enumerable.Range(0, Size / 2)).Select(x => x.ToString()).ToArray();
        }

        [Benchmark]
        public void Enumerate()
        {
            foreach (var item in _array.Distinct())
            {
            }
        }

        [Benchmark]
        public int Count() => _array.Distinct().Count();

        [Benchmark]
        public T[] ToArray() => _array.Distinct().ToArray();

        [Benchmark]
        public List<T> ToList() => _array.Distinct().ToList();
    }
}

The results I got:

For Int32:

Method Toolchain Size Unique Mean Ratio Gen 0 Allocated
Enumerate \after\CoreRun.exe 10 False 268.5 ns 1.05 0.0551 440 B
Enumerate \before\CoreRun.exe 10 False 257.6 ns 1.00 0.0394 312 B
Count \after\CoreRun.exe 10 False 227.7 ns 1.10 0.0557 440 B
Count \before\CoreRun.exe 10 False 207.4 ns 1.00 0.0394 312 B
ToArray \after\CoreRun.exe 10 False 252.8 ns 1.19 0.0618 488 B
ToArray \before\CoreRun.exe 10 False 211.9 ns 1.00 0.0458 360 B
ToList \after\CoreRun.exe 10 False 255.8 ns 1.12 0.0659 520 B
ToList \before\CoreRun.exe 10 False 227.7 ns 1.00 0.0495 392 B
Enumerate \after\CoreRun.exe 10 True 358.8 ns 1.05 0.0973 768 B
Enumerate \before\CoreRun.exe 10 True 342.4 ns 1.00 0.0766 608 B
Count \after\CoreRun.exe 10 True 301.8 ns 1.03 0.0975 768 B
Count \before\CoreRun.exe 10 True 293.2 ns 1.00 0.0764 608 B
ToArray \after\CoreRun.exe 10 True 336.7 ns 1.10 0.1053 832 B
ToArray \before\CoreRun.exe 10 True 305.2 ns 1.00 0.0848 672 B
ToList \after\CoreRun.exe 10 True 343.7 ns 1.09 0.1100 864 B
ToList \before\CoreRun.exe 10 True 315.0 ns 1.00 0.0896 704 B
Enumerate \after\CoreRun.exe 1000 False 17,882.7 ns 0.86 3.5242 27816 B
Enumerate \before\CoreRun.exe 1000 False 20,784.8 ns 1.00 2.0973 16680 B
Count \after\CoreRun.exe 1000 False 15,013.0 ns 0.83 3.4957 27816 B
Count \before\CoreRun.exe 1000 False 18,011.9 ns 1.00 2.0857 16680 B
ToArray \after\CoreRun.exe 1000 False 15,911.8 ns 0.86 3.7513 29840 B
ToArray \before\CoreRun.exe 1000 False 18,536.5 ns 1.00 2.3697 18704 B
ToList \after\CoreRun.exe 1000 False 15,634.4 ns 0.82 3.7613 29872 B
ToList \before\CoreRun.exe 1000 False 18,954.0 ns 1.00 2.3428 18736 B
Enumerate \after\CoreRun.exe 1000 True 25,926.9 ns 0.97 7.4751 58768 B
Enumerate \before\CoreRun.exe 1000 True 26,834.9 ns 1.00 4.1881 33104 B
Count \after\CoreRun.exe 1000 True 20,597.7 ns 0.96 7.4013 58768 B
Count \before\CoreRun.exe 1000 True 21,569.4 ns 1.00 4.1379 33104 B
ToArray \after\CoreRun.exe 1000 True 22,178.4 ns 0.98 7.9787 62792 B
ToArray \before\CoreRun.exe 1000 True 22,641.2 ns 1.00 4.6465 37128 B
ToList \after\CoreRun.exe 1000 True 22,226.4 ns 0.93 7.9225 62824 B
ToList \before\CoreRun.exe 1000 True 23,824.6 ns 1.00 4.6899 37160 B

For string:

Method Toolchain Size Unique Mean Ratio Gen 1 Allocated
Enumerate \after\CoreRun.exe 10 False 363.7 ns 1.04 - 472 B
Enumerate \before\CoreRun.exe 10 False 348.9 ns 1.00 - 336 B
Count \after\CoreRun.exe 10 False 330.1 ns 1.00 - 472 B
Count \before\CoreRun.exe 10 False 329.4 ns 1.00 - 336 B
ToArray \after\CoreRun.exe 10 False 366.5 ns 1.00 - 536 B
ToArray \before\CoreRun.exe 10 False 365.1 ns 1.00 - 400 B
ToList \after\CoreRun.exe 10 False 379.7 ns 1.02 - 568 B
ToList \before\CoreRun.exe 10 False 373.3 ns 1.00 - 432 B
Enumerate \after\CoreRun.exe 10 True 479.2 ns 1.06 - 864 B
Enumerate \before\CoreRun.exe 10 True 452.9 ns 1.00 - 688 B
Count \after\CoreRun.exe 10 True 412.6 ns 0.94 - 864 B
Count \before\CoreRun.exe 10 True 441.1 ns 1.00 - 688 B
ToArray \after\CoreRun.exe 10 True 492.3 ns 1.05 - 968 B
ToArray \before\CoreRun.exe 10 True 466.0 ns 1.00 - 792 B
ToList \after\CoreRun.exe 10 True 481.8 ns 1.04 - 1000 B
ToList \before\CoreRun.exe 10 True 461.5 ns 1.00 - 824 B
Enumerate \after\CoreRun.exe 1000 False 32,611.8 ns 0.80 0.2599 34584 B
Enumerate \before\CoreRun.exe 1000 False 40,815.2 ns 1.00 - 20688 B
Count \after\CoreRun.exe 1000 False 29,022.4 ns 0.75 0.2311 34584 B
Count \before\CoreRun.exe 1000 False 38,587.0 ns 1.00 - 20688 B
ToArray \after\CoreRun.exe 1000 False 30,860.3 ns 0.77 0.4032 38608 B
ToArray \before\CoreRun.exe 1000 False 40,318.0 ns 1.00 0.1603 24712 B
ToList \after\CoreRun.exe 1000 False 31,038.2 ns 0.76 0.3684 38640 B
ToList \before\CoreRun.exe 1000 False 40,649.5 ns 1.00 - 24744 B
Enumerate \after\CoreRun.exe 1000 True 42,554.2 ns 0.89 1.3587 73256 B
Enumerate \before\CoreRun.exe 1000 True 48,013.5 ns 1.00 0.3834 41200 B
Count \after\CoreRun.exe 1000 True 36,028.8 ns 0.84 1.4501 73256 B
Count \before\CoreRun.exe 1000 True 42,599.2 ns 1.00 0.3406 41200 B
ToArray \after\CoreRun.exe 1000 True 40,025.7 ns 0.90 1.6108 81280 B
ToArray \before\CoreRun.exe 1000 True 44,490.9 ns 1.00 0.5282 49224 B
ToList \after\CoreRun.exe 1000 True 39,923.2 ns 0.88 1.6846 81312 B
ToList \before\CoreRun.exe 1000 True 45,456.4 ns 1.00 0.5435 49256 B

If we switch to HashSet then the Distinct method:

  • for small inputs (like 10 elements above):
    • slows down, the regression is from 0 to 19% (see the Ratio column above)
    • allocates 30% more memory on average
  • for bigger inputs (like 1000 elements above)
    • speeds up, the improvement is from 3% to 25%
    • allocates even 65% more memory and GC collects memory up to three times more frequently.

If it was just about the CPU regression for small inputs, we could think about changing it. But 65% increase in allocated memory is not worth it. Having said that, I am going to close the issue.

Thanks,
Adam

@adamsitnik adamsitnik self-assigned this Nov 4, 2020
@GSPP
Copy link

GSPP commented Nov 4, 2020

I suggest that Set be improved similarly, or that it be replaced with HashSet.

Can we improve that internal class? Surely, HashSet is doing something right. Maybe that's because it received optimizations that the internal data structure does not have.

@adamsitnik
Copy link
Member

Can we improve that internal class? Surely, HashSet is doing something right. Maybe that's because it received optimizations that the internal data structure does not have.

Adding optimizations typically increases code complexity. This needs to be justified using measurements from real-world applications.

So far I've never observed the internal Set<T> type to be hot in any profiles so I don't think that we (.NET Team) should optimize it.

However, if someone is willing to study the optimizations applied to HashSet<T> and contributing them to Set<T> I am fine with that as long as it does not increase the complexity too much or is well justified.

Here you can find the docs that describe how to profile and benchmark local dotnet runtime builds:
https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-dotnet-runtime.md
https://github.com/dotnet/performance/blob/master/docs/profiling-workflow-dotnet-runtime.md

@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Linq tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

7 participants