Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Linq CountBy method #77716

Closed
epeshk opened this issue Nov 1, 2022 · 22 comments · Fixed by #91507
Closed

[API Proposal]: Linq CountBy method #77716

epeshk opened this issue Nov 1, 2022 · 22 comments · Fixed by #91507
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Linq
Milestone

Comments

@epeshk
Copy link
Contributor

epeshk commented Nov 1, 2022

Draft implementation: alternative design 2

Background and motivation

Currently, it is not so easy to count occurrences of IEnumerable<T> elements with existing LINQ methods in C#.

var arr = new[]{1, 2, 2, 3, 3, 3};

Dictionary<int, int> counts = arr
  .GroupBy(x => x)
  .ToDictionary(x => x.Key, x => x.Count());

Motivation

  • It is not efficient to use GroupBy for counting because it materializes Grouping objects with a list of values related to each key.
  • It is not intuitive that .Count() will not enumerate the sequence again.
  • Other popular languages provide simpler API for this

Other language examples

F#:

let arr = [|1; 2; 2; 3; 3; 3|]
let counts: seq<int * int> = arr |> Seq.countBy id

Python:

from collections import Counter

arr = [1, 2, 2, 3, 3, 3]
counts: Counter = Counter(arr)

Kotlin:

val arr = arrayOf(1, 2, 2, 3, 3, 3)
val counts: Map<Int, Int> = arr.groupingBy { it }.eachCount()

Benchmark

Benchmark results:

BenchmarkDotNet=v0.13.2, OS=Windows 10 (10.0.19044.2130/21H2/November2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-rc.2.22477.23
  [Host]     : .NET 7.0.0 (7.0.22.47203), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.47203), X64 RyuJIT AVX2
Method Mean Error StdDev Gen0 Gen1 Allocated
Grouping 1,031.7 us 18.61 us 17.41 us 60.5469 39.0625 1017.17 KB
LookupDict 18.919 ms 0.3766 ms 0.6790 ms 1343.7500 1312.5000 937.5000
Lookup 16.927 ms 0.3320 ms 0.4078 ms 625.0000 593.7500 218.7500
Counting 673.5 us 5.99 us 5.31 us - - 9.52 KB
CountingCast 635.3 us 2.42 us 2.26 us - - 7.25 KB

Benchmark for case without repetitions:
Seq = Enumerable.Range(0, SequenceLength).ToArray();

Method Mean Error StdDev Gen0 Gen1 Gen2 Allocated
Grouping 18.902 ms 0.3639 ms 0.4469 ms 1343.7500 1312.5000 937.5000 10.22 MB
LookupDict 22.138 ms 0.4110 ms 0.4037 ms 1468.7500 1437.5000 937.5000 16.15 MB
Lookup 14.548 ms 0.0455 ms 0.0380 ms 968.7500 953.1250 468.7500 10.39 MB
Counting 3.844 ms 0.0307 ms 0.0287 ms 1007.8125 992.1875 992.1875 4.21 MB
CountingCast 2.211 ms 0.0271 ms 0.0240 ms 652.3438 636.7188 636.7188 2.77 MB

API Proposal

namespace System.Linq;

public static partial class Enumerable
{
    public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;


    public static IEnumerable<KeyValuePair<TKey, long>> LongCountBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

API Usage

var array = new []{1, 2, 2, 3, 3, 3};

var counts = array.CountBy(x => x);

foreach (var (key, count) in counts)
    Console.WriteLine($"{key}: {count}");

Find any most frequent element:

enumerable
    .CountBy(x => x)
    .MaxBy(x => x.Value)
    .Key;

Find duplicates:

enumerable
    .CountBy(x => x)
    .Where(x => x.Value > 1)
    .Select(x => x.Key);

Alternative Designs

1. With separated result type, inspired by ILookup

namespace System.Linq;
public interface ICounts<TKey, TCount> : IEnumerable<(TKey Key, TCount Count)> where TKey : notnull
{
    int KeyCount { get; }

    TCount this[TKey key] { get; }

    bool Contains(TKey key);
}

public static partial class Enumerable
{
    public static ICounts<TKey, int> CountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector) where TKey : notnull;
    public static ICounts<TKey, int> CountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? keyComparer) where TKey : notnull;
    public static ICounts<TKey, long> LongCountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector) where TKey : notnull;
    public static ICounts<TKey, long> LongCountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? keyComparer) where TKey : notnull;
}

2. Same as previous, but without TKey : notnull constraint

namespace System.Linq;
public interface ICounts<TKey, TCount> : IEnumerable<(TKey Key, TCount Count)>
{
    int KeyCount { get; }

    TCount this[TKey key] { get; }

    bool Contains(TKey key);
}

public static partial class Enumerable
{
    public static ICounts<TKey, int> CountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector);
    public static ICounts<TKey, int> CountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? keyComparer);
    public static ICounts<TKey, long> LongCountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector);
    public static ICounts<TKey, long> LongCountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? keyComparer);
}
Find duplicates:
ICounts<TKey, int> counts = enumerable.CountBy(x => x);
IEnumerable<TKey> duplicates = counts
    .Where(x => x.Count > 1)
    .Select(x => x.Key);

3.

The extension method for IEnumerable<IGrouping<TKey, TValue>> with internal knowledge about GroupingEnumerator allowing to Count values without materializing lists in memory.

array.GroupBy(x => x).Counts();

Questions to decide:

  • Should we use IEnumerable<KeyValuePair<TKey, TValue>> as a return type or specific type instead for better naming, like IEnumerable<(TKey Key, TCount Count)>:
Find duplicates:
IEnumerable<KeyCountPair<TKey, int>> counts = enumerable.CountBy(x => x);
IEnumerable<TKey> duplicates = counts
    .Where(x => x.Count > 1)
    .Select(x => x.Key);
  • Should we return IEnumerable<> for simplicity or go with IDictioinary<>/IReadOnlyDictionary<>/ICounts<> to avoid copying? Specific return type could improve performance for ~x1.5 for cases when user needs a lookup.
    F# returns sequence, Kotlin — map, Python — special map-like type

  • Should TKey allow null? For example, ToLookup allow null and Dictionary not.

Risks

Looks like there is no C# method with the same name in dotnet organization projects.
https://github.com/search?q=org%3Adotnet+CountBy&type=code

@epeshk epeshk added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Nov 1, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Nov 1, 2022
@ghost
Copy link

ghost commented Nov 1, 2022

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

Currently, it is not so easy to count occurrences of IEnumerable<T> elements with existing LINQ methods in C#.

var arr = new[]{1, 2, 2, 3, 3, 3};

var counts = arr
  .GroupBy(x => x)
  .ToDictionary(x => x.Key, x => x.Count());

Motivation

  • It is not efficient to use GroupBy for counting because it materializes Grouping objects with a list of values related to each key.
  • It is not intuitive that .Count() will not enumerate the sequence again.
  • Other popular languages provide simpler API for this

Other language examples

F#:

let arr = [|1; 2; 2; 3; 3; 3|]
let counts = arr |> Seq.countBy id

Python:

from collections import Counter

arr = [1, 2, 2, 3, 3, 3]
counts = Counter(arr)

Kotlin:

val arr = arrayOf(1, 2, 2, 3, 3, 3)
val counts = arr.groupingBy { it }.eachCount()

Benchmark

Benchmark results:

BenchmarkDotNet=v0.13.2, OS=Windows 10 (10.0.19044.2130/21H2/November2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-rc.2.22477.23
  [Host]     : .NET 7.0.0 (7.0.22.47203), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.47203), X64 RyuJIT AVX2


|       Method |       Mean |    Error |   StdDev |    Gen0 |    Gen1 |  Allocated |
|------------- |-----------:|---------:|---------:|--------:|--------:|-----------:|
|     Grouping | 1,031.7 us | 18.61 us | 17.41 us | 60.5469 | 39.0625 | 1017.17 KB |
|     Counting |   673.5 us |  5.99 us |  5.31 us |       - |       - |    9.52 KB |
| CountingCast |   635.3 us |  2.42 us |  2.26 us |       - |       - |    7.25 KB |

API Proposal

namespace System.Collections.Generic;

public static partial class Enumerable
{
    public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull
}

API Usage

var array = new []{1, 2, 2, 3, 3, 3};

var counts = array.CountBy(x => x);

foreach (var (key, count) in counts)
    Console.WriteLine($"{key}: {count}");

Alternative Designs

// 1

namespace System.Collections.Generic;

public static partial class Enumerable
{
    public static IEnumerable<KeyValuePair<TKey, long>> LongCountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull
}

// 2
The extension method for IEnumerable<IGrouping<TKey, TValue>> with internal knowledge about GroupingEnumerator allowing to Count values without materializing lists in memory.

array.GroupBy(x => x).Counts();

Risks

Looks like there is no C# method with the same name in dotnet organization projects.
https://github.com/search?q=org%3Adotnet+CountBy&type=code

Author: epeshk
Assignees: -
Labels:

api-suggestion, area-System.Linq

Milestone: -

@theodorzoulias
Copy link
Contributor

The CountBy LINQ operator already exists in the MoreLinq package, offering the same functionality with an identical API.

public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer);

@epeshk
Copy link
Contributor Author

epeshk commented Nov 5, 2022

MinBy and MaxBy also existed in MoreLinq before adding to BCL

@theodorzoulias
Copy link
Contributor

@epeshk yep, true. I am just pointing out that if someone wants this functionality today, it is offered by a respectable third-party library. No need to wait until the advent of .NET 8 (assuming that this API proposal will be approved).

@krwq krwq added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Nov 23, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Nov 23, 2022
@krwq krwq added this to the Future milestone Nov 23, 2022
@krwq
Copy link
Member

krwq commented Nov 23, 2022

I'm marking this as Future for now because we don't have this on our plans for now. If anyone is willing to contribute implementation I will change it to 8.0 which should give API review higher priority

@krwq krwq added the help wanted [up-for-grabs] Good issue for external contributors label Nov 23, 2022
@epeshk
Copy link
Contributor Author

epeshk commented Nov 23, 2022

I want to contribute this feature and start working on it in https://github.com/epeshk/runtime/tree/countBy
Should I open PR now, before API review?

@eiriktsarpalis
Copy link
Member

Please hold until the issue is marked api-approved. Thanks!

@epeshk
Copy link
Contributor Author

epeshk commented Nov 23, 2022

Please also note return type and nullability concerns. It is probably good to discuss and remove bad options before review.

I think good choices for the return type are (ordered by my preferences):

  • ICounts<TKey, int> : IEnumerable<(TKey Key, int Count)>
  • IEnumerable<(TKey Key, int Count)>
  • IReadOnlyDictionary<TKey, int>
  • IEnumerable<KeyValuePair<TKey, int>>

This method also may or may not accept nulls as TKey. An implementation without nulls would be simpler, but we have to think about it in terms of usability, and e.g. similar ToLookup accept nulls.

@viceroypenguin
Copy link

viceroypenguin commented Nov 23, 2022

For source-level compatibility with MoreLinq and SuperLinq, it would be helpful to keep return type as either ICounts<TKey, int>, IEnumerable<(TKey Key, int Count)>, or IEnumerable<KeyValuePair<TKey, int>>. If this is done, then these libraries can simply disable their version of this operator starting in .net8 and consumers can be comfortable using the same code through multiple versions of .net/SuperLinq/MoreLinq.

@viceroypenguin
Copy link

Also, both MoreLinq and SuperLinq allow null values for TKey, if that impacts decision making at all.

@eiriktsarpalis eiriktsarpalis self-assigned this Nov 23, 2022
@eiriktsarpalis eiriktsarpalis removed the help wanted [up-for-grabs] Good issue for external contributors label Nov 23, 2022
@KieranDevvs
Copy link

KieranDevvs commented Dec 1, 2022

Seeing as OrderBy and OrderByDescending had overloads to remove their predicate in the event that you were selecting the input as the key, should that be applied here?

x.CountBy();
x.CountBy(y => y.Key);
x.LongCountBy();
x.LongCountBy(y => y.Key);

Also, I feel like prefixing the methods with Long removes their discoverability in intelisense. and in documentation.

@svick
Copy link
Contributor

svick commented Dec 1, 2022

@KieranDevvs Those versions of OrderBy also remove the By suffix, but Enumerable.Count() and Enumerable.LongCount() already exist and do something different than the CountBy proposed here.

And since this is a fairly niche method, I don't think it's important for it to have a more succinct way to write .CountBy(x => x).

@krwq @eiriktsarpalis Since @epeshk offered to contribute implementation, should this be moved back to the 8.0 milestone, to speed up API review?

@WeihanLi
Copy link
Contributor

WeihanLi commented Jun 1, 2023

Any chance to get it in for the .NET 8 release?

@eiriktsarpalis eiriktsarpalis removed their assignment Jun 1, 2023
@eiriktsarpalis eiriktsarpalis added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Jun 1, 2023
@eiriktsarpalis
Copy link
Member

@WeihanLi very unlikely at this point.

@eiriktsarpalis eiriktsarpalis added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Jun 1, 2023
@MizardX
Copy link

MizardX commented Aug 1, 2023

Name suggestions:

  • Counts() / CountsBy(keySelector)
  • GroupCount() / GroupCountBy(keySelector)

@eiriktsarpalis
Copy link
Member

A potential alternative is having an AggregateBy method:

public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TAccumulate, TKey>(
    this IEnumerable<TSource> source, 
    TAccumulate seed, 
    Func<TAccumulate, TKey, TSource, TAccumulate> func, 
    Func<TSource, TKey> keySelector);

which lets you express CountBy like so

source.AggregateBy(0, (count, _, _) => ++count, keySelector);

@viceroypenguin
Copy link

A potential alternative is having an AggregateBy method:

public static IEnumerable<KeyValuePair<TKey, TAccumulate>> AggregateBy<TSource, TAccumulate, TKey>(
    this IEnumerable<TSource> source, 
    TAccumulate seed, 
    Func<TAccumulate, TKey, TSource, TAccumulate> func, 
    Func<TSource, TKey> keySelector);

which lets you express CountBy like so

source.AggregateBy(0, (count, _, _) => ++count, keySelector);

I would suggest that the seed may be determined by the key, so I propose:

public static IEnumerable<(TKey key, TAccumulate result)> AggregateBy<TSource, TKey, TAccumulate>(
	this IEnumerable<TSource> source,
	Func<TSource, TKey> keySelector,
	Func<TKey, TAccumulate> seedSelector,
	Func<TAccumulate, TKey, TSource, TAccumulate> accumulator)

which lets you express CountBy like so

source.AggregateBy(_ => 0, (count, _, _) => ++count, keySelector);

@bartonjs
Copy link
Member

bartonjs commented Aug 3, 2023

Video

  • We don't think LongCountBy would be called often enough to warrant adding it at this time (based on low call counts of LongCount)
  • We think we'd be more likely to add AggregateBy than LongCountBy to solve the long-count-by needs, but also don't think that it's warranted at this time.
  • We added the IQueryable variant, consistent with how we maintain Enumerable and Queryable in general.
namespace System.Linq;

public static partial class Enumerable
{
    public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

public static partial class Queryable
{
    public static IQueryable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Aug 3, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 2, 2023
@eiriktsarpalis eiriktsarpalis added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-approved API was approved in API review, it can be implemented labels Sep 5, 2023
@eiriktsarpalis eiriktsarpalis self-assigned this Sep 5, 2023
@eiriktsarpalis
Copy link
Member

eiriktsarpalis commented Sep 5, 2023

Bringing back for API review since the question of eager vs. delayed evaluation was brought up, see

#91507 (comment)

Proposed alternative API with eager semantics:

namespace System.Linq;

public static partial class Enumerable
{
    public static IReadOnlyDictionary<TKey, int> CountBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

public static partial class Queryable
{
    public static IQueryable<IReadOnlyDictionary<TKey, int>> CountBy<TSource, TKey>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

@eiriktsarpalis eiriktsarpalis modified the milestones: Future, 9.0.0 Sep 5, 2023
@bartonjs
Copy link
Member

After a long discussion we ended up back at the previously-approved shape, with a delayed evaluation.

namespace System.Linq;

public static partial class Enumerable
{
    public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

public static partial class Queryable
{
    public static IQueryable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, TKey>> keySelector,
        IEqualityComparer<TKey>? keyComparer = null) where TKey : notnull;
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Sep 12, 2023
@MizardX
Copy link

MizardX commented Sep 12, 2023

For Max/MaxBy, Order/OrderBy, etc. the only difference is the predicate parameter, but they still do very similar work. You are trying to introduce a CountBy which does something related, but still very different than Count. A better name would be GroupCountBy which would allow you to also have a GroupCount method that would count duplicates.

@eiriktsarpalis
Copy link
Member

This was discussed, however CountBy is a fairly standard name in methods of this sort across ecosystems.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 14, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Oct 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Linq
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants