[Proposal] SortedDistinctBy #836

jods4 · 2022-03-11T16:59:05Z

Same API as DistinctBy, but assumes a sorted input (by the distinct key), so it doesn't have to keep a full hashmap of previously seen values.

Shelim · 2022-03-31T22:37:42Z

This is needed! There is no LINQ equivalent, and using Distinct on sorted data is so much worse in terms of big-O complexity...
Especially if you need to keep them sorted, as the official docs clearly state that the result sequence (of Distinct) is unordered. Therefore you may need to sort it again, driving the computation cost even higher.

atifaziz · 2022-04-01T09:10:01Z

I believe GroupAdjacent should give you what you're looking for. See also Exploring MoreLINQ Part 18 - GroupAdjacent and Segment by @markheath that includes a video.

The following example uses GroupAdjacent to average values per day, where the data is assumed to be sorted by day, which is the key:

var data = new[]
{
    new { Date = new DateTime(2010, 1, 1), Value = 1 },
    new { Date = new DateTime(2010, 1, 1), Value = 2 },
    new { Date = new DateTime(2010, 1, 1), Value = 3 },
    new { Date = new DateTime(2010, 1, 2), Value = 4 },
    new { Date = new DateTime(2010, 1, 2), Value = 5 },
    new { Date = new DateTime(2010, 1, 2), Value = 6 },
};

var q =
    from g in data.GroupAdjacent(e => e.Date)
    select new { Date = g.Key, Value = g.Average(e => e.Value) };

foreach (var e in q)
    Console.WriteLine(e.ToString());

Prints:

{ Date = 01/01/2010, Value = 2 }
{ Date = 02/01/2010, Value = 5 }

I'll close this assuming GroupAdjacent does the trick, and I'm happy to reopen and reconsider if I misunderstood the problem.

jods4 · 2022-04-01T11:39:53Z

@atifaziz GroupAdjacent is kind of similar but it returns a Key and enumerable of values, when all you want is the first value for each key.

You can get that as values.GroupAdjacent(x => x.Date).Select(x => x.First()) but it's a bit clumsy and not as efficient as a straightforward implementation of OrderedDistinctBy(x => x.Date).

Shelim · 2022-04-01T12:39:04Z

@atifaziz Close enough, but I believe it is still a bit more expensive than a directly implemented OrderedDistinctBy, even though they share the same big-O complexity. And way harder to find in docs actually, which led to my post above :)

atifaziz · 2022-04-01T15:00:10Z

> when all you want is the first value for each key.

What if you want the last value? What if you want a count of items under the key? Who decides? Assuming the first would be very limiting.

> You can get that as values.GroupAdjacent(x => x.Date).Select(x => x.First()) but it's a bit clumsy and not as efficient as a straightforward implementation of OrderedDistinctBy(x => x.Date).

It's not that it's clumsy, but perhaps wasteful to collect all values for a group only to throw away all but the first.

> And way harder to find in docs actually, which led to my post above :)

@Shelim That's a hard problem to solve.

I think what's being asked here is a variation of GroupAdjacent that would be called AggregateAdjacent. Instead of collecting values of a key, it would let the caller decide how to accumulate/aggregate/fold into a single result. Here's a prototype of what that could potentially look like:

public static IEnumerable<TResult> AggregateAdjacent<TSource, TKey, TAccumulator, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TKey, TSource, TAccumulator> seedSelector,
    Func<TKey, TAccumulator, TSource, TAccumulator> aggregator,
    Func<TKey, TAccumulator, TResult> resultSelector,
    IEqualityComparer<TKey>? comparer = null)
{
    comparer ??= EqualityComparer<TKey>.Default;

    using var item = source.GetEnumerator();
    
    if (!item.MoveNext())
        yield break;

    var key = keySelector(item.Current);
    var runKey = key;
    var accumulator = seedSelector(key, item.Current);

    while (item.MoveNext())
    {
        key = keySelector(item.Current);
        if (comparer.Equals(runKey, key))
        {
            accumulator = aggregator(runKey, accumulator, item.Current);
            continue;
        }
        else
        {
            yield return resultSelector(runKey, accumulator);
            runKey = key;
            accumulator = seedSelector(key, item.Current);
        }
    }

    yield return resultSelector(runKey, accumulator);
}

Here are some examples of uses:

var data = new[]
{
    new { Date = new DateTime(2010, 1, 1), Value = 1 },
    new { Date = new DateTime(2010, 1, 1), Value = 2 },
    new { Date = new DateTime(2010, 1, 2), Value = 4 },
    new { Date = new DateTime(2010, 1, 2), Value = 5 },
    new { Date = new DateTime(2010, 1, 2), Value = 6 },
    new { Date = new DateTime(2010, 1, 3), Value = 7 },
};

Console.WriteLine("First of each date:");
foreach (var item in data.AggregateAdjacent(e => e.Date, (_, e) => e, (_, a, _) => a, (_, a) => a))
    Console.WriteLine(item.ToString());

Console.WriteLine("Last of each date:");
foreach (var item in data.AggregateAdjacent(e => e.Date, (_, e) => e, (_, _, e) => e, (_, e) => e))
    Console.WriteLine(item.ToString());

Console.WriteLine("Count per date:");
foreach (var item in data.AggregateAdjacent(e => e.Date, (_, e) => 1, (_, a, _) => a + 1,
                                            (d, a) => new { Date = d, Count = a }))
    Console.WriteLine(item.ToString());

Prints:

First of each date:
{ Date = 01/01/2010 00:00:00, Value = 1 }
{ Date = 02/01/2010 00:00:00, Value = 4 }
{ Date = 03/01/2010 00:00:00, Value = 7 }
Last of each date:
{ Date = 01/01/2010 00:00:00, Value = 2 }
{ Date = 02/01/2010 00:00:00, Value = 6 }
{ Date = 03/01/2010 00:00:00, Value = 7 }
Count per date:
{ Date = 01/01/2010 00:00:00, Count = 2 }
{ Date = 02/01/2010 00:00:00, Count = 3 }
{ Date = 03/01/2010 00:00:00, Count = 1 }

To permit multiple aggregations efficiently in a single iteration, one could then use the same strategy as we did with Aggregate in #326.

jods4 · 2022-04-01T15:18:24Z

> What if you want the last value?

The semantics of the existing DistinctBy method are to return the first item for each distinct key.
It seems natural that OrderedDistinctBy has the same contract but is optimized for sorted sequences. DistinctBy needs to buffer all seen keys in a hash table to avoid yielding the same key twice; if the sequence is sorted, your state only needs the current key. That's O(N) versus O(1) of extra state.
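
To make the difference in state concrete, here is a minimal sketch. It is not MoreLINQ's source: `HashedDistinctBy` and `AdjacentDistinctBy` are hypothetical names chosen for illustration, and both use the default equality comparer. The first remembers every key it has seen, the second only the previous one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var sorted = new[] { "apple", "avocado", "banana", "blueberry", "cherry" };

// On input where equal keys are adjacent, both yield the first item per key.
Console.WriteLine(string.Join(", ", sorted.HashedDistinctBy(s => s[0])));
Console.WriteLine(string.Join(", ", sorted.AdjacentDistinctBy(s => s[0])));

static class Sketch
{
    // Hypothetical: remembers every key seen, so state grows with the number of distinct keys.
    public static IEnumerable<T> HashedDistinctBy<T, K>(this IEnumerable<T> source, Func<T, K> keySelector)
    {
        var seen = new HashSet<K>();
        foreach (var item in source)
            if (seen.Add(keySelector(item)))
                yield return item;
    }

    // Hypothetical: remembers only the previous key, so state is constant,
    // but the result is distinct only if equal keys are adjacent.
    public static IEnumerable<T> AdjacentDistinctBy<T, K>(this IEnumerable<T> source, Func<T, K> keySelector)
    {
        var comparer = EqualityComparer<K>.Default;
        var hasPrevious = false;
        K previous = default!;
        foreach (var item in source)
        {
            var key = keySelector(item);
            if (hasPrevious && comparer.Equals(previous, key))
                continue;
            previous = key;
            hasPrevious = true;
            yield return item;
        }
    }
}
```

Both calls return the same items on sorted input; only the bookkeeping differs.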

This is intuitive and quite useful.

If someone needs something else they should use another primitive like the AggregateAdjacent you've mentioned.
Or maybe file an issue to add LastDistinctBy and OrderedLastDistinctBy to MoreLINQ.

> What if you want a count of items under the key?

This is clearly not a question for DistinctBy. The Distinct family returns a sequence of unique items, not groups. If you want counts, averages, sums or whatnot, then you should turn to the GroupBy and/or Aggregate sets of APIs, such as the AggregateAdjacent examples you've made above.

I feel like you're expanding this into harder territory than need be.
MoreLINQ has several APIs that have optimized counterparts when the source sequence is assumed to be sorted.

DistinctBy is a useful method and this issue is only about adding an optimized OrderedDistinctBy that performs the same operation with less local state. It's not complicated and it's useful.

Having new higher-level APIs that perform optimized aggregations over sorted sequences is a great idea but it's a different, more complex thing that surely should have its own issue.

Shelim · 2022-04-04T10:18:42Z

Exactly as stated

The plain LINQ Distinct is also replaceable by GroupBy and then Select -> First. But it is optimized for this specific - and not so rare! - case of choosing distinct values. And that is the reason it exists in the first place.

Orace · 2022-05-11T14:33:16Z

What if the input is not sorted?
Is the fact that the input is sorted even relevant?
As an example, consider: [0, 0, 2, 2, 1, 1]

I propose:

1. A DistinctBy that takes an IOrderedEnumerable and implements the optimized algorithm. It can be an overload or a try-cast in the current implementation.
2. An equivalent of the Rx distinctUntilChanged method. I propose DistinctAdjacent and DistinctAdjacentBy.

jods4 · 2022-05-11T16:34:51Z

Adjacent could be another name for sure. 👍
It's important that the source is "sorted" in the sense that identical values are adjacent, otherwise the result will not be correct.

"Ordered" is a vague concept anyway when a comparator is not given... for all we know they could be sorted in descending order, or with their bits reversed, or anything really...

Orace · 2022-05-11T18:01:00Z

> It's important that the source is "sorted" in the sense that identical values are adjacent, otherwise the result will not be correct.

It's important for a use case of yours.
Again, should the method have to check for a sorted / ordered input? With the limitations that this check implies (a hash set of already encountered values, or comparable values, or a value comparer).
I think that the method should have a single purpose: collapse repeated adjacent values, whatever the input is.
The fact that it then matches your use case on a sorted input is just a happy accident.

jods4 · 2022-05-11T18:06:30Z

Yes, I agree.
The method shouldn't validate that the input is sorted, and in fact, as I mentioned before, it's impossible without taking an extra comparator parameter because you can't assume how it's sorted.

The expected result is totally deterministic: if you have repeated but non-adjacent values, this API would yield two items that have the same key (albeit separated by at least one other item).
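
As a small worked example of that behaviour, here is a sketch using `DistinctAdjacent`, the name Orace proposed above. The implementation is an assumption for illustration only, using the default equality comparer. It collapses runs of equal adjacent values and makes no attempt to detect unsorted input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Equal values are adjacent: every value appears once.
var grouped = new[] { 0, 0, 2, 2, 1, 1 }.DistinctAdjacent().ToArray();
Console.WriteLine(string.Join(", ", grouped)); // 0, 2, 1

// Equal values are not all adjacent: 0 and 1 each appear twice.
var scattered = new[] { 0, 1, 0, 0, 1 }.DistinctAdjacent().ToArray();
Console.WriteLine(string.Join(", ", scattered)); // 0, 1, 0, 1

static class Sketch
{
    // Hypothetical operator: yields an element only when it differs from the one before it.
    public static IEnumerable<T> DistinctAdjacent<T>(this IEnumerable<T> source)
    {
        var comparer = EqualityComparer<T>.Default;
        using var e = source.GetEnumerator();
        if (!e.MoveNext())
            yield break;
        var previous = e.Current;
        yield return previous;
        while (e.MoveNext())
        {
            if (comparer.Equals(previous, e.Current))
                continue;
            previous = e.Current;
            yield return previous;
        }
    }
}
```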

jods4 · 2022-05-11T18:11:39Z

For reference, here's the code we use for that in my project.
Feel free to copy. For a public API, maybe you'll want:

- Non-null parameter validation (internally we have nullable refs enabled so we don't care).
- An overload that takes a custom equality comparer (we didn't need one).
- For max perf you could drop the first variable and work with IEnumerator directly, but it's more involved and needs a using in case it's disposable. Not sure it makes much difference.

    public static IEnumerable<T> OrderedDistinctBy<T, K>(this IEnumerable<T> source, Func<T, K> selector)
    {
      var comparer = EqualityComparer<K>.Default;
      var first = true;
      var previous = default(K);
      foreach (var x in source)
      {
        if (first)
        {
          previous = selector(x);
          first = false;
          yield return x;
          continue;
        }

        var current = selector(x);
        if (!comparer.Equals(previous!, current))
        {
          previous = current;
          yield return x;
        }
      }
    }
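
For illustration, the three suggestions listed above could be combined roughly as follows. This is a sketch under assumptions, not a proposed MoreLINQ API: the optional `comparer` parameter and the local `Iterator` function are choices made here, and the method keeps the OrderedDistinctBy name only to match the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var words = new[] { "a", "A", "b", "B", "b", "c" };
var result = words.OrderedDistinctBy(w => w, StringComparer.OrdinalIgnoreCase).ToArray();
Console.WriteLine(string.Join(", ", result)); // a, b, c

static class Sketch
{
    public static IEnumerable<T> OrderedDistinctBy<T, K>(
        this IEnumerable<T> source,
        Func<T, K> selector,
        IEqualityComparer<K>? comparer = null)
    {
        // Validate eagerly, at call time, rather than on first enumeration.
        if (source is null) throw new ArgumentNullException(nameof(source));
        if (selector is null) throw new ArgumentNullException(nameof(selector));
        return Iterator(source, selector, comparer ?? EqualityComparer<K>.Default);

        static IEnumerable<T> Iterator(
            IEnumerable<T> source, Func<T, K> selector, IEqualityComparer<K> comparer)
        {
            // Working with the enumerator directly removes the "first" flag.
            using var e = source.GetEnumerator();
            if (!e.MoveNext())
                yield break;
            var previous = selector(e.Current);
            yield return e.Current;
            while (e.MoveNext())
            {
                var current = selector(e.Current);
                if (comparer.Equals(previous, current))
                    continue;
                previous = current;
                yield return e.Current;
            }
        }
    }
}
```

Splitting the method into a validating wrapper and an iterator means a null argument throws at the call site instead of when enumeration starts.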

viceroypenguin · 2022-07-05T17:00:35Z

@jods4 - FYI: This is already implemented in the System.Interactive package as .DistinctUntilChanged().

atifaziz · 2022-11-12T15:04:06Z

> @jods4 - FYI: This is already implemented in the System.Interactive package as .DistinctUntilChanged().

@jods4, @Shelim: I don't see the point in duplicating DistinctUntilChanged here since System.Interactive already offers what you're looking for and the more generic version didn't appeal, so perhaps we can close this?

Shelim · 2022-11-13T00:24:11Z

@atifaziz
I see there is very strong opposition to adding it to MoreLINQ directly (and I still do not get why, but whatever...), so I believe so. But please, at least mention it somewhere in the docs. Issues do not appear high in search results, and I personally searched LINQ Extended when looking for DistinctUntilChanged and was not even aware of the Rx async components. They did not appear within the first three pages on Google, while MoreLINQ was the first hit...

jods4 · 2022-11-14T08:52:50Z

@atifaziz
If you don't want to add it to MoreLINQ, then you can close this, of course!
I thought it was an excellent fit for MoreLINQ, but it's your call 😉

I don't want to bring in System.Interactive just for a single function that is less than 20 LoC, so I'm gonna stick with my local implementation.

viceroypenguin · 2022-11-14T10:08:47Z

Actually, this is not something that needs a separate operator; there is already a solution using the existing operators:

var distinct = source.Lag(1, (curr, lag) => (curr, lag)).Where(x => !comparer.Equals(x.lag, x.curr)).Select(x => x.curr);

This has similar memory performance to your implementation, but doesn't require a full operator to be implemented. Note that Lag pairs the first element with default(T), so a sequence whose first element equals the default value would lose that element.

atifaziz closed this as completed Apr 1, 2022

atifaziz reopened this Apr 1, 2022

viceroypenguin mentioned this issue Jun 13, 2022

AdjacentDistinctBy() viceroypenguin/SuperLinq#5

Closed

atifaziz mentioned this issue Nov 2, 2023

Improve GroupAdjacent #1032

Open
