Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Enumerable.SkipLast, TakeLast to get the last N elements of a sequence. #19431

Closed
jamesqo opened this issue Nov 21, 2016 · 27 comments
Closed
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Linq help wanted [up-for-grabs] Good issue for external contributors
Milestone

Comments

@jamesqo
Copy link
Contributor

jamesqo commented Nov 21, 2016

Background

Currently, Linq does not offer an API for getting the last N elements of a sequence. This has resulted in no less than 5 StackOverflow questions (with ~300 votes) popping up as the first result when you Google "skip last n elements of enumerable."

Proposal

We should add SkipLast and TakeLast to Enumerable, which will skip/take the last N elements of the sequence, respectively. If there are less than N elements, SkipLast will return an empty sequence, and TakeLast will return a sequence with contents equivalent to the original enumerable.

namespace System.Linq
{
    public static class Enumerable
    {
        public static IEnumerable<TSource> SkipLast<TSource>(this IEnumerable<TSource> source, int count);
        public static IEnumerable<TSource> TakeLast<TSource>(this IEnumerable<TSource> source, int count);
    }
    public static class Queryable
    {
        // Lots of existing methods for parity with Enumerable
   
        public static IQueryable<T> SkipLast<T>(this IQueryable<T> source, int count);
        public static IQueryable<T> TakeLast<T>(this IQueryable<T> source, int count);
    }
}

Remarks

  • In SkipLast instead of evaluating the whole thing at once, we will read in the first count items into a circular buffer. Then we will interleave between yield returning the oldest element & overwriting that with a new one.

  • In TakeLast we will still evaluate the whole sequence during the first iteration, but again maintain a circular buffer of length count and overwrite the oldest elements with newer ones.

Original (incorrect) - Implementation remarks

These overloads will have subtly different semantics from Skip and Take. Since for lazy enumerables we can't determine the count of the sequence in advance, we will have to capture it into an array during our first iteration (like Reverse does today), then skip/take the last N items from that array between yields.

    public static IEnumerable<TSource> SkipLast<TSource>(this IEnumerable<TSource> source, int count)
    {
        var array = source.ToArray();

        for (int i = count; i < array.Length; i++)
        {
            yield return array[i];
        }
    }

As a result, these will have different outputs:

int[] a = { 1, 2, 3, 4, 5 };

foreach (var item in a.Skip(3))
{
    a[4] = 10;
    Console.WriteLine(item); // 4 10
}

a[4] = 5;

foreach (var item in a.TakeLast(2))
{
    a[4] = 10;
    Console.WriteLine(item); // 4 5
}

Another consequence is that this will lead to more allocations for lists/arrays. Perhaps this is worth it for added convenience, though.

@jamesqo
Copy link
Contributor Author

jamesqo commented Nov 21, 2016

cc @JonHanna, @svick, @VSadov, @justinvp

@JonHanna
Copy link
Contributor

The difference in outputs isn't too big a concern IMO, especially since variations in behaviour along similar lines already exist.

I've wanted a TakeLast() in the past, though it wasn't too arduous in the circumstances to just ToList() and then take the chunk I needed, it's always nice to have a convenient and (if possible) efficient form to hand. SkipLast() would at the very least be analogous (I'm of the opinion that things that seem like gaps in an API are as much in need of justification as additions are).

The most memory-efficient approach to TakeLast(4) would only buffer 4 elements. The most time-efficient would buffer the lot and hence not need to do any copying. Having two buffers of 4 elements seems a reasonable balance between the two, as large enough that no copying would be needed, but then immediately opting for that approach would mean that TakeLast(10000) could create a massive buffer even if there were only a small number of elements in the source, so two growing buffers would be needed to avoid that.

@jamesqo
Copy link
Contributor Author

jamesqo commented Nov 21, 2016

@JonHanna

The most memory-efficient approach to TakeLast(4) would only buffer 4 elements. The most time-efficient would buffer the lot and hence not need to do any copying.

I don't think we would need to do any copying with a buffer of 4 elements, actually. We could just implement it as a circular buffer of length N that automatically overwrote the most stale items, if I understand correctly. You're right though that we may need to do some dynamic growth in case count is really large.

@svick
Copy link
Contributor

svick commented Nov 21, 2016

@JonHanna I don't understand, why would buffering only 4 elements require copying? The way I imagine the implementation, it would use List<T> in two phases. Phase one would be filling the list, phase two would treat it as circular buffer. At the end, you would yield items from the list (starting from the right position). Is there any copying in there that I'm missing?

SkipLast could be implemented almost the same way, except it would yield while iterating in phase two, not after iterating like TakeLast.

In code:

public static IEnumerable<T> TakeLast<T>(this IEnumerable<T> source, int count)
{
	var buffer = new List<T>();
	int pos = 0;
	
	foreach (var item in source)
	{
		if (buffer.Count < count)
		{
			// phase 1
			buffer.Add(item);
		}
		else
		{
			// phase 2
			buffer[pos] = item;
			pos = (pos+1) % count;
		}
	}
	
	for (int i = 0; i < buffer.Count; i++)
	{
		yield return buffer[pos];
		pos = (pos+1) % count;
	}
}

public static IEnumerable<T> SkipLast<T>(this IEnumerable<T> source, int count)
{
	var buffer = new List<T>();
	int pos = 0;
	
	foreach (var item in source)
	{
		if (buffer.Count < count)
		{
			// phase 1
			buffer.Add(item);
		}
		else
		{
			// phase 2
			yield return buffer[pos];
			buffer[pos] = item;
			pos = (pos+1) % count;
		}
	}
}

@JonHanna
Copy link
Contributor

Yes, we could be circular. You are both correct.

@jamesqo
Copy link
Contributor Author

jamesqo commented Nov 24, 2016

@karelz, maybe this could be marked ready-for-review? I don't think this would be a very controversial API; as I showed in the description there are many questions on StackOverflow this would solve.

@karelz
Copy link
Member

karelz commented Nov 24, 2016

@VSadov @OmarTawfik are in the middle of Linq area triage (see areas), they will get to it ...

@jamesqo
Copy link
Contributor Author

jamesqo commented Nov 24, 2016

@karelz 👍 Thanks for linking me to that document, I did not know about it.

@karelz
Copy link
Member

karelz commented Nov 24, 2016

Docs discoverability and cleanup is on my list for December :)

@karelz karelz changed the title Proposal: Enumerable.SkipLast, TakeLast to get the last N elements of a sequence. Add Enumerable.SkipLast, TakeLast to get the last N elements of a sequence. Nov 29, 2016
@terrajobst
Copy link
Member

The API looks good as proposed. @stephentoub reminded us that Queue<T> already uses a circular buffer strategy, so implementing this wouldn't be too hard.

@JonHanna
Copy link
Contributor

I'd like to see Queryable keep its parity with Enumerable on any method that doesn't materialise a collection (since that drags things into memory anyway) or attach elements like Append.

@JonHanna
Copy link
Contributor

Right, I wasn't going to try typing this on my phone. I think we should have something like this:

public static IQueryable<T> SkipLast<T>(this IQueryable<T> source, int count)
{
    if (source == null)
    {
        throw Error.ArgumentNull(nameof(source));
    }

    EnumerableQuery eq = source as EnumerableQuery;
    if (eq != null)
    {
        // Call optimised version for in-memory collections.
        EnumerableQuery<T> eqt = source as EnumerableQuery<T>;
        if (eqt != null)
        {
            return new EnumerableQuery<T>(eqt.AsEnumerable().SkipLast(count));
        }

        // EnumerableQuery of a different element type, passed covariantly. Call for the actual type.
        return (IQueryable<T>) new Func<IQueryable<T>, int, IQueryable<T>>(SkipLast).GetMethodInfo()
            .GetGenericMethodDefinition()
            .MakeGenericMethod(source.ElementType)
            .Invoke(null, new object[] {source, count});
    }

    // Create a query that does a SkipLast. QueryProviders may recongise the sequence and optimise further
    return source.Reverse().Skip(count).Reverse();
}

public static IQueryable<T> TakeLast<T>(this IQueryable<T> source, int count)
{
    if (source == null)
    {
        throw Error.ArgumentNull(nameof(source));
    }

    EnumerableQuery eq = source as EnumerableQuery;
    if (eq != null)
    {
        // Call optimised version for in-memory collections.
        EnumerableQuery<T> eqt = source as EnumerableQuery<T>;
        if (eqt != null)
        {
            return new EnumerableQuery<T>(eqt.AsEnumerable().TakeLast(count));
        }

        // EnumerableQuery of a different element type, passed covariantly. Call for the actual type.
        return (IQueryable<T>)new Func<IQueryable<T>, int, IQueryable<T>>(SkipLast).GetMethodInfo()
            .GetGenericMethodDefinition()
            .MakeGenericMethod(source.ElementType)
            .Invoke(null, new object[] {source, count});
    }

    // Create a query that does a TakeLast. QueryProviders may recongise the sequence and optimise further
    return source.Reverse().Take(count).Reverse();
}

@karelz
Copy link
Member

karelz commented Nov 29, 2016

Do we have a static type for extension methods where we should put these?
Should we track these APIs as separate API proposal (given that we already have approval), or should we add it to it and go back to review next week?
Are there more APIs missing on IQueryable? Do we need to do broader sweep to bring it to parity with Enumerable?

Of course: I'd like to wait on @VSadov @OmarTawfik to agree with the approach.

@JonHanna
Copy link
Contributor

JonHanna commented Nov 29, 2016

@karelz they'd go into System.Linq.Queryable and currently there's parity bar those I mentioned above that only make sense to have for one or the other (e.g. to do ToList we have to enumerate elements anyway, and therefore it only belongs on Enumerable). There's also a test in the test project for each that'll fail if you add a public method to one and not either add it to the other, or add it to the list of exceptions in the test (which should of course be done because there's a good reason for the exception, rather than just to make the failure go away).

(Actually, we had one parity gap in that IOrderedQueryable<T> was covariant and IOrderedEnumerable<T> wasn't. So we've had parity as of 5 hours ago).

@karelz
Copy link
Member

karelz commented Nov 29, 2016

Let's make it part of this proposal then. It was just oversight we didn't make it part of the original proposal.

Adding:

public static class System.Linq.Queryable
{
    // Lots of existing methods for parity with Enumerable
   
    public static IQueryable<T> SkipLast<T>(this IQueryable<T> source, int count);
    public static IQueryable<T> TakeLast<T>(this IQueryable<T> source, int count);
}

@karelz karelz closed this as completed Nov 29, 2016
@karelz karelz reopened this Nov 29, 2016
@JonHanna
Copy link
Contributor

JonHanna commented Nov 29, 2016

The alternative implementation approach is to do much as we already do for queryable methods:

public static IQueryable<TSource> SkipLast<TSource>(this IQueryable<TSource> source, int count)
{
    if (source == null)
        throw Error.ArgumentNull(nameof(source));
    return source.Provider.CreateQuery<TSource>(
        Expression.Call(
            null,
            // We'd cache this methodinfo, but for sake of discussion…
            new Func<IQueryable<TSource>, int, IQueryable<TSource>>(Queryable.SkipLast).GetMethodInfo(),
            source.Expression, Expression.Constant(count)
            ));
}

And then leave it up to query providers to add support. It's obviously simpler, but I wonder how long it would take the providers to catch up.

@karelz
Copy link
Member

karelz commented Nov 29, 2016

OK, I got offline approval on the Queyable part.

Next step: We need someone to implement it. Any takers?

@jamesqo
Copy link
Contributor Author

jamesqo commented Nov 29, 2016

@karelz I will implement the Enumerable part; I've been thinking about it for a while now. I am not all that familiar with Queryable though, so @JonHanna can take that if he wants.

@karelz
Copy link
Member

karelz commented Nov 30, 2016

I think we should take only PR with both parts. If we do just half, it will create artificial debt.
@jamesqo can you work with @JonHanna together and prepare a change for both?

@JonHanna
Copy link
Contributor

They should really go in together, but if you make a start on the enumerable bit I can help with the queryable.

The question is whether to take the first approach I mention (rewrites itself to a query current technology supports) or the second (describes itself like other queryable methods do and lets the provider deal with it).

@VSadov @bartdesmet opinions? (Input people involved with Entity Framework and other query providers would be good too, but I've no idea who to ping on that).

@bartdesmet
Copy link
Contributor

I would stick with the second traditional approach for a variety of reasons.

First, I wouldn't force the need to write a decompiler to recognize e.g. xs.Reverse().Take(n).Reverse() as xs.TakeLast(n) in every query provider, including new ones that may pop up. It adds complexity to support a direct mapping of a query operator to a target query language element (even the simplest &last=n append to a query string would require all this reverse engineering).

Second, even if existing query providers work well with an expanded form of the query operator into a composition of existing operators, it may lead to very bad performance. This is a similar concern as the first reason above. To fix it, a query provider author would now be faced with recognizing the expanded pattern rather than simply start to recognize TakeLast etc.

Third, if we expand the pattern today, there's no way back to do the simplest quoting possible at some point in the future, for it'd be a breaking change. I'd rather bite the bullet now and state that the set of query operators is open-ended and subject to future extension. In fact, most query providers only support a subset of query operators and throw for the ones they don't recognize.

Fourth, GroupBy returns an IQueryable<IGrouping<K, T>>, thus exposing IEnumerable<T> for the inner sequences. Writing an IQueryable query a la xs.GroupBy(f).SelectMany(g => g.TakeLast(n).Sum()) will result in binding of TakeLast and Sum to the Enumerable-based methods. As such, the expression tree received by the query provider will see the Enumerable.TakeLast method in a Call expression. For TakeLast to be accepted by the query provider, it'd have to be enlightened about this case anyway. All in all, it makes the result less predictable: TakeLast would only work if it doesn't occur in a quoted lambda expression deeper down the tree.

I've done a fair amount of query provider stuff myself, but @divega from EF could chime in here as well. I think we should keep Queryable methods as the glue that builds straightforward quotes of the method calls and honor the principle of least surprise. Optimizers, rewriters, etc. can be layered on top. In fact, some of the operator fusion we're doing in IEnumerable<T> land can be done in the IQueryable<T> space by rewriting the expression based on algebraic identities of operators as part of a general IQueryable<T> optimizer. We got quite a few of these in the projects I'm working on in Bing here.

@bartdesmet
Copy link
Contributor

FWIW, Ix.NET (part of Rx.NET, providing the IEnumerable<T> dual to our IObservable<T> operators) does have TakeLast and SkipLast including query provider support. See QueryableEx.cs for the query provider part. It provides sibling IEnumerable<T> methods in the same class, which get picked up by EnumerableRewriter, allowing for rebinding to the LINQ to Objects version.

The IEnumerable<T> implementations (not necessary optimal, though honoring laziness) can be found in Take.cs (using a queue) and Skip.cs.

@divega
Copy link
Contributor

divega commented Nov 30, 2016

I agree with @bartdesmet on sticking to the traditional approach for Queryable methods. Among all the reasons he mentioned, I think query providers should really control how unrecognized methods are handled. E.g. while some providers will throw, others could switch to in-memory evaluation or call back into user code.

@JonHanna
Copy link
Contributor

Or indeed they could do what the first suggestion does very quickly and have producing a more optimal implementation as a TODO.
Right so, the second approach it is. @jamesqo ping me when you've a branch with support for this, even if it isn't fully tested, and I'll so how this side of it works.
(We're going to have to put in a P2P reference from S.L.Queryable to S.L, temporarily. I trust that is okay?)

@jamesqo
Copy link
Contributor Author

jamesqo commented Dec 3, 2016

@bartdesmet Thanks for the links; the reference implementations were useful to look at.

@JonHanna I made an implementation + some tests here, so you are free to continue where I left off. There is a caveat, though: I was unsure how to get the stuff under netcoreapp1.1 to build (I tried passing /p:TargetGroup=netcoreapp1.1 which didn't seem to work), after trying for half an hour. So I don't know if the tests will pass or even compile. Sorry about that.

@JonHanna
Copy link
Contributor

JonHanna commented Dec 4, 2016

Grand so. Pull https://github.com/jamesqo/corefx/pull/3 into that branch and you've got queryable support done, including tests. The tests don't need to be as extensive as considering all the possible permutations of types of source and optimised cases is for the provider to worry about (which to start off is just you, since Enumerable is the basis of most of the work for linq-to-objects on queryable). I've ignored the versioning-related issues. I have put in a P2P reference in queryable/tests so that the tests can call into the current version of S.Linq to run (they pass!) but not done the reverse so that the consistency tests in S.Linq can pass.

FYI, the actual methods here are quite simple, aside from a little bit of effort into reducing the cost of reflection. The methods do that reflection on themselves so that e.g. a call to SkipLast returns a new queryable with an expression based on the previous but with "and then call SkipLast on it" added. This self-referential quality leaves it up to the query providers to actually know what to do with that. In the case of an IEnumerable<T> treated AsQueryable() the provider is EnumerableQuery and all of the heavy lifting for it gets done in EnumerableRewriter by taking all the calls to Queryable's methods and rewriting them as calls to the equivalent method on Enumerable, so in such a case it ends up coming back to your methods. Other query providers would have to attempt to do something else with it.

@jamesqo
Copy link
Contributor Author

jamesqo commented Dec 4, 2016

@JonHanna Thanks for the explanation.

I have put in a P2P reference in queryable/tests so that the tests can call into the current version of S.Linq to run (they pass!) but not done the reverse so that the consistency tests in S.Linq can pass.

Does this mean you got the SkipLast / TakeLast tests you added in jamesqo#3 to compile and run? If so, can you share how? When I run msbuild /t:RebuildAndTest this is the list of files that is fed to the compiler

AggregateTests.cs AllTests.cs AnyTests.cs AsEnumer
ableTests.cs AverageTests.cs CastTests.cs AppendPrependTests.cs ConcatTests.cs C
onsistencyTests.cs ContainsTests.cs CountTests.cs DefaultIfEmptyTests.cs Distinc
tTests.cs ElementAtOrDefaultTests.cs ElementAtTests.cs EmptyEnumerable.cs EmptyP
artitionTests.cs EnumerableDebugViewTests.cs EnumerableTests.cs ExceptTests.cs F
irstOrDefaultTests.cs FirstTests.cs GroupByTests.cs GroupJoinTests.cs IntersectT
ests.cs JoinTests.cs LastOrDefaultTests.cs LastTests.cs LongCountTests.cs MaxTes
ts.cs MinTests.cs OfTypeTests.cs OrderByDescendingTests.cs OrderByTests.cs Order
edSubsetting.cs RangeTests.cs RepeatTests.cs ReverseTests.cs SelectManyTests.cs
SelectTests.cs SequenceEqualTests.cs ShortCircuitingTests.cs Shuffler.cs SingleO
rDefaultTests.cs SingleTests.cs SkipTests.cs SkipWhileTests.cs SumTests.cs TakeT
ests.cs TakeWhileTests.cs ThenByDescendingTests.cs ThenByTests.cs ToArrayTests.c
s ToDictionaryTests.cs ToListTests.cs ToLookupTests.cs UnionTests.cs WhereTests.
cs ZipTests.cs

Notice that all of the netcoreapp1.1 tests are not being included (even after I tried passing /p:TargetGroup=netcoreapp1.1) so I'm unsure how to run them locally.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 2.0.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 27, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Linq help wanted [up-for-grabs] Good issue for external contributors
Projects
None yet
Development

No branches or pull requests

8 participants