Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate PostgreSQL DISTINCT ON #894

Open
roji opened this issue May 28, 2019 · 30 comments
Open

Generate PostgreSQL DISTINCT ON #894

roji opened this issue May 28, 2019 · 30 comments
Labels
Milestone

Comments

@roji
Copy link
Member

roji commented May 28, 2019

EF Core currently pushes down a select expression into a subquery, since a projection would make the results different (SQL DISTINCT operates over the selected columns, whereas C# Distinct() operates on the entire entity).

PostgreSQL has a DISTINCT ON syntax which allows specifying which columns should participate, regardless of whether they're in the projection. We could simply do distinct on the key columns.

Note: LINQ now has DistinctBy

@roji roji added enhancement New feature or request performance labels May 28, 2019
@roji roji added this to the Backlog milestone May 28, 2019
@neobenedict
Copy link

neobenedict commented Aug 6, 2020

Are there any plans for this to be implemented? What's the alternative to DISTINCT ON in the mean time?

@roji
Copy link
Member Author

roji commented Aug 6, 2020

@neobenedict what I had in mind here was an optimization under the hood rather than something users access directly - see dotnet/efcore#15811 (comment) for a discussion on this.

What exactly is it that you're trying to do, do you have a LINQ query that isn't translating as you'd like it to?

@neobenedict
Copy link

neobenedict commented Aug 6, 2020

I need to perform a seach on a very large table of records, but only search the most recent record of a certain type.

Here's the current query generated by entity framework
https://gist.github.com/neobenedict/fc0ba82a2d7604173700fdd69954a557

I need to modify it by changing the start to SELECT DISTINCT ON ("f.userSvtId") and then add that to the ORDER BY, then wrap the entire query in another ORDER BY like so...

https://gist.github.com/neobenedict/9dfa6e99a6c5cda3383832bf4c00faa4

This then:

  • Takes the most recent entry per userSvtId in either of the two tables depending on what the DISTINCT ON is searching
  • Filters it
  • Sorts it

On 500k rows this takes 200ms, moving the DISTINCT ON into the FROM (which would work with entity framework, ie .FromSqlRaw("SELECT DISTINCT ON(SomeColumn), * FROM Table ORDER BY Meh").Where(x => x.Blah == 42)) and doing all the filtering in a separate query takes 2000ms, 10x slower.

@roji
Copy link
Member Author

roji commented Aug 6, 2020

Can you post the LINQ you're using with EF Core?

@neobenedict
Copy link

neobenedict commented Aug 6, 2020

Rough idea without the unique filtering I need:

var baseQuery = Db.FriendServants;
baseQuery = baseQuery.Where(x => x.svtId == (int)SearchServantId
            && x.skillLv1 >= SearchServantSkill1
            && x.skillLv2 >= SearchServantSkill2
            && x.skillLv3 >= SearchServantSkill3
            && x.treasureDeviceLv >= SearchServantNp);

baseQuery = baseQuery.Where(x => x.equip.svtId == (int)SearchCraftEssenceId);

var results = baseQuery.OrderByDescending(x => x.updated).ToListAsync(CancellationToken.Token);

Any attempt to put a .GroupBy seems to load the entire table into memory so it's a no-go.

x.equip is a virtual field public virtual FriendEquipModel equip { get; set; } which is the join.

I have resorted to writing raw SQL again for now unfortunately.

@neobenedict
Copy link

neobenedict commented Aug 6, 2020

A way around this is to add a column to the table bool latest and whenever something new with the same userSvtId is added, set that to true and everything else with the same ID to false. Then just .Where(x => x.latest == true). I think I may go with that approach instead...

@roji
Copy link
Member Author

roji commented Aug 6, 2020

I'm confused - there's no distinct operation anywhere in the LINQ or in the 1st SQL query above... I don't have the full context on what you're trying to do, but your two SQL queries above seem like they'd return quite different results.

If you're just trying to get the latest row for the same userSvtId, the simple way would be to have an (indexed) timestamp column, and then get the row with the latest timestamp for each user. This isn't related to DISTINCT in any way.

Otherwise, maybe try producing minimal LINQ and SQL queries so it's easier to understand what you're trying to achieve and where DISTINCT ON would fit in.

Any attempt to put a .GroupBy seems to load the entire table into memory so it's a no-go.

Uhh, that is definitely not supposed to happen... What version of EF Core are you using? Unless it's pre-3.0, can you please do a minimal repro for GroupBy loading the entire table?

@neobenedict
Copy link

neobenedict commented Aug 6, 2020

The distinct operation needed is in the second gist. I don't have any idea how to implement it into LINQ without .GroupBy. But here's how it can be done without SQL:

var baseQuery = Db.FriendServants;
baseQuery = baseQuery.Where(x => x.svtId == (int)SearchServantId
            && x.skillLv1 >= SearchServantSkill1
            && x.skillLv2 >= SearchServantSkill2
            && x.skillLv3 >= SearchServantSkill3
            && x.treasureDeviceLv >= SearchServantNp);

baseQuery = baseQuery.Where(x => x.equip.svtId == (int)SearchCraftEssenceId);

var prelimResults = baseQuery.OrderByDescending(x => x.updated);

var results = new List<FriendServantsModel>();

foreach(var result in prelimResults) 
{
    if (results.Count(x => x.userSvtId == result.userSvtId) > 0) // As it is already sorted by time, the latest will always be added first
        continue;
    results.Add(result);
}

@roji
Copy link
Member Author

roji commented Aug 6, 2020

That doesn't really seem to be related to DISTINCT or DISTINCT BY... Are you looking for something like the following?

var blogsWithTheirLatestPosts = ctx.Blogs.Select(b => new
{
    Blog = b,
    LatestPost = b.Posts.OrderBy(p => p.CreatedOn).Last()
}).ToList();

This produces:

SELECT b."Id", b."Name", t0."Id", t0."BlogId", t0."CreatedOn"
FROM "Blogs" AS b
LEFT JOIN (
    SELECT t."Id", t."BlogId", t."CreatedOn"
    FROM (
        SELECT p."Id", p."BlogId", p."CreatedOn", ROW_NUMBER() OVER(PARTITION BY p."BlogId" ORDER BY p."CreatedOn" DESC) AS row
        FROM "Post" AS p
    ) AS t
    WHERE t.row <= 1
) AS t0 ON b."Id" = t0."BlogId"

@neobenedict
Copy link

neobenedict commented Aug 6, 2020

I'm not sure how to use this syntax.

        var finalQuery = baseQuery.Select(b => Db.FriendServants.Where(x => x.userSvtId == b.userSvtId).OrderBy(x => x.updated).Last());

Seems to produce 771 results instead of 601, as it contains many of the same userSvtId. The extras are a duplicate result of the most recent. It's also really, really slow. 2.5 seconds vs 0.2

SQL: https://gist.github.com/neobenedict/c84355c5b8d00e5fad50c60d34cba506

@neobenedict
Copy link

OK adding distinct() does work here but it's still very slow: https://explain.depesz.com/s/4yQZ

@roji
Copy link
Member Author

roji commented Aug 8, 2020

I've done a quick comparison between ROW_NUMBER() and DISTINCT ON and the latter is indeed faster. For extracting 1 million rows out of a 3 million row table, ROW_NUMBER() took 1624ms whereas DISTINCT ON did the same in 688ms (the plans outputted by EXPLAIN show a perf gap as well). See below for the precise scenario and benchmark, and thanks for drawing my attention to this @neobenedict .

There is currently no way to make EF Core generate SQL with DISTINCT ON, since that's a PostgreSQL-specific SQL extension (that's what this issue is about). You can still use a raw SQL query though, as you've done above.

@neobenedict note that you've reported a difference between 2.5 seconds and 0.2, which is much more than I've seen in my scenario. You may want to investigate deeper and isolate all other irrelevant components out of your query, so that we get a better idea of what's going on. Also, if perf is important you probably want to consider a composite index that would speed up the query, in case you haven't done that already.

Benchmark

Setup

DROP TABLE IF EXISTS data;
CREATE TABLE data (
    id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    parent_id INTEGER,
    updated TIMESTAMP);

CREATE INDEX comp_idx ON data (parent_id, updated);

DO $$BEGIN
    FOR i IN 1..1000000 LOOP
        INSERT INTO data (parent_id, updated) VALUES (i, '2020-01-01'::timestamp);
        INSERT INTO data (parent_id, updated) VALUES (i, '2020-01-02'::timestamp);
        INSERT INTO data (parent_id, updated) VALUES (i, '2020-01-03'::timestamp);
    END LOOP;
END$$;

ROW_NUMBER

SELECT id, parent_id, updated FROM (
    SELECT *, row_number() OVER (PARTITION BY parent_id ORDER BY updated) AS row
    FROM data
) AS t
WHERE row = 1;

DISTINCT ON

SELECT DISTINCT ON (parent_id) id, parent_id, updated
FROM data
ORDER BY parent_id, updated;

/cc @smitpatel for an interesting use of the PG-specific DISTINCT ON instead of some usages of ROW_NUMBER.

@neobenedict
Copy link

neobenedict commented Aug 8, 2020

I've ended up just adding a column that does the job of DISTINCT ON (true if it can be searched, false if not) and updating that from elsewhere in the code, as it's way faster and I don't have to change my complicated search query into raw SQL.

But yes, implementation of this would be a good addition, as there's many other areas I've had to use DISTINCT ON with raw SQL in my codebase. I would do it myself but I don't know how I would go about writing an extension for this unfortunately (can an extension class be made, or does it require modifying efcore.pg itself?), and my time at the moment is quite limited anyway.

@roji
Copy link
Member Author

roji commented Aug 8, 2020

@neobenedict I definitely agree that if having an additional column is an option for you, it's probably the simplest/easiest way to fix this.

Having EFCore.PG generate DISTINCT ON instead of ROW_NUMBER (where possible) is likely to be a non-trivial change... It could be done as an extension (replacing the stock EFCore.PG service which translates LINQ to SQL), but since this seems like superior SQL it's probably better to integrate this change into EFCore.PG itself (and there wouldn't be much difference in the work required).

I'm unlikely to get around to doing this anytime soon - if you feel like giving this a stab please post back here so we know you're looking at it. Otherwise I'll probably get around to it at some point.

And thanks again for posting your use case! I didn't have DISTINCT ON in mind as a possible replacement for EF Core's current ROW_NUMBER mechanism!

@smitpatel
Copy link

DISTINCT ON and ROW_NUMBER are different things. And of course ROW_NUMBER is slow because it assigns number to each row, not just drop duplicated one.

  • While nowhere written in the issue, query user is looking for is GroupBy().Select(g => g.OrderBy().FirstOrDefault()). GroupBy operation would do distinct automatically. Currently this is in backlog in EF Core.
  • I doubt that DISTINCT ON can match any window function since window functions don't remove any duplicates. If you are just selecting first row then sure it would be same result but still different calculations under the hood. And translation pipeline is dealing with general purpose case not very specific case where this can be used. So postgre can post-process after translation. (Still I think first point above is better query).

In case postgre want to optimize the translation and not postprocess then it can pre-process the LINQ query shape to add a custom operation (or can also provide a LINQ method to users). And then translate it accordingly in provider.

@neobenedict
Copy link

* While nowhere written in the issue, query user is looking for is GroupBy().Select(g => g.OrderBy().FirstOrDefault()). GroupBy operation would do distinct automatically. Currently this is in backlog in EF Core.

As far as I remember, I tried similar queries to this with .First(), .FirstOrDefault(), Last() and so on and always received a similar error:

System.InvalidOperationException: The LINQ expression '(GroupByShaperExpression:
KeySelector: new { 
    userId = (f.userId), 
    region = (f.region)
 }, 
ElementSelector:(EntityShaperExpression: 
    EntityType: FriendSupportModel
    ValueBufferExpression: 
        (ProjectionBindingExpression: EmptyProjectionMember)
    IsNullable: False
)
)
    .First()' could not be translated. Either rewrite the query in a form that can be translated, or switch to client evaluation explicitly by inserting a call to either AsEnumerable(), AsAsyncEnumerable(), ToList(), or ToListAsync(). See https://go.microsoft.com/fwlink/?linkid=2101038 for more information.
   at Microsoft.EntityFrameworkCore.Query.RelationalSqlTranslatingExpressionVisitor.VisitMethodCall(MethodCallExpression methodCallExpression)
   at Npgsql.EntityFrameworkCore.PostgreSQL.Query.Internal.NpgsqlSqlTranslatingExpressionVisitor.VisitMethodCall(MethodCallExpression methodCall)

I don't have the original queries for these any more but the specific problem was joining the distinct row back into the query. So I could do a FirstOrDefault() in a select but not a join.

Here's (part of) the query in use now

(from svt in finalQuery
                           join su in
                               Db.FriendSupports.Where(x => x.isMostRecent) on svt.userId equals su.userId
                           join users in Db.Users.Where(x => x.FriendId) on su.friendCode equals users.FriendId into ua
                           from ue in ua.DefaultIfEmpty()
                           join up in Db.FriendProfileOptions.Where(x => x.Region == 2) on ue.Id equals up.UserId into upa
                           from upo in upa.DefaultIfEmpty()
                           select new SupportPlayerPairing()
                           {
                               ...
                           });

So it would be having to replace that x.isMostRecent with the distinct selector instead, which iirc created the above error. Since a true/false index is so much faster I'm keeping it either way.

@neobenedict
Copy link

neobenedict commented Aug 10, 2020

Reproduced a similar bug:

            var test = Db.FriendSupports.GroupBy(x => x.userId)
                .Select(g => g.OrderBy(x => x.updated)
                    .FirstOrDefault()).ToList();
System.InvalidOperationException: The LINQ expression '(GroupByShaperExpression:
KeySelector: (f.userId), 
ElementSelector:(EntityShaperExpression: 
    EntityType: FriendSupportModel
    ValueBufferExpression: 
        (ProjectionBindingExpression: EmptyProjectionMember)
    IsNullable: False
)
)
    .OrderBy(x => x.updated)' could not be translated. Either rewrite the query in a form that can be translated, or switch to client evaluation explicitly by inserting a call to either AsEnumerable(), AsAsyncEnumerable(), ToList(), or ToListAsync(). See https://go.microsoft.com/fwlink/?linkid=2101038 for more information.
   at Microsoft.EntityFrameworkCore.Query.RelationalSqlTranslatingExpressionVisitor.VisitMethodCall(MethodCallExpression methodCallExpression)
   at Npgsql.EntityFrameworkCore.PostgreSQL.Query.Internal.NpgsqlSqlTranslatingExpressionVisitor.VisitMethodCall(MethodCallExpression methodCall)
   at Microsoft.EntityFrameworkCore.Query.RelationalSqlTranslatingExpressionVisitor.VisitMethodCall(MethodCallExpression methodCallExpression)
   at Npgsql.EntityFrameworkCore.PostgreSQL.Query.Internal.NpgsqlSqlTranslatingExpressionVisitor.VisitMethodCall(MethodCallExpression methodCall)
   at Microsoft.EntityFrameworkCore.Query.RelationalSqlTranslatingExpressionVisitor.Translate(Expression expression)
   at Microsoft.EntityFrameworkCore.Query.Internal.RelationalProjectionBindingExpressionVisitor.Visit(Expression expression)
   at Microsoft.EntityFrameworkCore.Query.Internal.RelationalProjectionBindingExpressionVisitor.Translate(SelectExpression selectExpression, Expression expression)
   at Microsoft.EntityFrameworkCore.Query.RelationalQueryableMethodTranslatingExpressionVisitor.TranslateSelect(ShapedQueryExpression source, LambdaExpression selector)
   at Microsoft.EntityFrameworkCore.Query.QueryableMethodTranslatingExpressionVisitor.VisitMethodCall(MethodCallExpression methodCallExpression)
   at Microsoft.EntityFrameworkCore.Query.RelationalQueryableMethodTranslatingExpressionVisitor.VisitMethodCall(MethodCallExpression methodCallExpression)
   at Microsoft.EntityFrameworkCore.Query.QueryCompilationContext.CreateQueryExecutor[TResult](Expression query)
   at Microsoft.EntityFrameworkCore.Storage.Database.CompileQuery[TResult](Expression query, Boolean async)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryCompiler.CompileQueryCore[TResult](IDatabase database, Expression query, IModel model, Boolean async)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryCompiler.<>c__DisplayClass9_0`1.<Execute>b__0()
   at Microsoft.EntityFrameworkCore.Query.Internal.CompiledQueryCache.GetOrAddQueryCore[TFunc](Object cacheKey, Func`1 compiler)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryCompiler.Execute[TResult](Expression query)
   at Microsoft.EntityFrameworkCore.Query.Internal.EntityQueryProvider.Execute[TResult](Expression expression)
   at Microsoft.EntityFrameworkCore.Query.Internal.EntityQueryable`1.GetEnumerator()
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)

@smitpatel
Copy link

As I mentioned in my original comment that issue is in backlog, here is the link dotnet/efcore#12088

Postgre could consider an alternate translation of it based on "Distinct ON" and OrderBy.

@neobenedict
Copy link

neobenedict commented Aug 10, 2020

Ah, I see, missed your comment there, apologies.

Just to clarify, this needs to be implemented into efcore first, before it can be added to efcore.pg?

@smitpatel
Copy link

Based on above error message, a simple GroupBy - Group.FirstOrDefault clears pre-processing pipeline. So postgre could translate it outside of EF Core. If translation would be same then of course doing in EF Core makes more sense to avoid duplicating work. If translation is going to be very postgre specific then perhaps postgre can implement independently whenever.

For raw DISTINCT ON operator, postgre can add additional query operator and appropriate translation for it.

@roji
Copy link
Member Author

roji commented Aug 11, 2020

@smitpatel

DISTINCT ON and ROW_NUMBER are different things.

DISTINCT ON definitely won't replace all uses of ROW_NUMBER, for sure. Apart from GroupBy().Select(FOD)) I was thinking about the below kind of pattern (mentioned above):

var blogsWithTheirLatestPosts = ctx.Blogs.Select(b => new
{
    Blog = b,
    LatestPost = b.Posts.OrderBy(p => p.CreatedOn).Last()
}).ToList();

Do you see any issue with that?

For raw DISTINCT ON operator, postgre can add additional query operator and appropriate translation for it.

Yep, that would be nice. I'll try to look at all this at some point.

@YohDeadfall
Copy link
Contributor

I'll try to look at all this at some point.

Then let's push .NET platform folks on dotnet/runtime#19522, dotnet/runtime#27665 and dotnet/runtime#27687.

@roji
Copy link
Member Author

roji commented Aug 11, 2020

@YohDeadfall if would definitely be nice to have this as a first-class LINQ operator, but I don't think it's blocking for us - we could have the operator under EF.Functions for now.

@YohDeadfall
Copy link
Contributor

Yeah, but we have weighty arguments why this methods should be added. So while inventing a new EF function, we could help the community promoting the issues.

@roji
Copy link
Member Author

roji commented Aug 11, 2020

Tiny note to @smitpatel: as mentioned in dotnet/runtime#27665 (comment), GroupBy - Group.FirstOrDefault can be simplified to .GroupBy(a => (a.Stuff, a.Stuff1), (key, group) => group.First()), which we could also translate if we want to.

Unfortunately it's not easy to add stuff to LINQ (not sure when something new made it in), but I agree it would be best to have this.

@smitpatel
Copy link

GroupBy - Group.FirstOrDefault can be simplified to .GroupBy(a => (a.Stuff, a.Stuff1), (key, group) => group.First())

Put another operator between GroupBy & Select and that falls apart. Anything is really easy to translate when you put things inside GroupBy's result selector. That is not the biggest issue in translations of GroupBy.

@roji
Copy link
Member Author

roji commented Mar 23, 2021

Note: DistinctBy (and other *By operators) are going into .NET 6.0, dotnet/runtime#27687.

@neobenedict
Copy link

Wow, that's great news.

@roji roji modified the milestones: Backlog, 6.0.0 Mar 23, 2021
@roji roji modified the milestones: 6.0.0, Backlog Aug 19, 2021
@roji
Copy link
Member Author

roji commented Feb 18, 2022

Note dotnet/efcore#27470 which tracks this on the EF side. For anyone stuck on this, with EF Core 6.0 you should be able to use the following equivalent construct:

_ = blogs.GroupBy(b => b.Id).Select(g => g.First());

@roji roji added the blocked label Jun 9, 2022
@roji
Copy link
Member Author

roji commented Jun 9, 2022

Marking as blocked - we should have the EF-side infrastructure for this operator before implementing the PG-specific optimization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants