-
Notifications
You must be signed in to change notification settings - Fork 227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate PostgreSQL DISTINCT ON #894
Comments
Are there any plans for this to be implemented? What's the alternative to |
@neobenedict what I had in mind here was an optimization under the hood rather than something users access directly - see dotnet/efcore#15811 (comment) for a discussion on this. What exactly is it that you're trying to do, do you have a LINQ query that isn't translating as you'd like it to? |
I need to perform a seach on a very large table of records, but only search the most recent record of a certain type. Here's the current query generated by entity framework I need to modify it by changing the start to SELECT DISTINCT ON ("f.userSvtId") and then add that to the ORDER BY, then wrap the entire query in another ORDER BY like so... https://gist.github.com/neobenedict/9dfa6e99a6c5cda3383832bf4c00faa4 This then:
On 500k rows this takes 200ms, moving the DISTINCT ON into the FROM (which would work with entity framework, ie |
Can you post the LINQ you're using with EF Core? |
Rough idea without the unique filtering I need: var baseQuery = Db.FriendServants;
baseQuery = baseQuery.Where(x => x.svtId == (int)SearchServantId
&& x.skillLv1 >= SearchServantSkill1
&& x.skillLv2 >= SearchServantSkill2
&& x.skillLv3 >= SearchServantSkill3
&& x.treasureDeviceLv >= SearchServantNp);
baseQuery = baseQuery.Where(x => x.equip.svtId == (int)SearchCraftEssenceId);
var results = baseQuery.OrderByDescending(x => x.updated).ToListAsync(CancellationToken.Token); Any attempt to put a
I have resorted to writing raw SQL again for now unfortunately. |
A way around this is to add a column to the table |
I'm confused - there's no distinct operation anywhere in the LINQ or in the 1st SQL query above... I don't have the full context on what you're trying to do, but your two SQL queries above seem like they'd return quite different results. If you're just trying to get the latest row for the same userSvtId, the simple way would be to have an (indexed) timestamp column, and then get the row with the latest timestamp for each user. This isn't related to DISTINCT in any way. Otherwise, maybe try producing minimal LINQ and SQL queries so it's easier to understand what you're trying to achieve and where DISTINCT ON would fit in.
Uhh, that is definitely not supposed to happen... What version of EF Core are you using? Unless it's pre-3.0, can you please do a minimal repro for GroupBy loading the entire table? |
The distinct operation needed is in the second gist. I don't have any idea how to implement it into LINQ without var baseQuery = Db.FriendServants;
baseQuery = baseQuery.Where(x => x.svtId == (int)SearchServantId
&& x.skillLv1 >= SearchServantSkill1
&& x.skillLv2 >= SearchServantSkill2
&& x.skillLv3 >= SearchServantSkill3
&& x.treasureDeviceLv >= SearchServantNp);
baseQuery = baseQuery.Where(x => x.equip.svtId == (int)SearchCraftEssenceId);
var prelimResults = baseQuery.OrderByDescending(x => x.updated);
var results = new List<FriendServantsModel>();
foreach(var result in prelimResults)
{
if (results.Count(x => x.userSvtId == result.userSvtId) > 0) // As it is already sorted by time, the latest will always be added first
continue;
results.Add(result);
} |
That doesn't really seem to be related to DISTINCT or DISTINCT BY... Are you looking for something like the following? var blogsWithTheirLatestPosts = ctx.Blogs.Select(b => new
{
Blog = b,
LatestPost = b.Posts.OrderBy(p => p.CreatedOn).Last()
}).ToList(); This produces: SELECT b."Id", b."Name", t0."Id", t0."BlogId", t0."CreatedOn"
FROM "Blogs" AS b
LEFT JOIN (
SELECT t."Id", t."BlogId", t."CreatedOn"
FROM (
SELECT p."Id", p."BlogId", p."CreatedOn", ROW_NUMBER() OVER(PARTITION BY p."BlogId" ORDER BY p."CreatedOn" DESC) AS row
FROM "Post" AS p
) AS t
WHERE t.row <= 1
) AS t0 ON b."Id" = t0."BlogId" |
I'm not sure how to use this syntax. var finalQuery = baseQuery.Select(b => Db.FriendServants.Where(x => x.userSvtId == b.userSvtId).OrderBy(x => x.updated).Last());
SQL: https://gist.github.com/neobenedict/c84355c5b8d00e5fad50c60d34cba506 |
OK adding distinct() does work here but it's still very slow: https://explain.depesz.com/s/4yQZ |
I've done a quick comparison between ROW_NUMBER() and DISTINCT ON and the latter is indeed faster. For extracting 1 million rows out of a 3 million row table, ROW_NUMBER() took 1624ms whereas DISTINCT ON did the same in 688ms (the plans outputted by EXPLAIN show a perf gap as well). See below for the precise scenario and benchmark, and thanks for drawing my attention to this @neobenedict . There is currently no way to make EF Core generate SQL with DISTINCT ON, since that's a PostgreSQL-specific SQL extension (that's what this issue is about). You can still use a raw SQL query though, as you've done above. @neobenedict note that you've reported a difference between 2.5 seconds and 0.2, which is much more than I've seen in my scenario. You may want to investigate deeper and isolate all other irrelevant components out of your query, so that we get a better idea of what's going on. Also, if perf is important you probably want to consider a composite index that would speed up the query, in case you haven't done that already. BenchmarkSetupDROP TABLE IF EXISTS data;
CREATE TABLE data (
id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
parent_id INTEGER,
updated TIMESTAMP);
CREATE INDEX comp_idx ON data (parent_id, updated);
DO $$BEGIN
FOR i IN 1..1000000 LOOP
INSERT INTO data (parent_id, updated) VALUES (i, '2020-01-01'::timestamp);
INSERT INTO data (parent_id, updated) VALUES (i, '2020-01-02'::timestamp);
INSERT INTO data (parent_id, updated) VALUES (i, '2020-01-03'::timestamp);
END LOOP;
END$$; ROW_NUMBERSELECT id, parent_id, updated FROM (
SELECT *, row_number() OVER (PARTITION BY parent_id ORDER BY updated) AS row
FROM data
) AS t
WHERE row = 1; DISTINCT ONSELECT DISTINCT ON (parent_id) id, parent_id, updated
FROM data
ORDER BY parent_id, updated; /cc @smitpatel for an interesting use of the PG-specific |
I've ended up just adding a column that does the job of DISTINCT ON (true if it can be searched, false if not) and updating that from elsewhere in the code, as it's way faster and I don't have to change my complicated search query into raw SQL. But yes, implementation of this would be a good addition, as there's many other areas I've had to use DISTINCT ON with raw SQL in my codebase. I would do it myself but I don't know how I would go about writing an extension for this unfortunately (can an extension class be made, or does it require modifying efcore.pg itself?), and my time at the moment is quite limited anyway. |
@neobenedict I definitely agree that if having an additional column is an option for you, it's probably the simplest/easiest way to fix this. Having EFCore.PG generate DISTINCT ON instead of ROW_NUMBER (where possible) is likely to be a non-trivial change... It could be done as an extension (replacing the stock EFCore.PG service which translates LINQ to SQL), but since this seems like superior SQL it's probably better to integrate this change into EFCore.PG itself (and there wouldn't be much difference in the work required). I'm unlikely to get around to doing this anytime soon - if you feel like giving this a stab please post back here so we know you're looking at it. Otherwise I'll probably get around to it at some point. And thanks again for posting your use case! I didn't have DISTINCT ON in mind as a possible replacement for EF Core's current ROW_NUMBER mechanism! |
DISTINCT ON and ROW_NUMBER are different things. And of course ROW_NUMBER is slow because it assigns number to each row, not just drop duplicated one.
In case postgre want to optimize the translation and not postprocess then it can pre-process the LINQ query shape to add a custom operation (or can also provide a LINQ method to users). And then translate it accordingly in provider. |
As far as I remember, I tried similar queries to this with .First(), .FirstOrDefault(), Last() and so on and always received a similar error:
I don't have the original queries for these any more but the specific problem was joining the distinct row back into the query. So I could do a FirstOrDefault() in a select but not a join. Here's (part of) the query in use now (from svt in finalQuery
join su in
Db.FriendSupports.Where(x => x.isMostRecent) on svt.userId equals su.userId
join users in Db.Users.Where(x => x.FriendId) on su.friendCode equals users.FriendId into ua
from ue in ua.DefaultIfEmpty()
join up in Db.FriendProfileOptions.Where(x => x.Region == 2) on ue.Id equals up.UserId into upa
from upo in upa.DefaultIfEmpty()
select new SupportPlayerPairing()
{
...
}); So it would be having to replace that x.isMostRecent with the distinct selector instead, which iirc created the above error. Since a true/false index is so much faster I'm keeping it either way. |
Reproduced a similar bug:
|
As I mentioned in my original comment that issue is in backlog, here is the link dotnet/efcore#12088 Postgre could consider an alternate translation of it based on "Distinct ON" and OrderBy. |
Ah, I see, missed your comment there, apologies. Just to clarify, this needs to be implemented into efcore first, before it can be added to efcore.pg? |
Based on above error message, a simple GroupBy - Group.FirstOrDefault clears pre-processing pipeline. So postgre could translate it outside of EF Core. If translation would be same then of course doing in EF Core makes more sense to avoid duplicating work. If translation is going to be very postgre specific then perhaps postgre can implement independently whenever. For raw |
DISTINCT ON definitely won't replace all uses of ROW_NUMBER, for sure. Apart from GroupBy().Select(FOD)) I was thinking about the below kind of pattern (mentioned above): var blogsWithTheirLatestPosts = ctx.Blogs.Select(b => new
{
Blog = b,
LatestPost = b.Posts.OrderBy(p => p.CreatedOn).Last()
}).ToList(); Do you see any issue with that?
Yep, that would be nice. I'll try to look at all this at some point. |
Then let's push .NET platform folks on dotnet/runtime#19522, dotnet/runtime#27665 and dotnet/runtime#27687. |
@YohDeadfall if would definitely be nice to have this as a first-class LINQ operator, but I don't think it's blocking for us - we could have the operator under EF.Functions for now. |
Yeah, but we have weighty arguments why this methods should be added. So while inventing a new EF function, we could help the community promoting the issues. |
Tiny note to @smitpatel: as mentioned in dotnet/runtime#27665 (comment), GroupBy - Group.FirstOrDefault can be simplified to Unfortunately it's not easy to add stuff to LINQ (not sure when something new made it in), but I agree it would be best to have this. |
Put another operator between GroupBy & Select and that falls apart. Anything is really easy to translate when you put things inside GroupBy's result selector. That is not the biggest issue in translations of GroupBy. |
Note: DistinctBy (and other *By operators) are going into .NET 6.0, dotnet/runtime#27687. |
Wow, that's great news. |
Note dotnet/efcore#27470 which tracks this on the EF side. For anyone stuck on this, with EF Core 6.0 you should be able to use the following equivalent construct: _ = blogs.GroupBy(b => b.Id).Select(g => g.First()); |
Marking as blocked - we should have the EF-side infrastructure for this operator before implementing the PG-specific optimization. |
EF Core currently pushes down a select expression into a subquery, since a projection would make the results different (SQL DISTINCT operates over the selected columns, whereas C# Distinct() operates on the entire entity).
PostgreSQL has a DISTINCT ON syntax which allows specifying which columns should participate, regardless of whether they're in the projection. We could simply do distinct on the key columns.
Note: LINQ now has DistinctBy
The text was updated successfully, but these errors were encountered: