Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support SQL translation for .Net 6 Linq's MinBy/MaxBy Methods #25566

Open
Tracked by #25570
Webreaper opened this issue Aug 18, 2021 · 21 comments
Open
Tracked by #25570

Support SQL translation for .Net 6 Linq's MinBy/MaxBy Methods #25566

Webreaper opened this issue Aug 18, 2021 · 21 comments
Assignees
Labels
area-query customer-reported punted-for-7.0 Originally planned for the EF Core 7.0 (EF7) release, but moved out due to resource constraints. type-enhancement
Milestone

Comments

@Webreaper
Copy link

Webreaper commented Aug 18, 2021

I'm using .Net 6 and the new MinBy/MaxBy methods are great. However, there's no default translation to SQL for them - despite this being seemingly a very natural thing to be able to do.

I can do this: db.SomeDbSet.ToList().MinBy( x => x.SomeField ) but obviously the problem with that is that it's going to have to load the entire set into memory, and won't use any of the DB indexes etc, so will be slow.

Would be great if this could be added before EFCore 6 is released (or as an early preview for EFCore 7).

I'm using the SQLite provider, if it makes a difference.

@roji
Copy link
Member

roji commented Aug 18, 2021

Relevant runtime issue: dotnet/runtime#27687

We could translate like this:

SELECT * FROM employees WHERE salary = (SELECT MAX(salary) FROM employees);

Need to look into behavior when multiple source items have the same maximum value (we could use LIMIT 1/TOP 1 to get an arbitrary row, if that's the right behavior).

Opened #25570 to track support for all new LINQ features introduced in .NET 6.0.

@smitpatel
Copy link
Contributor

There are possibly other ways to write the same query as a work-around since these operators cannot be translated to SQL in direct native way.
(There exist OVER clause for Min/Max but they also compute Min/Max over a scalar value so the complex object which can specify the by part doesn't work)

@ajcvickers
Copy link
Contributor

Note from triage: putting this in 7.0 to investigate reasonable translations.

@AraHaan

This comment was marked as off-topic.

@roji

This comment was marked as off-topic.

@ajcvickers ajcvickers added propose-punt punted-for-7.0 Originally planned for the EF Core 7.0 (EF7) release, but moved out due to resource constraints. and removed propose-punt labels Jul 6, 2022
@ajcvickers ajcvickers modified the milestones: 7.0.0, Backlog Jul 7, 2022
@Timovzl
Copy link

Timovzl commented Aug 16, 2022

I think the following are worth taking into consideration for an implementation:

  • Avoid full table scans, since they are infeasible for many data sets. (PARTITION OVER loves to scan everything.)
  • Avoid self-joins, since they are infeasible for certain data sets (i.e. quadratic in group size, bad where any large group exists).
  • A dependent subquery, while not the fastest for every scenario, could help us guarantee a decent space/time complexity (requiring merely an obvious, suitable index).
  • If the model defines the appropriate index as UNIQUE, we can skip efforts to work around groups with multiple (i.e. tied) winners. In fact, we can use a simple and efficient GROUP BY + INNER JOIN in this case.
  • Without a UNIQUE index, if multiple winners occur, we need to know what to do. I could not find this in the documentation, but a simple test indicates that Enumerable.MinBy() will choose the first winner it encounters (but please check the source to be sure). Window functions might be a suitable way to mimic this behavior. This approach would allow the GROUP BY + INNER JOIN solution to be used even in the absence of a UNIQUE index.

For illustration, a UNIQUE index on { ParentId, CreationDateTime } is quite suitable for finding "the latest row for each parent". We can group by parent and then find the latest (MaxBy) element in each group, with the guarantee that each group will have a single latest element:

SELECT i.*

FROM
(
	SELECT ParentId, MAX(CreationDateTime) AS MaxCreationDateTime
	FROM Items
	GROUP BY ParentId
) AS Groups

INNER JOIN Items i
	ON i.ParentId = Groups.ParentId
	AND i.CreationDateTime = Groups.MaxCreationDateTime

-- Without a UNIQUE index, we would partition the result by ParentId and eliminate all non-first rows per partition

@roji
Copy link
Member

roji commented Aug 16, 2022

@Timovzl have you seen the proposal here?

@Timovzl
Copy link

Timovzl commented Aug 17, 2022

@Timovzl have you seen the proposal #25566 (comment)?

@roji Ah, yes, you had already proposed the dependent subquery. And to account for duplicate winners, we could adjust it to the following:

SELECT * FROM employees WHERE id = (SELECT TOP(1) id FROM employees ORDER BY salary DESC)

When writing this manually, I would deduplicate more deliberately, such as by changing the ordering to ORDER BY salary DESC, id DESC, but that should (sadly) be irrelevant for MaxBy.

This proposal is asymptotically optimal, but dependent subqueries do have a significant constant overhead per row they are applied to, which becomes very noticeable on larger data sets. I just did a whole bunch of experiments, and I think we can do better.

I'm going to use a slightly more complex example, where we want the best-earning employee per department, since MaxBy tends to be especially useful with GroupBy. Using dependent subqueries, we might write something like:

SELECT *
FROM employees e
GROUP BY e.department
HAVING e.id = (SELECT TOP(1) id FROM employees e2 WHERE e2.department = e.department ORDER BY e2.salary DESC)

GROUP BY + INNER JOIN to the rescue

Now let's see how we can get speedy joins instead of that pesky dependent subquery.

With a UNIQUE index, the solution is straightforward:

SELECT e.*
FROM
(
	SELECT department, MAX(salary) AS maxSalary
	FROM employees
	GROUP BY department
) AS groups
INNER JOIN employees e ON e.department = groups.department AND e.salary = groups.maxSalary -- Unique

With a non-UNIQUE index (or no index 😱), we could obviously materialize the results and deduplicate in code, but that seems too limiting. To deduplicate efficiently in SQL, we can group by [the thing we wish to deduplicate on], take the first primary key value of each group, and INNER JOIN the target table onto that.

SELECT eUnique.*
FROM
(
	SELECT MAX(e.Id) -- Can do MAX for MaxBy(), MIN for MinBy()
	FROM
	(
		SELECT department, MAX(salary) AS maxSalary
		FROM employees
		GROUP BY department
	) AS groups
	INNER JOIN employees e ON e.department = groups.department AND e.salary = groups.maxSalary -- Non-unique
	GROUP BY e.department, e.salary -- SELECT MAX/MIN removes duplicates here
) AS deduplicated
INNER JOIN employees eUnique ON eUnique.id = deduplicated.id

In fact, EF Core 6 just generated precisely such a query for me when I worked out the equivalent of MaxBy manually:

this.DbContext.Employees

	// Get the top Salary for each Department
	.GroupBy(instance => instance.Department)
	.Select(group => new { Department = group.Key, Salary = group.Max(instance => instance.Salary) })

	// Get the top Id for each { Department, TopSalary } pair
	.Join(this.DbContext.Employees, groupMax => groupMax, instance => new { instance.Department, instance.Salary }, (groupMax, instance) => instance)
	.GroupBy(instance => new { instance.Department, instance.Salary })
	.Select(group => group.Max(instance => instance.Id))

	// Get the Employees by Id
	.Join(this.DbContext.Employees, groupMax => groupMax, instance => instance.Id, (groupMax, instance) => instance);

When you try the solutions on a data set with a few thousand matching groups, you should see the dependent subquery take significantly more time.

Points of attention:

  • For multiple winners, is the chosen one arbitrary or are there rules? If there are rules, it might be hard to determine whether MIN or MAX is correct. If the database supports DESC indexes, we should invert the choice of MIN/MAX. There might be more scenarios. That's a little crazy. No problem if an arbitrary pick is fine, though.
  • For composite primary keys, we would need to repeat the GROUP BY ... SELECT MAX(Id) trick for each next part of the PK [not already included in the original grouping]. We could fall back to the dependent subquery proposal in that case.

@roji
Copy link
Member

roji commented Aug 17, 2022

This proposal is asymptotically optimal, but dependent subqueries do have a significant constant overhead per row they are applied to, which becomes very noticeable on larger data sets. I just did a whole bunch of experiments, and I think we can do better.

Can you please share your experiments? For a query such as the following:

SELECT * FROM employees WHERE id = (SELECT TOP(1) id FROM employees ORDER BY salary DESC)

... I'd expect the subquery to evaluate exactly once - there's no reference to any outer column so no reason to run it per outer row or anything.

I'm going to use a slightly more complex example, where we want the best-earning employee per department [...]

The query your wrote has HAVING, which filters out groups (departments), so I don't think it does what you want (FWIW you probably can't refer to employee ID in the HAVING clause either, since that's not available after aggregating the employees). Am I misunderstanding your sample?

In any case, the precise translation of an operator after GroupBy or without GroupBy may differ, so these are possibly two slightly different conversations.

Re multiple winners, we generally follow the LINQ to Objects behavior (i.e. whatever the enumerable version of MaxBy does), but within limits of pragmatism (sometimes replicating that behavior in SQL isn't feasible or very slow); we'd have to investigate this. I'm imagining that the implementation probably returns the first item with the maximum value based on the existing ordering; so if you do OrderBy(e => e.Foo).MaxBy(e => e.Bar), maybe you get the 1st element (as ordered by Foo) which has the highest Bar value in the set. If so, and if want to duplicate that behavior, we may be able to duplicate it with the subquery translation by pushing the OrderBy down into that subquery. But that requires more investigation (ORDER BY has a cost and we need to make sure we need to do it).

@roji
Copy link
Member

roji commented Aug 17, 2022

This proposal is asymptotically optimal, but dependent subqueries do have a significant constant overhead per row they are applied to, which becomes very noticeable on larger data sets. I just did a whole bunch of experiments, and I think we can do better.

Or here you were simply referring to the time taken to execute that subquery exactly once? If there's an index, then that should be negligible (for either the ordering version or the simpler MAX version). I'm also not sure why any other JOIN-based solution would be necessarily cheaper.

I'd suggest concentrating on the non-GroupBy version to start with, just so we're sure we're aligned on the basics.

@Timovzl
Copy link

Timovzl commented Aug 17, 2022

In any case, the precise translation of an operator after GroupBy or without GroupBy may differ, so these are possibly two slightly different conversations.

I couldn't lay my finger on precisely what the distinction was, but this is it. Indeed, everything I mentioned was from the perspective of MinBy/MaxBy after GroupBy.

I'd expect the subquery to evaluate exactly once - there's no reference to any outer column so no reason to run it per outer row or anything.

Good point. For the non-GroupBy version, there is only a single min/max, and the subquery should be fine.

The query your wrote has HAVING, which filters out groups (departments), so I don't think it does what you want (FWIW you probably can't refer to employee ID in the HAVING clause either, since that's not available after aggregating the employees). Am I misunderstanding your sample?

Oops! I had moved away from the subquery approach quite early on. Hence, the subquery example was off the top of my head, and it was way off. The correct one is more complex. I don't have an example handy, but I saw yesterday that EF writes GROUP BY followed by an OUTER APPLY that uses a subquery to find the min/max for each group.

I'm imagining that the implementation probably returns the first item with the maximum value based on the existing ordering [...]

With LINQ being stable, I have the same expectation.

Correct me if I'm wrong, but I also believe that the promise of being stable is not reflected in LINQ-to-SQL, since the "preexisting order" there normally depends on the index chosen. As such, picking an arbitrary winner is likely fine here, and saves complexity.

This proposal is asymptotically optimal, but dependent subqueries do have a significant constant overhead per row they are applied to [...]

Or here you were simply referring to the time taken to execute that subquery exactly once? If there's an index, then that should be negligible (for either the ordering version or the simpler MAX version). I'm also not sure why any other JOIN-based solution would be necessarily cheaper.

I don't fully understand why, but a subquery applied per row seems to be somehow treated in a "one by one" fashion, whereas a join seems to get executed in more of a batched fashion. So even though both the join and the subquery are O(log(N)) per row they are applied to (which is asymptotically optimal given a B-tree index), the subquery has a much larger constant overhead. It's as if the database is executing the subquery from the top of the tree each time the subquery is applied, while using a single set of hops through the tree to join the entire sequence. Although my explanation is pure interpretation, you should be able to easily observe the performance difference when returning a few thousand results from the respective queries. It's a factor of about 2 to 10, if I recall correctly.

I'd suggest concentrating on the non-GroupBy version to start with, just so we're sure we're aligned on the basics.

😅 My suggestions are probably irrelevant until the GroupBy version comes into view.

@roji
Copy link
Member

roji commented Aug 17, 2022

Correct me if I'm wrong, but I also believe that the promise of being stable is not reflected in LINQ-to-SQL, since the "preexisting order" there normally depends on the index chosen. As such, picking an arbitrary winner is likely fine here, and saves complexity.

When we say preexisting order, that usually only refers to an ordering explicitly established with ORDER BY; so there must be an OrderBy LINQ operator before the MaxBy to establish the order. If there isn't one, the order is unspecified and the query is non-deterministic; it's true that the index plays a role in the actual ordering of the rows, but that something that isn't guaranteed and cannot be relied upon.

To summarize, when there's no OrderBy in LINQ, I think we can definitely pick any arbitrary thing. If there's an OrderBy before the MaxBy, then we can optionally propagate it into the subquery, if we come to the conclusion that it's part of MaxBy's contract.

I don't fully understand why, but a subquery applied per row seems to be somehow treated in a "one by one" fashion, whereas a join seems to get executed in more of a batched fashion.

That sounds odd; I'd carefully test this and share concrete, comparative queries and their plans.

@Timovzl
Copy link

Timovzl commented Jan 17, 2023

That sounds odd; I'd carefully test [the claim of subqueries having a greater constant overhead than joins] and share concrete, comparative queries and their plans.

You're right. Based on further testing, I can now say that my earlier claims regarding subqueries having a greater constant overhead certainly do not apply to SQL Server. It works as you might expect: it interprets the query to understand what you want, and produces a plan. Whether you expressed the query as a join or a dependent subquery makes little difference to it.

I was originally trained on MySQL 5.6, so it is very possible that what I claimed applies to that alone. I have not yet checked the behavior on MySQL 8, although I'm quite curious.

@Timovzl
Copy link

Timovzl commented Jan 17, 2023

I did some further research and discovered that it is fairly challenging to get SQL Server to use an optimal plan for MAX, even with a query that looks ideal. I have yet to write a post about this, but I will share my discoveries for the benefit of this feature.

This work was inspired by this post, which shares some great insights into what's really going on and how to deal with it.

Consider the following straightforward solution:

INDEX IX_Orders_CustomerId_Id(CustomerId, Id)
this.DbContext.Orders

//.Where(x => customerIds.Contains(x.CustomerId)) // Optional

// Find the latest order per distinct customer
.GroupBy(x => x.CustomerId)
.Select(group => group.Max(x => x.Id))

// Join the entity
.Join(this.DbContext.Orders, id => id, instance => instance.Id, (id, instance) => instance);
SET STATISTICS PROFILE ON -- Help check index usage

SELECT [o].*
FROM (
  SELECT MAX(Id) AS [Id]
  FROM [Orders]
  GROUP BY [CustomerId]
) AS [max]
INNER JOIN [Orders] AS [o] ON [o].[Id] = [max].Id

Looks fine, right? Well, it's fine for MIN but not for MAX. SQL Server forward-scans the index, even though MAX was requested. That means that if a customer has 100,000 orders, we are traversing 100,000 orders instead of just 1 for that customer. We can see proof of this in the statistics profile, where the StmtText column indicates ORDERED FORWARD.

Using only basic building blocks, we can more clearly tell the optimizer what we want, even enforcing the expected index without actually referencing it by name:

this.DbContext.Orders

//.Where(x => customerIds.Contains(x.CustomerId)) // Optional

// Seek to each distinct customer
.Select(x => x.CustomerId)
.Distinct()

// For each customer, seek to the top row
.Select(customerId => this.DbContext.Orders
  .Where(x => x.CustomerId == customerId)
  .OrderByDescending(x => x.CustomerId).ThenByDescending(x => x.Id) // Enforce index and direction
  .First().Id)

// Join the entity
.Join(this.DbContext.Orders, id => id, instance => instance.Id, (id, instance) => instance);

EF produces roughly the following T-SQL:

SET STATISTICS PROFILE ON -- Help check index usage

SELECT [o].*
FROM (
  SELECT DISTINCT [CustomerId]
  FROM [Orders]
) AS [max]
INNER JOIN [Orders] AS [o] ON [o].[Id] =
(
  SELECT TOP(1) [o1].[Id]
  FROM [Orders] AS [o1]
  WHERE [o1].[CustomerId] = [max].[CustomerId]
  ORDER BY [o1].[CustomerId] DESC, [o1].[Id] DESC
)

Demo

I know the least about PostgreSQL, so I'm the least confident about its execution plan.

Observations

  • The statistics profile now shows the desired ORDERED BACKWARD.
  • This approach works equally well for MIN (OrderBy) and MAX (OrderByDescending).
  • The min/max value per group does not have to be unique.
  • This approach also works for composite keys.
  • This approach is likely to be interpreted correctly on many RDBMSes, because of the basic building blocks and how much clarity we are giving the optimizer.
  • The full ORDER BY is advisable. Not only does it help enforce the expected index and traversal direction, but MySQL's optimizer has a bug that, in subqueries, fails to recognize that CustomerId was made constant and thus requires it to be explicitly included in the ORDER BY.

For completion, here is an example using a composite key:

INDEX IX_Orders_CustomerId_WebsiteId_Sequence(CustomerId, WebsiteId, Sequence)
this.DbContext.Orders

//.Where(x => customerIds.Contains(x.CustomerId)) // Optional

// Seek to each distinct customer-website pair
.Select(x => new { x.CustomerId, x.WebsiteId })
.Distinct()

// For each customer-website pair, seek to the top row
.Select(tuple => new
{
  CustomerId = tuple.CustomerId,
  WebsiteId = tuple.WebsiteId,
  Sequence = this.DbContext.Orders
    .Where(x => x.CustomerId == tuple.CustomerId && x.WebsiteId == tuple.WebsiteId)
    .OrderByDescending(x => x.CustomerId).ThenByDescending(x => x.WebsiteId).ThenByDescending(x => x.Sequence) // Enforce index and direction
    .First().Sequence)
}

// Join the entity
.Join(
  this.DbContext.Orders,
  groupMax => groupMax,
  instance => new { instance.CustomerId, instance.WebsiteId, instance.Sequence, },
  (groupMax, instance) => instance);

EF produces roughly the following T-SQL:

SET STATISTICS PROFILE ON -- Help check index usage

SELECT [o].*
FROM (
  SELECT DISTINCT [CustomerId], [WebsiteId]
  FROM [Orders]
) AS [max]
INNER JOIN [Orders] AS [o]
  ON [o].[CustomerId] = [max].[CustomerId]
  AND [o].[WebsiteId] = [max].[WebsiteId]
  AND [o].[Sequence] =
  (
    SELECT TOP(1) [o1].[Sequence]
    FROM [Orders] AS [o1]
    WHERE [o1].[CustomerId] = [max].[CustomerId]
    AND [o1].[WebsiteId] = [max].[WebsiteId]
    ORDER BY [o1].[CustomerId] DESC, [o1].[WebsiteId] DESC, [o1].[Sequence] DESC
  )

@roji
Copy link
Member

roji commented Jan 19, 2023

Looks fine, right? Well, it's fine for MIN but not for MAX. SQL Server forward-scans the index, even though MAX was requested. That means that if a customer has 100,000 orders, we are traversing 100,000 orders instead of just 1 for that customer. We can see proof of this in the statistics profile, where the StmtText column indicates ORDERED FORWARD.

@Timovzl I can't reproduce what you're saying here - maybe there's some missing information (in this kind of issue it's always good to post exact schema, exact LINQ queries and SQLs, the exact data you seeded with, and the exact SQL plan that comes out).

Using the console program below to create the schema and seed (5 customers with 10000 orders each), I got the plan for your first SQL:

SET STATISTICS PROFILE ON;

SELECT [o].*
FROM (
    SELECT MIN(Id) AS [Id]
    FROM [Orders]
    GROUP BY [CustomerId]
) AS [max]
    INNER JOIN [Orders] AS [o] ON [o].[Id] = [max].Id

This gives me the following, which indeed goes through 50000 order rows:

+-----+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+------+------+--------------------+--------------------+---------------------------------------------------------------------------------------------+--------------------------------------------------------------+------------+----------+-----------+----------+----------------+--------------------------------------------------------------+--------+--------+--------+------------------+
|Rows |Executes|StmtText                                                                                                                                                                            |StmtId|NodeId|Parent|PhysicalOp          |LogicalOp           |Argument                                                                                     |DefinedValues                                                 |EstimateRows|EstimateIO|EstimateCPU|AvgRowSize|TotalSubtreeCost|OutputList                                                    |Warnings|Type    |Parallel|EstimateExecutions|
+-----+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+------+------+--------------------+--------------------+---------------------------------------------------------------------------------------------+--------------------------------------------------------------+------------+----------+-----------+----------+----------------+--------------------------------------------------------------+--------+--------+--------+------------------+
|5    |1       |SELECT [o].*                                                                                                                                                                        |1     |1     |0     |null                |null                |null                                                                                         |null                                                          |5           |null      |null       |null      |0.22964673      |null                                                          |null    |SELECT  |false   |null              |
|     |        |FROM (                                                                                                                                                                              |      |      |      |                    |                    |                                                                                             |                                                              |            |          |           |          |                |                                                              |        |        |        |                  |
|     |        |SELECT MIN(Id) AS [Id]                                                                                                                                                              |      |      |      |                    |                    |                                                                                             |                                                              |            |          |           |          |                |                                                              |        |        |        |                  |
|     |        |FROM [Orders]                                                                                                                                                                       |      |      |      |                    |                    |                                                                                             |                                                              |            |          |           |          |                |                                                              |        |        |        |                  |
|     |        |GROUP BY [CustomerId]                                                                                                                                                               |      |      |      |                    |                    |                                                                                             |                                                              |            |          |           |          |                |                                                              |        |        |        |                  |
|     |        |) AS [max]                                                                                                                                                                          |      |      |      |                    |                    |                                                                                             |                                                              |            |          |           |          |                |                                                              |        |        |        |                  |
|     |        |INNER JOIN [Orders] AS [o] ON [o].[Id] = [max].Id                                                                                                                                   |      |      |      |                    |                    |                                                                                             |                                                              |            |          |           |          |                |                                                              |        |        |        |                  |
|5    |1       |  |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1002]))                                                                                                                        |1     |2     |1     |Nested Loops        |Inner Join          |OUTER REFERENCES:([Expr1002])                                                                |null                                                          |5           |0         |0.0000209  |4043      |0.22964673      |[o].[Id], [o].[Name], [o].[CustomerId]                        |null    |PLAN_ROW|false   |1                 |
|5    |1       |       |--Stream Aggregate(GROUP BY:([test].[dbo].[Orders].[CustomerId]) DEFINE:([Expr1002]=MIN([test].[dbo].[Orders].[Id])))                                                       |1     |3     |2     |Stream Aggregate    |Aggregate           |GROUP BY:([test].[dbo].[Orders].[CustomerId])                                                |[Expr1002]=MIN([test].[dbo].[Orders].[Id])                    |5           |0         |0.0300025  |11        |0.21643265      |[Expr1002]                                                    |null    |PLAN_ROW|false   |1                 |
|50000|1       |       |    |--Index Scan(OBJECT:([test].[dbo].[Orders].[IX_Orders_CustomerId]), ORDERED FORWARD)                                                                                   |1     |4     |3     |Index Scan          |Index Scan          |OBJECT:([test].[dbo].[Orders].[IX_Orders_CustomerId]), ORDERED FORWARD                       |[test].[dbo].[Orders].[Id], [test].[dbo].[Orders].[CustomerId]|50000       |0.13127315|0.055157   |15        |0.18643014      |[test].[dbo].[Orders].[Id], [test].[dbo].[Orders].[CustomerId]|null    |PLAN_ROW|false   |1                 |
|5    |5       |       |--Clustered Index Seek(OBJECT:([test].[dbo].[Orders].[PK_Orders] AS [o]), SEEK:([o].[Id]=[Expr1002]) ORDERED FORWARD)                                                       |1     |5     |2     |Clustered Index Seek|Clustered Index Seek|OBJECT:([test].[dbo].[Orders].[PK_Orders] AS [o]), SEEK:([o].[Id]=[Expr1002]) ORDERED FORWARD|[o].[Id], [o].[Name], [o].[CustomerId]                        |1           |0.003125  |0.0001581  |4043      |0.013193183     |[o].[Id], [o].[Name], [o].[CustomerId]                        |null    |PLAN_ROW|false   |5                 |
+-----+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+------+------+--------------------+--------------------+---------------------------------------------------------------------------------------------+--------------------------------------------------------------+------------+----------+-----------+----------+----------------+--------------------------------------------------------------+--------+--------+--------+------------------+

Changing MAX to MIN gives me the same plan, still with 50000 rows and still with ORDERED FORWARD. The total subtree cost is 0.22964673 for both MAX and MIN variants.

Stepping back, if we concentrate on your grouping subquery:

SELECT MAX([o].[Id]) AS [c]
FROM [Orders] AS [o]
GROUP BY [o].[CustomerId]

Doing the same on the 2nd SQL you posted yields the following:

SELECT [o].*
FROM (
    SELECT DISTINCT [CustomerId]
    FROM [Orders]
) AS [max]
INNER JOIN [Orders] AS [o] ON [o].[Id] =
(
    SELECT TOP(1) [o1].[Id]
    FROM [Orders] AS [o1]
    WHERE [o1].[CustomerId] = [max].[CustomerId]
    ORDER BY [o1].[CustomerId] DESC, [o1].[Id] DESC
);
+-----+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+------+------+--------------------+--------------------+-------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------+-----------+-----------+----------+----------------+--------------------------------------+--------+--------+--------+------------------+
|Rows |Executes|StmtText                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |StmtId|NodeId|Parent|PhysicalOp          |LogicalOp           |Argument                                                                                                                                   |DefinedValues                         |EstimateRows|EstimateIO |EstimateCPU|AvgRowSize|TotalSubtreeCost|OutputList                            |Warnings|Type    |Parallel|EstimateExecutions|
+-----+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+------+------+--------------------+--------------------+-------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------+-----------+-----------+----------+----------------+--------------------------------------+--------+--------+--------+------------------+
|5    |1       |SELECT [o].*                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |1     |1     |0     |null                |null                |null                                                                                                                                       |null                                  |5           |null       |null       |null      |0.21931484      |null                                  |null    |SELECT  |false   |null              |
|     |        |FROM (                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |      |      |      |                    |                    |                                                                                                                                           |                                      |            |           |           |          |                |                                      |        |        |        |                  |
|     |        |SELECT DISTINCT [CustomerId]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |      |      |      |                    |                    |                                                                                                                                           |                                      |            |           |           |          |                |                                      |        |        |        |                  |
|     |        |FROM [Orders]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |      |      |      |                    |                    |                                                                                                                                           |                                      |            |           |           |          |                |                                      |        |        |        |                  |
|     |        |) AS [max]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |      |      |      |                    |                    |                                                                                                                                           |                                      |            |           |           |          |                |                                      |        |        |        |                  |
|     |        |INNER JOIN [Orders] AS [o] ON [o].[Id] =                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |      |      |      |                    |                    |                                                                                                                                           |                                      |            |           |           |          |                |                                      |        |        |        |                  |
|     |        |(                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |      |      |      |                    |                    |                                                                                                                                           |                                      |            |           |           |          |                |                                      |        |        |        |                  |
|     |        |SELECT TOP(1) [o1].[Id]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |      |      |      |                    |                    |                                                                                                                                           |                                      |            |           |           |          |                |                                      |        |        |        |                  |
|     |        |FROM [Orders] AS [o1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |      |      |      |                    |                    |                                                                                                                                           |                                      |            |           |           |          |                |                                      |        |        |        |                  |
|     |        |WHERE [o1].[CustomerId] = [max].[CustomerId]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |      |      |      |                    |                    |                                                                                                                                           |                                      |            |           |           |          |                |                                      |        |        |        |                  |
|     |        |ORDER BY [o1].[CustomerId] DESC, [o1].[Id] DESC                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |      |      |      |                    |                    |                                                                                                                                           |                                      |            |           |           |          |                |                                      |        |        |        |                  |
|     |        |)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |      |      |      |                    |                    |                                                                                                                                           |                                      |            |           |           |          |                |                                      |        |        |        |                  |
|5    |1       |  |--Nested Loops(Inner Join, OUTER REFERENCES:([o1].[Id]))                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |1     |2     |1     |Nested Loops        |Inner Join          |OUTER REFERENCES:([o1].[Id])                                                                                                               |null                                  |5           |0          |0.0000209  |4043      |0.21931484      |[o].[Id], [o].[Name], [o].[CustomerId]|null    |PLAN_ROW|false   |1                 |
|5    |1       |       |--Nested Loops(Inner Join, OUTER REFERENCES:([test].[dbo].[Orders].[CustomerId]))                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |1     |3     |2     |Nested Loops        |Inner Join          |OUTER REFERENCES:([test].[dbo].[Orders].[CustomerId])                                                                                      |null                                  |5           |0          |0.0000209  |11        |0.21537845      |[o1].[Id]                             |null    |PLAN_ROW|false   |1                 |
|5    |1       |       |    |--Stream Aggregate(GROUP BY:([test].[dbo].[Orders].[CustomerId]))                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |1     |4     |3     |Stream Aggregate    |Aggregate           |GROUP BY:([test].[dbo].[Orders].[CustomerId])                                                                                              |null                                  |5           |0          |0.0250025  |11        |0.21143265      |[test].[dbo].[Orders].[CustomerId]    |null    |PLAN_ROW|false   |1                 |
|50000|1       |       |    |    |--Index Scan(OBJECT:([test].[dbo].[Orders].[IX_Orders_CustomerId]), ORDERED FORWARD)                                                                                                                                                                                                                                                                                                                                                                                                                                                     |1     |5     |4     |Index Scan          |Index Scan          |OBJECT:([test].[dbo].[Orders].[IX_Orders_CustomerId]), ORDERED FORWARD                                                                     |[test].[dbo].[Orders].[CustomerId]    |50000       |0.13127315 |0.055157   |11        |0.18643014      |[test].[dbo].[Orders].[CustomerId]    |null    |PLAN_ROW|false   |1                 |
|5    |5       |       |    |--Top(TOP EXPRESSION:((1)))                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |1     |6     |3     |Top                 |Top                 |TOP EXPRESSION:((1))                                                                                                                       |null                                  |1           |0          |0.0000001  |11        |0.003924896     |[o1].[Id]                             |null    |PLAN_ROW|false   |5                 |
|5    |5       |       |         |--Index Seek(OBJECT:([test].[dbo].[Orders].[IX_Orders_CustomerId] AS [o1]), SEEK:([o1].[CustomerId]=[test].[dbo].[Orders].[CustomerId]) ORDERED BACKWARD)                                                                                                                                                                                                                                                                                                                                                                                |1     |8     |6     |Index Seek          |Index Seek          |OBJECT:([test].[dbo].[Orders].[IX_Orders_CustomerId] AS [o1]), SEEK:([o1].[CustomerId]=[test].[dbo].[Orders].[CustomerId]) ORDERED BACKWARD|[o1].[Id], [o1].[CustomerId]          |1           |0.028310185|0.011157   |15        |0.003923896     |[o1].[Id], [o1].[CustomerId]          |null    |PLAN_ROW|false   |5                 |
|5    |5       |       |--Clustered Index Seek(OBJECT:([test].[dbo].[Orders].[PK_Orders] AS [o]), SEEK:([o].[Id]=[test].[dbo].[Orders].[Id] as [o1].[Id]) ORDERED FORWARD)                                                                                                                                                                                                                                                                                                                                                                                                 |1     |9     |2     |Clustered Index Seek|Clustered Index Seek|OBJECT:([test].[dbo].[Orders].[PK_Orders] AS [o]), SEEK:([o].[Id]=[test].[dbo].[Orders].[Id] as [o1].[Id]) ORDERED FORWARD                 |[o].[Id], [o].[Name], [o].[CustomerId]|1           |0.003125   |0.0001581  |4043      |0.0039155       |[o].[Id], [o].[Name], [o].[CustomerId]|null    |PLAN_ROW|false   |5                 |
+-----+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+------+------+--------------------+--------------------+-------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------+-----------+-----------+----------+----------------+--------------------------------------+--------+--------+--------+------------------+

This still uses the same index, ORDERED FORWARD, and processes 50000 rows. The total subtree cost is 0.21931484, which is indeed better than the above, but doesn't seem to be significant in the way you said above (processing all rows vs. only one per customer).

I'm guessing I've somehow set things up differently than you, so if you'd like to provide the full details I can look again.

Repro program
await using var ctx = new BlogContext();

// Create and seed
// await ctx.Database.EnsureDeletedAsync();
// await ctx.Database.EnsureCreatedAsync();
//
// for (var i = 0; i < 5; i++)
// {
//     var customer = new Customer
//     {
//         Orders = new List<Order>(Enumerable.Range(0, 10_000).Select(_ => new Order()))
//     };
//     ctx.Customers.Add(customer);
// }
//
// ctx.SaveChanges();

// Query1
_ = ctx.Orders
    .GroupBy(x => x.CustomerId)
    .Select(group => group.Max(x => x.Id))
    .Join(ctx.Orders, id => id, instance => instance.Id, (id, instance) => instance)
    .ToArray();

// Query2
_ = ctx.Orders
    .Select(x => x.CustomerId)
    .Distinct()
    .Select(customerId => ctx.Orders
        .Where(x => x.CustomerId == customerId)
        .OrderByDescending(x => x.CustomerId).ThenByDescending(x => x.Id)
        .First().Id)
    .Join(ctx.Orders, id => id, instance => instance.Id, (id, instance) => instance)
    .ToArray();

public class BlogContext : DbContext
{
    public DbSet<Order> Orders { get; set; }
    public DbSet<Customer> Customers { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        => optionsBuilder
            .UseSqlServer(@"Server=localhost;Database=test;User=SA;Password=Abcd5678;Connect Timeout=60;ConnectRetryCount=0;Encrypt=false")
            // .UseNpgsql(@"Host=localhost;Username=test;Password=test")
            .LogTo(Console.WriteLine, LogLevel.Information)
            .EnableSensitiveDataLogging();
}

public class Order
{
    public int Id { get; set; }
    public string? Name { get; set; }

    public int CustomerId { get; set; }
    public Customer Customer { get; set; }
}

public class Customer
{
    public int Id { get; set; }
    public string? Name { get; set; }

    public List<Order> Orders { get; set; }
}

@Timovzl
Copy link

Timovzl commented Jan 19, 2023

I think the SQL Fiddle for SQL Server I provided before should have been to reproduce things, right? Ok, except for the LINQ queries, I admit. In any case, your repro is perfectly suitable.

In your first posted execution plan, this line is the potential problem:

|50000|1       |       |    |--Index Scan(OBJECT:([test].[dbo].[Orders].[IX_Orders_CustomerId]), ORDERED FORWARD)                                                                                   |1     |4     |3     |Index Scan          |Index Scan          |OBJECT:([test].[dbo].[Orders].[IX_Orders_CustomerId]), ORDERED FORWARD                       |[test].[dbo].[Orders].[Id], [test].[dbo].[Orders].[CustomerId]|50000       |0.13127315|0.055157   |15        |0.18643014      |[test].[dbo].[Orders].[Id], [test].[dbo].[Orders].[CustomerId]|null    |PLAN_ROW|false   |1                 |

It will do a forward scan on the index. If one or many of the Customers whose Orders we're acquiring have loads of Orders, then we will be scanning all of those Orders until we find the max. At 10,000 Orders per Customer, we're traversing 10,000 rows per Customer instead of 1. This prevents the query from being linearithmic in the number of Customers queried (our asymptotic optimum).

Your second posted execution plan actually solved that problem! 😄

|5    |5       |       |         |--Index Seek(OBJECT:([test].[dbo].[Orders].[IX_Orders_CustomerId] AS [o1]), SEEK:([o1].[CustomerId]=[test].[dbo].[Orders].[CustomerId]) ORDERED BACKWARD)                                                                                                                                                                                                                                                                                                                                                                                |1     |8     |6     |Index Seek          |Index Seek          |OBJECT:([test].[dbo].[Orders].[IX_Orders_CustomerId] AS [o1]), SEEK:([o1].[CustomerId]=[test].[dbo].[Orders].[CustomerId]) ORDERED BACKWARD|[o1].[Id], [o1].[CustomerId]          |1           |0.028310185|0.011157   |15        |0.003923896     |[o1].[Id], [o1].[CustomerId]          |null    |PLAN_ROW|false   |5                 |

For the pain point discussed above, we are now doing an index seek with ORDERED BACKWARD, thus arriving immediately at the desired Order row for each relevant Customer.

So why, you may ask, is there still an index scan with ORDERED FORWARD in that plan? Well, for two reasons:

  • We are querying the Orders for all Customers. When querying just a subset, you should see this turn into an index seek as well (still ORDERED FORWARD, but for this step that is of no consequence).
  • The number of Orders per Customer is likely low, so an index scan is faster than trying to seek to each Customer. However, if there were a billion Orders spread out over a mere 10,000 Customers, then seeking would be much, much faster, and you should see this turn into an index seek. (I keep forgetting which is low and which is high cardinality.)

As a side note, when querying Orders for all Customers, we could have obtained the distinct CustomerIds more easily, by simply scanning the PK of the Customers table (which is likely far more compact than the Orders table). But I digress - that was not the scan that was affecting the time complexity.

@Timovzl
Copy link

Timovzl commented Aug 13, 2023

I've furthered my research on the above proposal for solving the groupwise min/max problem. I've managed to simplify it and confirm it to be efficient (even with composite keys and non-unique min/max values) on not just SQL Server, but also MySQL.

Building Blocks

An RDBMS need only support the following (fairly basic) combination of features:

  • Indexed GROUP BY.
  • Indexed SELECT with ORDER BY [ASC/DESC] and TOP/LIMIT.
  • Dependent subqueries.

Example

To make things concrete, let's say we're finding the max (i.e. greatest/latest) Order per Customer. We choose groupwise-max because it tends to be harder than groupwise-min, needing reverse seeks to guarantee efficiency.

SELECT (
	-- Find ID of latest order per group
	-- Using index { CustomerId, CreationDateTime, [Id] }
	SELECT TOP 1 o.Id
	FROM Orders o
	WHERE o.CustomerId = groups.CustomerId -- Match the group
	AND o.CreationDateTime < '2023-01-01' -- Optional condition (indexed)
	AND o.IsRareExclusion = 0 -- Optional condition (non-indexed, i.e. scan until hit)
	ORDER BY o.CustomerId DESC, o.CreationDateTime DESC, o.Id DESC
) AS MaxOrderId
-- Find relevant group keys
-- Using index { CustomerId, [...] }
FROM Orders groups
WHERE groups.CustomerId >= 500 -- Optional condition (indexed)
GROUP BY groups.CustomerId

-- From here, we can use a simple join to get the entire entity
-- This is advisable, so that any followup syntax does not intermingle with the above
;

Or expressed in LINQ, and available to us already:

this.DbContext.Orders

// Find each distinct customer with ID >= 500
.Where(x => x.CustomerId >= 500) // Optional condition (indexed)
.GroupBy(x => x.CustomerId)

// For each customer, find its latest order of before 2023
.Select(group => new
{
  CustomerId = group.CustomerId,
  MaxOrderId = this.DbContext.Orders
    .Where(x => x.CustomerId == group.CustomerId) // Match the group
    .Where(x => x.CreationDateTime < new DateTime(2023, 01, 01)) // Optional condition (indexed)
    .Where(x => !x.IsRareExclusion) // Optional condition (non-indexed, i.e. scan until hit)
    // Using index { CustomerId, CreationDateTime, [Id] }
    .OrderByDescending(x => x.CustomerId)
    .ThenByDescending(x => x.CreationDateTime)
    .ThenByDescending(x => x.Id)
    .First().Id,
})

// From here, we can use a simple join to get the entire entity
// This is advisable, so that any followup syntax does not intermingle with the above

The LINQ equivalent with MaxBy could look like this:

this.DbContext.Orders

// Find each distinct customer with ID >= 500
.Where(x => x.CustomerId >= 500) // Optional condition (indexed)
.GroupBy(x => x.CustomerId)

// For each customer, find its latest order of before 2023
.Select(group => group
  .Where(x => x.CreationDateTime < new DateTime(2023, 01, 01)) // Optional condition (indexed)
  .Where(x => !x.IsRareExclusion) // Optional condition (non-indexed, i.e. scan until hit)
  .MaxBy(x => x.CreationDateTime)); // Or explicitly: new { x.CreationDateTime, x.Id }

Or, in a minimal example:

var latestOrderPerCustomer = await this.DbContext.Orders
  .GroupBy(x => x.CustomerId)
  .Select(group => group.MaxBy(x => x.CreationDateTime))
  .ToListAsync();

Performance

As indicated by the execution plan explanations and confirmed on large data sets, the resulting plans are asymptotically optimal. A reverse seek (for groupwise-max) starts at the end of each group, whereas a forward seek (for groupwise-min) starts at the front. We always find the target (or lack thereof) without scanning irrelevant data. This is important, because groups may be very large, such as when a customer can have millions of orders.

Note that MySQL 8 correctly indicates Using index for the dependent subquery, whereas MySQL 5.7 shows a worrisome Using where; Using index. However, both perform as intended, even on huge data sets (100M+ rows) and with very, very large groups.

Explanation

  1. Use GROUP BY to find the desired group keys, e.g. customers.
    • WHERE clauses can be used to filter, e.g. "only active customers".
    • Getting the group keys from another table, such as the Customers table, works just as well.
    • The developer should arrange for the obvious index, taking potential WHERE clauses into account.
  2. Use a dependent subquery to be able to find a single, highly specific row for each group key.
  3. In the dependent subquery, use SELECT ... ORDER BY ... with a TOP 1 to give the RDBMS clear instructions on what to fetch using which index and which seek direction.
    • WHERE clauses can be used to filter, e.g. "only orders from this year".
    • The developer should arrange for a covering index on the entire set of ordering clauses, taking potential WHERE clauses into account.
    • The ascending/descending specifiers should either all match the index's order (ASC by default) or all mismatch it. That way, the ordering represents a forward or backward scan of the index.
    • Composite group keys are no problem: WHERE o.GroupKeyA = group.GroupKeyA AND o.GroupKeyB = group.GroupKeyB.
    • Although ORDER BY o.CustomerId DESC seems superfluous (because o.CustomerId = group.CustomerId), it (A) makes the target index more obvious, and (B) works around a bug in MySQL's optimizer, requiring this to recognize the index if done from a subquery.

MinBy/MaxBy Without GroupBy

Implementing MinBy and MaxBy outside the context of a GroupBy seems fairly trivial. I've only addressed them in the context of GroupBy because that is the hard part, and it's a common use case which they are perfect for addressing.

Further Research

I have not yet investigated how to query the top N (instead of top 1) items per group. It's probably beyond the scope of the current issue. For future reference, one approach that comes to mind is using two dependent subqueries to add two columns to the row: one to get the first matching item per group and one to get the last one. Then, a join could fetch all rows between each min-max pair.

Let me know if anything else is missing.

@roji
Copy link
Member

roji commented Aug 13, 2023

@Timovzl thanks for the above research, it looks very helpful! To set expectations, we're currently heads-down in an intensive EF 8.0 work cycle, finishing feature work for rc1 and starting stabilization. It will be some time before we're able to look at this.

What can be really helpful is if you could share your exact testing/benchmarking scenarios above - a snippet creating the database schema dn seeding it database with data, as well as the concrete query SQL you're proposing here as the MinBy/MaxBy translations.

@Timovzl
Copy link

Timovzl commented Aug 15, 2023

Sure, no expectations about the timeline!

As for testing/benchmarking code, I'll do my best. The concrete usage and queries I can already propose.

@Timovzl
Copy link

Timovzl commented Aug 15, 2023

Implementation Proposal

Ideally, we will translate the MinBy/MaxBy into "lower", already supported expressions. These get translated to SQL correctly even now. I'm actually using precisely these expression in production with SQL Server. With the translations involving only expression juggling, no provider-specific work will be required.

Scenario 1: Ungrouped MaxBy

Translate MaxBy(Selector) into OrderByDecending(Selector).Take(1). Note that Selector could be a tuple, requiring a chain of ThenByDescending().

Simplest form

Using index CreationDateTime, [Id].

// Usage
this.DbContext.Orders
	.MaxBy(x => x.CreationDateTime);
// Proposed translation to lower expressions
this.DbContext.Orders
	.OrderByDescending(x => x.CreationDateTime)
	.Take(1);
-- Approximation of expected SQL
SELECT TOP(1) o.*
FROM Orders o
ORDER BY o.CreationDateTime DESC

With complexities

The actual translation still only has to deal with the tuple in MaxBy. The rest is just distractions.

Using index IsDeleted, CreationDateTime, [Id].

// Usage
this.DbContext.Orders
	.Where(x => !x.IsDeleted)
	.Where(x => x.CreationDateTime < new DateTime(2023, 01, 01)
	.MaxBy(x => new { x.IsDeleted, x.CreationDateTime, x.Id }) // Tip for user: Include x.IsDeleted to match index explicitly
	.Select(x => x.Id);
// Proposed translation to lower expressions
this.DbContext.Orders
	.Where(x => !x.IsDeleted)
	.Where(x => x.CreationDateTime < new DateTime(2023, 01, 01)
	.OrderByDescending(x => x.IsDeleted)
	.ThenByDescending(x => x.CreationDateTime)
	.ThenByDescending(x => x.Id)
	.Take(1)
	.Select(x => x.Id);
-- Approximation of expected SQL
SELECT TOP(1) o.Id
FROM Orders o
WHERE o.IsDeleted = 0
AND o.CreationDateTime < '2023-01-01'
ORDER BY o.IsDeleted DESC, o.CreationDateTime DESC, o.Id DESC

Scenario 2: Grouped MaxBy

This one is better understood from the code, but here is the theory: Recognize a MaxBy inside a Select, where the latter is taking an IGrouping<TKey, TElement> as its source. Translate MaxBy(Selector) into Where(MatchesGroupKey).OrderByDecending(GroupKey).ThenByDescending(Selector).First().Id. Additionally, directly after the Select, perform a Join to obtain the winning rows from their IDs. This has the added benefit of isolating the intricacies of this part of the query, making followup user syntax like Select, Join, or OrderBy work, and without breaking the intended plan.

Simplest form

Using index CustomerId, CreationDateTime, [Id].

// Usage
this.DbContext.Orders
	.GroupBy(x => x.CustomerId)
	.Select(group => group.MaxBy(x => x.CreationDateTime));
// Proposed translation to lower expressions
this.DbContext.Orders
	.GroupBy(x => x.CustomerId)
	.Select(group => this.DbContext.Orders
		.Where(x => x.CustomerId == group.CustomerId)
		.OrderByDescending(x => x.CustomerId)
		.ThenByDescending(x => x.CreationDateTime)
		.First().Id)
	.Join(this.DbContext.Orders, id => id, instance => instance.Id, (id, instance) => instance);
-- Approximation of expected SQL

SELECT o.*

FROM (
	SELECT (
		SELECT TOP(1) o.Id
		FROM Orders o
		WHERE o.CustomerId = groups.CustomerId
		ORDER BY o.CustomerId DESC, o.CreationDateTime DESC
	) AS MaxId
	FROM Orders groups
	GROUP BY groups.CustomerId
) AS Maxes

INNER JOIN Orders o ON o.Id = Maxes.MaxId
;

Having the translation always emit a left-complete ordering (including the CustomerId that was made constant by the Where) helps to (A) clarify the intended index and (B) work around a MySQL optimizer bug that, in subqueries, won't recognize the appropriate index without it.

Notably, in Select(group => [...].MaxBy(x => [...])), MaxBy must be the final expression inside the Select. Attempting something like .Select(group => group.MaxBy(x => x.CreationDateTime).SomeOtherProperty) should result in an untranslatable query. It would prevent us from taking the ID and appending the Join. This constraint should be acceptable: the only use case I can think of is selecting a single property instead of the entire entity. That can still be achieved by adding Select(x => x.SomeOtherProperty at the end of the query.

With complexities

To maximize complexity, we'll use a composite group key (CustomerId, ShopId) and a composite selector (IsDeleted, CreationDateTime). We'll also add some conditions, such as one that cuts of a time window using the index, and another that scans over a few mismatching items.

Using index CustomerId, ShopId, IsDeleted, CreationDateTime, [Id].

// Usage
this.DbContext.Orders
	.Where(x => x.CustomerId > 1000)
	.GroupBy(x => new { x.CustomerId, ShopId })
	.Select(group => group
		.Where(x => !x.IsDeleted)
		.Where(x => x.CreationDateTime < new DateTime(2023, 01, 01))
		.Where(x => !x.IsRareExclusion) // Scan over rare exclusions when finding group max (non-indexed)
		.MaxBy(x => new { x.IsDeleted, x.CreationDateTime }));
// Proposed translation to lower expressions
this.DbContext.Orders
	.Where(x => x.CustomerId > 1000)
	.GroupBy(x => new { x.CustomerId, ShopId })
	.Select(group => this.DbContext.Orders
		.Where(x => !x.IsDeleted) // User condition (indexed)
		.Where(x => x.CreationDateTime < new DateTime(2023, 01, 01)) // User condition (indexed)
		.Where(x => !x.IsRareExclusion) // User condition (non-indexed)
		.Where(x => x.CustomerId == group.CustomerId && x.ShopId == group.ShopId) // Group condition
		.OrderByDescending(x => x.CustomerId)
		.ThenByDescending(x => x.ShopId)
		.ThenByDescending(x => x.IsDeleted)
		.ThenByDescending(x => x.CreationDateTime)
		.First().Id)
	.Join(this.DbContext.Orders, id => id, instance => instance.Id, (id, instance) => instance);
-- Approximation of expected SQL

SELECT o.*

FROM (
	SELECT (
		SELECT TOP(1) o.Id
		FROM Orders o
		WHERE o.IsDeleted = 0
		AND o.CreationDateTime < '2023-01-01'
		AND o.IsRareExclusion = false
		AND o.CustomerId = groups.CustomerId AND o.ShopId = group.ShopId
		ORDER BY o.CustomerId DESC, o.ShopId DESC, o.IsDeleted DESC, o.CreationDateTime DESC
	) AS MaxId
	FROM Orders groups
	WHERE groups.CustomerId > 1000
	GROUP BY groups.CustomerId
) AS Maxes

INNER JOIN Orders o ON o.Id = Maxes.MaxId
;

@roji
Copy link
Member

roji commented Aug 15, 2023

Thanks @Timovzl!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-query customer-reported punted-for-7.0 Originally planned for the EF Core 7.0 (EF7) release, but moved out due to resource constraints. type-enhancement
Projects
None yet
Development

No branches or pull requests

6 participants