-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change split query implementation to fetch dependents by keys, without reevaluating principal query #12776
Comments
Added to #12795 |
With #12098 , this may not be relevant for relational providers. This could be useful for cosmos providers when doing cross-collection includes |
@roji is this still relevant? |
Yeah, this is still relevant - it's a possible optimization for split query. In a nutshell, instead of embedding the principal query in the dependent query (can be very expensive), this would instead just use the keys fetched from the principal query to fetch the dependents. Note that this would make it impossible to batch the two queries, which is something we can do with the current approach (#10878). So there's a trade-off here between reevaluating the principal query twice and doing two roundtrips - IMHO reevaluating the principal query is pretty clearly worse than the additional roundtrip (given that it can be arbitrarily heavy, and overloads the database rather than just adding transfer times). We should probably sit down and think more strategically about where we want to go with related entity loading etc. Note also that this approach is likely the only one possible with at least some non-relational databases, where a complex subquery may not be possible. |
Note: an advantage of this approach is that there's no more need to specify fully deterministic ordering, since EF would be fetching the dependents directly by their IDs. |
Currently for collection include queries, we do split queries. In order to select only related data from 2nd query, we do inner join with first query. This works fine for all cases when first query is simple. If the first query has any client evaluation then in second query, we actually end up doing inner join on client. Which causes us to fetch the first table twice. Even in the absence of client eval, if the first query involves multiple joins due to filtering & ordering then we would recompute the same thing on server.
At least for the scenarios with single column FK, an alternative could be to use key values for filtering the related data table. It is a bargain between N + 1 queries (which uses single key value) & 2 queries which uses whole first query for key values.
Since we don't want to do N+1, this involves some buffering of results on client side. To give details with an example, suppose our default for buffering size is 100. Then for Customer-Orders query,
It gives benefit of reusing same SelectExpression for 2nd query multiple times without missing 2nd level Cache(#12777). We don't run out of memory because we would buffer only a chunk of results. For this example we would run N/100 + 1 queries to get all results.
For the scenarios described in first paragraph, it avoids all the issues they causes.
The text was updated successfully, but these errors were encountered: