encourage WHERE even if splitting #4044

DickBaker · 2022-09-20T13:04:56Z

the code examples will bring back all the Blogs & Posts rows. Better that examples have a WHERE Rating>=4 or somesuch to show how that will also be present in the 2nd SELECT split.
the "cartesian explosion" duplicates only applies when querying Posts with an INCLUDE("Blog") that will have dups of Blogs with multiple Posts, so I would suggest you revise the examples given.
surely the EF code generation should impose a DISTINCT in the split statements to suppress such dups?
I am surprised that the generated split code shown uses the b.BlogId and not the p.BlogId - yes I know they are the same per the ON clause but it rankles when you are really looking for all the p.* fields to match the EF entity definition [just requires more brain-cycles to validate the resulant match!]

Document details

⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

ID: 43baa9f3-c415-c328-6c10-53aabaaec652
Version Independent ID: 4e789a1b-aa88-d194-dc2e-0a23251250d7
Content: Single vs. Split Queries - EF Core
Content Source: entity-framework/core/querying/single-split-queries.md
Product: entity-framework
Technology: entity-framework-core
GitHub Login: @smitpatel
Microsoft Alias: avickers

roji · 2022-09-20T19:45:15Z

the code examples will bring back all the Blogs & Posts rows. Better that examples have a WHERE Rating>=4 or somesuch to show how that will also be present in the 2nd SELECT split.

Bringing back all Blogs & Posts or not seems very orthogonal to what's being explained here. My personal preference is to not burden the user with extra concepts (e.g. a WHERE clause, discussing of bringing back all or some rows), where a simpler sample will do; better to discuss one concept at a time. Besides, there certainly are valid scenarios where you really do want to bring back all the Blogs & Posts, so I'm not sure why we wouldn't show that here.

the "cartesian explosion" duplicates only applies when querying Posts with an INCLUDE("Blogs") that will have dups of Blogs with multiple Posts, so I would suggest you revise the examples given.

I don't follow. The sample shows a scenario where Blogs are queried, and their Posts are included. If the same Blog has two Posts, you get two rows back where the Blog properties are duplicated (that's "cartesian explosion"). What's missing or problematic?

surely the EF code generation should impose a DISTINCT in the split statements to suppress such dups?

Can you explain what you mean? The point is that if the same Blog has two Posts, you get two rows back where the Blog properties are duplicated. However, the Post portion of each row is different, so DISTINCT wouldn't do anything here (DISTINCT only works on entire rows).

DickBaker · 2022-09-20T20:08:52Z

the docs relate to an app doing
var allblogs=ctx.Blogs.Include("Posts").OrderBy(b=>b.BlogId).ToList()
and the Blogs columns would be repeated in the non-split case[you are correct]

I was thinking about the orthogonal [as you like that word!] case of
var allposts=ctx.Posts.Include("Blog").OrderBy(p=>p.PostId).ToList()
which likewise produces an NxM explosion for non-split. However in the split case I would expect gen code to be
SELECT [p].[PostId], [p].[AuthorId], [p].[BlogId], [p].[Content], [p].[Rating], [p].[Title]
FROM [Posts] AS [p]
ORDER BY [p].[PostId]

SELECT DISTINCT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url]
FROM [Blogs] AS [b]
INNER JOIN [Posts] AS [p] ON [p].[BlogId] = [b].[BlogId]
ORDER BY [b].[PostId]

Please note

I omitted the 2nd BlogId in result columns (superfluous)
note the Distinct so you don't get multiples of the Blog cols

Even if you choose to junk my other points (1-4) at the very least please change your example code from "Post" to "Posts" !

roji · 2022-09-20T20:29:43Z

However in the split case I would expect gen code to be [...]

Just to be sure I understand, is this a proposal for an alternative translation for split querying, rather than any docs suggestion?

at the very least please change your example code from "Post" to "Posts"

Sure, do you want to submit a PR for that?

DickBaker · 2022-09-20T21:22:44Z

I still think sticking a WHERE clause is justified so devs see gen'ed code [imho it doesn't distract any more than the OrderBy clause]
I haven't enabled logging [to see what current implementation code actually does] but I was concentrating on improving the docs rather than changing code implementation. I am about to go on short vacation so unfortunately can't explore code/docs further yet [ditto any PR for Posts] but will return in 1 week and will revisit if you haven't done it all by then [I will study #4044 with interest later!]. Thanks.

roji · 2022-09-21T09:04:49Z

SELECT DISTINCT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url]

@DickBaker I'm not sure what the DISTINCT operator is supposed to do here: aren't these rows already guaranteed to be distinct? This 2nd query just selects out the matching rows from [Blogs], there are already are no "duplicates" there...

However, I've opened dotnet/efcore#29174 for the idea of doing split queries when starting from Posts, thanks.

roji · 2022-09-21T15:31:21Z

Doing the Post->Posts change as part of #4051.

DickBaker · 2022-09-24T17:00:30Z

what about that "Post"->"Posts" promise in #4051 that Uncle Arthur approved?

that GH page still shows "Post". Maybe its still in the pipeline and yet to reach the spotlight?
https://github.com/dotnet/EntityFramework.Docs/blob/main/entity-framework/core/querying/single-split-queries.md#split-queries-1

Thanks for undertaking anyway as I didn't relish the responsibility to have raise a PR [no I don't fancy getting my name in lights as a past contributor!], especially as you are intimately involved in all the .NET MD syntax [that I am reticent about using as a novice]
https://learn.microsoft.com/en-us/contribute/dotnet/dotnet-style-guide

regards, Dick

DickBaker · 2022-09-24T17:05:31Z

DISTINCT ...

in the case that I raised arising from
var allposts=ctx.Posts.Include("Blog").OrderBy(p=>p.PostId).ToList()

where you have [say] 1 Blogs row and 2 child Posts the 2nd split would surely surface parent Blog data in duplicate (because of the innerjoin to the 2 child rows) ? hence my assertion that you need the DISTINCT operator. Did I miss something [sorry I still haven't cobbled up the LOG to answer the Q] ?

DickBaker · 2022-09-24T17:21:58Z

back on the subject of plural/singular table names [often argued by zealots to wee small hours, usually involving beer] we are talking about "Post" vs "Posts" [I have always favoured plural in db tables and "DbSet Posts;" with the actual class object being Post, and I agree with EF's singular for linktables like PostTag] ..

reading new issue #4053 and the related docs
https://github.com/dotnet/EntityFramework.Docs/blob/main/entity-framework/core/modeling/relationships.md

I feel that this definitive source of truth should broach that subject of plural/singular [or at lease have an offpage link to such discussion/rationale].

Yes you would be right if you called me a zealot, but I would shun the term bigot (no, don't ask me for precise definitions of those 2 terms!] But of course I would also be pleased to share beers with you at an opportune moment on this/any subject.

roji · 2022-09-24T17:56:11Z

Re DISTINCT, you're right, I somehow managed to misread your SQL - DISTINCT would indeed remove duplicate Blog rows when the same Blog has more than one Post. It's worth mentioning that DISTINCT adds quite a bit of extra work on the database - in many cases just getting the multiple rows may be more efficient (depending on the average cardinality and the size of the duplicated Blog propertieS).

Re Post/Posts, #4051 isn't yet merged (and even after it is merged, changes are only pushed to live every now and then). For such small changes, we generally recommend to just submit a PR - there's not even any actual markdown involved here.

DickBaker · 2022-09-24T22:00:54Z

Having agreed with the PostTag usage found here
https://github.com/dotnet/EntityFramework.Docs/blob/main/entity-framework/core/modeling/relationships.md

I now find a rival "PostTags" usage here
https://github.com/dotnet/EntityFramework.Docs/blob/main/samples/core/Modeling/Relationships/FluentAPI/ManyToManyShared.cs?name=ManyToManyShared

so I urge you to ensure consistent message. BTW I have found no explicit indication as to whether the PK of such a join table is (PostId,TagId) or (TagId,PostId) nor whether devs should consider enforcing an additional IX (TagId) or (PostId) respectively [nor how to specify this].

Likewise I am not satisfied that the docs adequately describe the link-table situations of (FKA,FKB) vs the additiional payload columns [and personally I have difficulty accepting Arthur's Dictionary<string,object> flattening as I suspect boxing and loose-typing [but this may be ignorance on my side!]

The changes made EF6->EFCore over complex types (e.g. Address) are also missing.

Re DISTINCT I strongly lean to making the db do the work rather than experience elongated network traffic plus client-side EF having to parse+apply each record. We are both experienced perf practitioners and professors of "it depends", but AFAIK the QP to add the Distinct may add another Sort step but judicial choice of IX (as prev para) should make this cheap.

Reminder that my comments largely to the docs rather than EF code (with possible exception of that DISTINCT wrinkle]
BTW I thoroughly support your Issue #29174 to the EF codebase

Your opinions welcome as ever!

roji · 2022-09-24T22:19:10Z

Re DISTINCT I strongly lean to making the db do the work rather than experience elongated network traffic plus client-side EF having to parse+apply each record.

I'd be very wary of such a sweeping statement; it's generally preferable to remove load from the database, since it's typically much easier to scale the app tier vs. the database tier (i.e. you usually have multiple app servers to a single database server). DISTINCT has quite an impact on how the database constructs the query plan, and generally requires lots of work (see dotnet/efcore#29171 for some similar thoughts on EF's injection of orderings for single query). Outside of EF, I'd carefully take into account actual cardinality and column sizes before adding a DISTINCT; unfortunately, with EF we must make a decision to either do it or not.

Regardless of all that, posting the multiple unrelated doc suggestions in the same issue makes it harder for us to track work and discuss each one. I'd urge opening different issues for different problems, and considering submitting PRs fixing issues, especially when they're small.

ajcvickers · 2022-10-12T20:57:56Z

Ping @roji

roji · 2023-01-09T10:47:46Z

I've gone through this, and I don't think there's anything actionable left here... The Post->Posts change has been merged and applied, and I've just submitted #4201 to generally provide more information on cartesian explosion vs. data duplication.

I do find interesting the idea of adding DISTINCT to the split query SQL, but I'm still not convinced that should be the default behavior. Data duplication doesn't tend to be a big problem - unless huge columns exist on the principal; and adding DISTINCT creates lots of extra work to the database, which in many (and possibly most) cases doesn't justify the network traffic optimization.

So I'm going to go ahead and close this, but as always am happy to discuss further.

DickBaker mentioned this issue Sep 20, 2022

piggyback split queries #4045

Closed

roji mentioned this issue Sep 20, 2022

Split query generation/docs [possible repeat] #4046

Closed

roji closed this as not planned Won't fix, can't repro, duplicate, stale Jan 9, 2023

roji added the closed-question label Jan 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

encourage WHERE even if splitting #4044

encourage WHERE even if splitting #4044

DickBaker commented Sep 20, 2022 •

edited

Loading

roji commented Sep 20, 2022

DickBaker commented Sep 20, 2022

roji commented Sep 20, 2022

DickBaker commented Sep 20, 2022 •

edited

Loading

roji commented Sep 21, 2022

roji commented Sep 21, 2022 •

edited

Loading

DickBaker commented Sep 24, 2022

DickBaker commented Sep 24, 2022

DickBaker commented Sep 24, 2022 •

edited

Loading

roji commented Sep 24, 2022

DickBaker commented Sep 24, 2022 •

edited

Loading

roji commented Sep 24, 2022 •

edited

Loading

ajcvickers commented Oct 12, 2022

roji commented Jan 9, 2023

encourage WHERE even if splitting #4044

encourage WHERE even if splitting #4044

Comments

DickBaker commented Sep 20, 2022 • edited Loading

Document details

roji commented Sep 20, 2022

DickBaker commented Sep 20, 2022

roji commented Sep 20, 2022

DickBaker commented Sep 20, 2022 • edited Loading

roji commented Sep 21, 2022

roji commented Sep 21, 2022 • edited Loading

DickBaker commented Sep 24, 2022

DickBaker commented Sep 24, 2022

DickBaker commented Sep 24, 2022 • edited Loading

roji commented Sep 24, 2022

DickBaker commented Sep 24, 2022 • edited Loading

roji commented Sep 24, 2022 • edited Loading

ajcvickers commented Oct 12, 2022

roji commented Jan 9, 2023

DickBaker commented Sep 20, 2022 •

edited

Loading

DickBaker commented Sep 20, 2022 •

edited

Loading

roji commented Sep 21, 2022 •

edited

Loading

DickBaker commented Sep 24, 2022 •

edited

Loading

DickBaker commented Sep 24, 2022 •

edited

Loading

roji commented Sep 24, 2022 •

edited

Loading