Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encourage WHERE even if splitting #4044

Closed
DickBaker opened this issue Sep 20, 2022 · 14 comments
Closed

encourage WHERE even if splitting #4044

DickBaker opened this issue Sep 20, 2022 · 14 comments

Comments

@DickBaker
Copy link
Contributor

DickBaker commented Sep 20, 2022

  1. the code examples will bring back all the Blogs & Posts rows. Better that examples have a WHERE Rating>=4 or somesuch to show how that will also be present in the 2nd SELECT split.
  2. the "cartesian explosion" duplicates only applies when querying Posts with an INCLUDE("Blog") that will have dups of Blogs with multiple Posts, so I would suggest you revise the examples given.
  3. surely the EF code generation should impose a DISTINCT in the split statements to suppress such dups?
  4. I am surprised that the generated split code shown uses the b.BlogId and not the p.BlogId - yes I know they are the same per the ON clause but it rankles when you are really looking for all the p.* fields to match the EF entity definition [just requires more brain-cycles to validate the resulant match!]

Document details

Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

@roji
Copy link
Member

roji commented Sep 20, 2022

the code examples will bring back all the Blogs & Posts rows. Better that examples have a WHERE Rating>=4 or somesuch to show how that will also be present in the 2nd SELECT split.

Bringing back all Blogs & Posts or not seems very orthogonal to what's being explained here. My personal preference is to not burden the user with extra concepts (e.g. a WHERE clause, discussing of bringing back all or some rows), where a simpler sample will do; better to discuss one concept at a time. Besides, there certainly are valid scenarios where you really do want to bring back all the Blogs & Posts, so I'm not sure why we wouldn't show that here.

the "cartesian explosion" duplicates only applies when querying Posts with an INCLUDE("Blogs") that will have dups of Blogs with multiple Posts, so I would suggest you revise the examples given.

I don't follow. The sample shows a scenario where Blogs are queried, and their Posts are included. If the same Blog has two Posts, you get two rows back where the Blog properties are duplicated (that's "cartesian explosion"). What's missing or problematic?

surely the EF code generation should impose a DISTINCT in the split statements to suppress such dups?

Can you explain what you mean? The point is that if the same Blog has two Posts, you get two rows back where the Blog properties are duplicated. However, the Post portion of each row is different, so DISTINCT wouldn't do anything here (DISTINCT only works on entire rows).

@DickBaker
Copy link
Contributor Author

the docs relate to an app doing
var allblogs=ctx.Blogs.Include("Posts").OrderBy(b=>b.BlogId).ToList()
and the Blogs columns would be repeated in the non-split case[you are correct]

I was thinking about the orthogonal [as you like that word!] case of
var allposts=ctx.Posts.Include("Blog").OrderBy(p=>p.PostId).ToList()
which likewise produces an NxM explosion for non-split. However in the split case I would expect gen code to be
SELECT [p].[PostId], [p].[AuthorId], [p].[BlogId], [p].[Content], [p].[Rating], [p].[Title]
FROM [Posts] AS [p]
ORDER BY [p].[PostId]

SELECT DISTINCT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url]
FROM [Blogs] AS [b]
INNER JOIN [Posts] AS [p] ON [p].[BlogId] = [b].[BlogId]
ORDER BY [b].[PostId]

Please note

  1. I omitted the 2nd BlogId in result columns (superfluous)
  2. note the Distinct so you don't get multiples of the Blog cols

Even if you choose to junk my other points (1-4) at the very least please change your example code from "Post" to "Posts" !

@roji
Copy link
Member

roji commented Sep 20, 2022

However in the split case I would expect gen code to be [...]

Just to be sure I understand, is this a proposal for an alternative translation for split querying, rather than any docs suggestion?

at the very least please change your example code from "Post" to "Posts"

Sure, do you want to submit a PR for that?

@DickBaker
Copy link
Contributor Author

DickBaker commented Sep 20, 2022

I still think sticking a WHERE clause is justified so devs see gen'ed code [imho it doesn't distract any more than the OrderBy clause]
I haven't enabled logging [to see what current implementation code actually does] but I was concentrating on improving the docs rather than changing code implementation. I am about to go on short vacation so unfortunately can't explore code/docs further yet [ditto any PR for Posts] but will return in 1 week and will revisit if you haven't done it all by then [I will study #4044 with interest later!]. Thanks.

@roji
Copy link
Member

roji commented Sep 21, 2022

SELECT DISTINCT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url]

@DickBaker I'm not sure what the DISTINCT operator is supposed to do here: aren't these rows already guaranteed to be distinct? This 2nd query just selects out the matching rows from [Blogs], there are already are no "duplicates" there...

However, I've opened dotnet/efcore#29174 for the idea of doing split queries when starting from Posts, thanks.

@roji
Copy link
Member

roji commented Sep 21, 2022

Doing the Post->Posts change as part of #4051.

@DickBaker
Copy link
Contributor Author

what about that "Post"->"Posts" promise in #4051 that Uncle Arthur approved?

that GH page still shows "Post". Maybe its still in the pipeline and yet to reach the spotlight?
https://github.com/dotnet/EntityFramework.Docs/blob/main/entity-framework/core/querying/single-split-queries.md#split-queries-1

Thanks for undertaking anyway as I didn't relish the responsibility to have raise a PR [no I don't fancy getting my name in lights as a past contributor!], especially as you are intimately involved in all the .NET MD syntax [that I am reticent about using as a novice]
https://learn.microsoft.com/en-us/contribute/dotnet/dotnet-style-guide

regards, Dick

@DickBaker
Copy link
Contributor Author

DISTINCT ...

in the case that I raised arising from
var allposts=ctx.Posts.Include("Blog").OrderBy(p=>p.PostId).ToList()

where you have [say] 1 Blogs row and 2 child Posts the 2nd split would surely surface parent Blog data in duplicate (because of the innerjoin to the 2 child rows) ? hence my assertion that you need the DISTINCT operator. Did I miss something [sorry I still haven't cobbled up the LOG to answer the Q] ?

@DickBaker
Copy link
Contributor Author

DickBaker commented Sep 24, 2022

back on the subject of plural/singular table names [often argued by zealots to wee small hours, usually involving beer] we are talking about "Post" vs "Posts" [I have always favoured plural in db tables and "DbSet Posts;" with the actual class object being Post, and I agree with EF's singular for linktables like PostTag] ..

reading new issue #4053 and the related docs
https://github.com/dotnet/EntityFramework.Docs/blob/main/entity-framework/core/modeling/relationships.md

I feel that this definitive source of truth should broach that subject of plural/singular [or at lease have an offpage link to such discussion/rationale].

Yes you would be right if you called me a zealot, but I would shun the term bigot (no, don't ask me for precise definitions of those 2 terms!] But of course I would also be pleased to share beers with you at an opportune moment on this/any subject.

@roji
Copy link
Member

roji commented Sep 24, 2022

Re DISTINCT, you're right, I somehow managed to misread your SQL - DISTINCT would indeed remove duplicate Blog rows when the same Blog has more than one Post. It's worth mentioning that DISTINCT adds quite a bit of extra work on the database - in many cases just getting the multiple rows may be more efficient (depending on the average cardinality and the size of the duplicated Blog propertieS).

Re Post/Posts, #4051 isn't yet merged (and even after it is merged, changes are only pushed to live every now and then). For such small changes, we generally recommend to just submit a PR - there's not even any actual markdown involved here.

@DickBaker
Copy link
Contributor Author

DickBaker commented Sep 24, 2022

Having agreed with the PostTag usage found here
https://github.com/dotnet/EntityFramework.Docs/blob/main/entity-framework/core/modeling/relationships.md

I now find a rival "PostTags" usage here
https://github.com/dotnet/EntityFramework.Docs/blob/main/samples/core/Modeling/Relationships/FluentAPI/ManyToManyShared.cs?name=ManyToManyShared

so I urge you to ensure consistent message. BTW I have found no explicit indication as to whether the PK of such a join table is (PostId,TagId) or (TagId,PostId) nor whether devs should consider enforcing an additional IX (TagId) or (PostId) respectively [nor how to specify this].

Likewise I am not satisfied that the docs adequately describe the link-table situations of (FKA,FKB) vs the additiional payload columns [and personally I have difficulty accepting Arthur's Dictionary<string,object> flattening as I suspect boxing and loose-typing [but this may be ignorance on my side!]

The changes made EF6->EFCore over complex types (e.g. Address) are also missing.

Re DISTINCT I strongly lean to making the db do the work rather than experience elongated network traffic plus client-side EF having to parse+apply each record. We are both experienced perf practitioners and professors of "it depends", but AFAIK the QP to add the Distinct may add another Sort step but judicial choice of IX (as prev para) should make this cheap.

Reminder that my comments largely to the docs rather than EF code (with possible exception of that DISTINCT wrinkle]
BTW I thoroughly support your Issue #29174 to the EF codebase

Your opinions welcome as ever!

@roji
Copy link
Member

roji commented Sep 24, 2022

Re DISTINCT I strongly lean to making the db do the work rather than experience elongated network traffic plus client-side EF having to parse+apply each record.

I'd be very wary of such a sweeping statement; it's generally preferable to remove load from the database, since it's typically much easier to scale the app tier vs. the database tier (i.e. you usually have multiple app servers to a single database server). DISTINCT has quite an impact on how the database constructs the query plan, and generally requires lots of work (see dotnet/efcore#29171 for some similar thoughts on EF's injection of orderings for single query). Outside of EF, I'd carefully take into account actual cardinality and column sizes before adding a DISTINCT; unfortunately, with EF we must make a decision to either do it or not.

Regardless of all that, posting the multiple unrelated doc suggestions in the same issue makes it harder for us to track work and discuss each one. I'd urge opening different issues for different problems, and considering submitting PRs fixing issues, especially when they're small.

@ajcvickers
Copy link
Member

Ping @roji

@roji
Copy link
Member

roji commented Jan 9, 2023

I've gone through this, and I don't think there's anything actionable left here... The Post->Posts change has been merged and applied, and I've just submitted #4201 to generally provide more information on cartesian explosion vs. data duplication.

I do find interesting the idea of adding DISTINCT to the split query SQL, but I'm still not convinced that should be the default behavior. Data duplication doesn't tend to be a big problem - unless huge columns exist on the principal; and adding DISTINCT creates lots of extra work to the database, which in many (and possibly most) cases doesn't justify the network traffic optimization.

So I'm going to go ahead and close this, but as always am happy to discuss further.

@roji roji closed this as not planned Won't fix, can't repro, duplicate, stale Jan 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants