-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid cartesian product without query splitting #22435
Comments
@Timovzl thanks for this proposal, we'll study this and respond soon. |
@Timovzl this is definitely an interesting proposal and has the potential to improve single queries. However, note that this technique works only for multiple collection navigations being loaded at the same level; nested navigations cannot be loaded in this way, and so even if your proposal is implemented, split queries remain relevant. There are also some scenarios where your technique will cause increased data to be transferred (due to duplication of the parent rows), but I agree the benefits likely outweigh that. In any case, the 5.0 version is now complete barring bug fixes. We'll look into this more in-depth for 6.0. |
Note when investigating - do actual cross-database benchmarks to measure and compare perf. |
@roji I did not mean to imply that split queries will become irrelevant. Just that is really nice to have an alternative. :)
I do believe this approach can work for multiple levels of nesting as well. It can be applied as follows:
Some points to note: We'll need different names for each splitter quasi-table, e.g. Not only should this approach work for trees of any breadth and depth, but it should be no worse than performing the same query without it! It only applies where a cartesian product is risked, and it avoids it. I'm sure you will do tests to confirm this. I have had good results with it in practice. As with many of EF's features, I think it would be neat if this one could be applied either per-query or on the registration of a DbContext. I can imagine wanting to use this wherever it may apply (as defined above), or simply having a single query where it comes in handy. |
Sure. I meant to say that this technique works only for multiple collection navigations being loaded at the same level, but cannot help cases where a single navigation is loaded at each level (unless I'm missing something). For example, following your example above, if instead of a Parent with two collection navigations Sons and Daughters we have Customers, with navigation Orders, which itself has another navigation LineItems, the cartesian explosion effect still occurs when loading all customers with all their orders and line items. Unless I'm mistaken, that would still be a case where only split queries can help.
This is really too early to say, but it the choice between split and single query already works this way. If we end up implementing the above mode, I don't see why it would be any different. |
You're right in that it doesn't help in this situation, whereas split queries help (at least in theory). But this scenario shouldn't warrant either. Let me illustrate. If I'm not mistaken, the cartesian explosion is another name for the cartesian product. It refers to the multiplication effect that occurs when loading multiple types of 1:N children onto the same parent. For example, if parent A has 100 sons and 100 daughters, there would be no single correct way to combine the sons and daughters, and so, instead, every combination is returned. Instead of 100 + 100 = 200 rows, we get 10.000! That is an explosion indeed. If loading 200 rows is acceptable, then loading 10.000 is very likely unacceptable. Note also how even a single pathological entry may cause command timeouts in the application. In your example, by contrast, we might have 1 Customer with 5 Orders, each with 10 LineItems. The best number of entities we can hope to transfer is When we do this as a single query, we know that we get a row for each of the LineItems. Each row contains 3 entities, with the Orders and Customers potentially repeated. As such, we get I believe the takeaway here is that an increase from linear to quadratic (e.g. 200 to 10.000) is unacceptable and warrants a solution, such as split queries or the solution we are talking about here. After all, the overhead of performing multiple queries is safer and generally better than the overhead of going from 200 to 10.000 rows. However, in the latter example, there is no real explosion going on. The constant increases a bit, and we transfer a bit more data, but the overhead is generally less than that of performing multiple queries. In conclusion, you're right that the proposed solution cannot help with nesting where no multiplication occurs, whereas split queries can, in theory. However, this situation generally needs no help. Note that the proposed solution can help when multiplication occurs on multiple levels (like this). :) |
I agree with your analysis that the problem is usually considerably more severe when the collections are on the same level, then when they are nested. However, it's important to understand that it ultimately depends on your specific scenario andBut data. In the example with 1 Customer/5 Orders/10 LineItems, the columns for the single customer are transmitted 50 times to the client. Now, if the customer happens to have a very heavy field - such as a photo - that data gets duplicated 50 times (just for one customer). With the right data, this could easily be worse than even the 100*100 explosion of the first scenario. So any scenario where data gets duplicated is potentially very dangerous, and split queries eliminate that altogether. But I'm not disagreeing with anything you've written above. We'd have to thoroughly benchmark your proposal across databases to make sure that there's no surprising perf issues in some scenarios (though I don't expect that would happen), and then evaluate the complexity of producing the query and materializing the results back. Beyond that I think this could be a great improvement. |
It is same for same level or nested. 100 collection items on same level is 10,000 rows then 100 collection item for nested is also 10,000 rows. Repetition of parents excluding the very last entity is same. |
@smitpatel We must compare the best possible case to the cartesian product case, to see how much overhead we are causing. 100 sons * 100 daughts = 10.000 children returned, with only 200 distinct children in existence. Having 100 sons who each contain 100 grandsons means there are actually 10.000 descendants in existence! Sure, we're repeating the parent 10.000 times, so there is some growth: the repetition of the parent causes us to return an extra 10.000 entities, and the repetition of the sons another extra 10.000. |
@roji Fair point! I appreciate the level of thought that is being given to this proposal. |
That is my point and has been said by @roji at multiple point. This works really well for some scenarios but not all and we need to evaluate pros and cons of both sides. |
An idea hit me that might let us avoid even the multiplication on single navigations per level. I haven't experimented with it yet, but the theory seems solid. Basically, we would like to repeat as little as possible of any parent in the hierarchy. We could do that by sticking to its primary key, the minimum we need to identify it. We do, however, want that parent's data. Using the same splitter semi-table technique, we could self-join the parent once, separate from its children, in the same way we separate siblings. Let's use SELECT p.Id, pSelf.*, s.*, d.*
FROM Parents p
LEFT JOIN (SELECT 0 AS Id UNION ALL SELECT 1 UNION ALL SELECT 2) AS Splitter ON TRUE
LEFT JOIN Parents pSelf ON pSelf.Id = p.Id AND Splitter.Id = 0
LEFT JOIN Sons s ON s.ParentId = p.Id AND Splitter.Id = 1
LEFT JOIN Daughters d ON d.ParentId = p.Id AND Splitter.Id = 2
WHERE p.Id = 1 A simple example results set might look like this:
Provided that the database sends NULLs efficiently (such as indicated by a bit field, with no further data for that field), this has the potential to negate the effect of the multiplication almost entirely. Of course, we'd need to confirm that the self-join is performed efficiently. |
I had this in mind too when I read your original proposal :) Transferring nulls is generally pretty efficient, but of course we'd need to confirm. Definitely an interesting direction to explore! |
An approach similar to this is already used in EF6. It works like this. The case is where there is one parent table and at least two child tables. Instead of doing SELECT
parent.id AS parent_id,
parent.col1 AS parent_col1,
son.id AS son_id,
son.col1 AS son_col1,
daughter.id AS daughter_id,
daughter.col1 AS daughter_col1
FROM parent
LEFT JOIN son ON parent.id = son.parent_id
LEFT JOIN daughter ON parent.id = daughter.parent_id
ORDER BY parent.id, son.id, daughter.id it does the following SELECT parent_id, parent_col1, son_id, son_col1, daughter_id, daughter_col1, c1
FROM (
(SELECT CASE WHEN (son.id IS NULL) THEN (NULL) ELSE (1) END AS c1,
parent.id AS parent_id,
parent.col1 AS parent_col1,
son.id AS son_id,
son.col1 AS son_col1,
NULL AS daughter_id,
NULL AS daughter_col1
FROM parent LEFT JOIN son ON parent.id = son.parent_id)
UNION ALL
(SELECT 2 AS c1
parent.id AS parent_id,
parent.col1 AS parent_col1,
NULL AS son_id,
NULL AS son_col1,
daughter.id AS daughter_id,
daughter.col1 AS daughter_col1,
FROM parent INNER JOIN daughter ON parent.id = daughter.parent_id)
) AS t
ORDER BY parent_id, c1 Don't know why it fetches all the parent columns the second time though, could easily be replaced by nulls (except for the identifier). |
Both the EF6 strategy and the self-join strategy have the issue that the "parent" table/subquery must be evaluated twice. If it's a complicated one, you might get bad performance. For db engines supporting APPLY / JOIN LATERAL, we can instead do the following: SELECT p.Id, Splitter.Id, p2.*, s.*, d.*
FROM Parents p
INNER JOIN (SELECT 0 AS Id UNION ALL SELECT 1 UNION ALL SELECT 2) AS Splitter ON TRUE
LEFT JOIN LATERAL (SELECT p.*) p2 ON Splitter.Id = 0
LEFT JOIN Sons s ON s.ParentId = p.Id AND Splitter.Id = 1
LEFT JOIN Daughters d ON d.ParentId = p.Id AND Splitter.Id = 2
ORDER BY p.Id, Splitter.Id, s.Id, d.Id In this case no data will be duplicated whatsoever (except the parent identifier). The same approach could be used in case of just one join, if we want to avoid repeated data over the network. |
Another way of avoiding duplicating the outer table for every inner row would be to add a ROW_NUMBER() OVER (PARTITION BY outer.id). We can then use CASE to replace columns with NULLs if the row number is > 1. Then add this row number != 1 as an order by item to have it emitted first. In any case, it's a bit sad that there is no good SQL method of ordering the results so that row with common column values are grouped together. To make our shapers work correctly, we don't need a full sorted result by outer.id, but just that all rows having the same order.id are emitted consecutively. With #24360, we can eliminate some unnecessary orderings, but with any solution explained above where we also remove the duplicates from the outer table, we must add something more to order by to get the order the shapers expect, which might hurt performance at the db side, if the result set is large. If only different db engines could identify this pattern: SELECT * FROM
(SELECT *, ROW_NUMBER() OVER () AS rn FROM (...) AS t1) t1 -- attach an arbitrary distinct number to each row in t1
LEFT JOIN
(SELECT * FROM ...) t2
ORDER BY t1.rn then they could completely eliminate any sorting from the query plan and just perform the left join as is - i.e. emit all t2 items consecutively for each t1 item. At least PostgreSQL performs this query exactly as written, which is sad. |
Query splitting was introduced, but I would personally avoid its drawbacks like the plague.
Luckily, we can usually avoid cartesian products through careful aggregate design. Still, sometimes we truly do want to join multiple 1:N relations. Of course, we would rather not risk a huge data set caused by an incidental high number of child entities.
This problem can be solved without resorting to query splitting.
As an example, say we are selecting one
Parent
and joining both itsSon
s and itsDaughter
s.Basically, since the joined siblings are independent, we have no reason to want them multiplied. This can be accomplished by explicitly instructing the database to keep them separate:
LEFT JOIN (SELECT 1 AS Id UNION ALL SELECT 2) AS Splitter
.Son
's join condition, add:AND Splitter.Id = 1
.Daughter
's join condition, add:AND Splitter.Id = 2
.This gives us:
While transferring a bit more data (mostly in the form of duplication of the
Parent
), the duplication stays linear and well under control.When we combine careful aggregate design with avoiding the cartesian product, we have all the tools we need to load reasonable object graphs without introducing the significant drawbacks of split queries.
The text was updated successfully, but these errors were encountered: