Improve de-correlation to remove duplicate subtrees where possible #270
Labels: effort - high (major issue that will require multiple steps or complex design); enhancement (New feature or request); optimization (Improving the speed/quality of PyDough's outputs)
Followup to #141; see the sections regarding de-correlation handling for `SINGULAR_ONLY_MATCH` and `AGGREGATION_ONLY_MATCH`. These patterns can be optimized: the RHS of the join already has all of the data from the LHS, so the RHS is the only side needed, and we don't need to keep every record from the LHS.

Suppose the structure of the Hybrid Tree before de-correlation is the following, that tree 3 is derived between operation
`G` and `H`, and that operation `T` contains a correlated reference to a term `foo` defined in operation `F` (so `T` contains a term `CORREL(BACK(1).foo)`):

Currently, after de-correlation, this becomes the following, where the correlated reference inside `T` now points to a term defined in `F'` (so it can be accessed via `BACK(3).foo` instead of `CORREL(BACK(1).foo)`):

However, if the connection from tree 1 to tree 3 was originally of the
`SINGULAR_ONLY_MATCH` or `AGGREGATION_ONLY_MATCH` pattern, we don't need to keep the operators A-G. So we could rearrange the tree to the following form, where operation `**Z**` is a special access to the child that contains tree 3. It brings all of tree 3's data into context, but with level 3 considered current data, data from levels 1-2 considered back references, and data from levels 4-5 considered data from child references. Any subsequent back references from tree 1 levels `3'` & `4` can be rerouted to access terms from level `3'`. This special form requires modifications to the Hybrid Tree IR & relational conversion, including a new operation class.

If the access was in the
`AGGREGATION_ONLY_MATCH` pattern, an extra wrinkle is added: all of the data needs to be aggregated from the perspective of Tree 3 Level 3, since levels 4-5 (aka levels 1-2 in the original Tree 3) result in a plural cardinality with regard to Tree 3 Level 3, which means a change in cardinality with respect to Tree 1 Level 3'. The data should be aggregated by the uniqueness keys of Tree 3 levels 1-3. All data referenced from levels 1-3 needs to be passed through via dummy aggregations (e.g. `ANY_VALUE`) if it is not a grouping key. This may require a new operation to place at the bottom of Tree 3. Even if this results in an aggregation with a lot of rows, a lot of keys, and a lot of dummy `ANY_VALUE` calls, that is still likely preferable to the duplicate subtree & join in virtually all circumstances, especially the extreme ones that can arise from de-correlation.

Note: this entire pattern only arises when the correlated subtree simultaneously has `HAS` called on it while data is accessed from it, such as the examples below:

In the first example:
- `customers` is the equivalent of Tree 1 Operation `G`, and the subsequent `WHERE` and calc are the rest of tree 1 level 3
- `european_country` is the equivalent of tree 3, so the `BACK(1).comment` is the correlated reference that exits the subtree but points back to `customers` from the parent tree
- The connection is `SINGULAR_ONLY_MATCH` because we need to access the `name` property from the child, but we also have a `HAS` filter to prune the current level to only include the rows where there is a match to the subtree.

In the second example:
- `customers` is the equivalent of Tree 1 Operation `G`, and the subsequent `WHERE` and calc are the rest of tree 1 level 3
- `selected_orders` is the equivalent of tree 3, so the `BACK(1).acctbal` is the correlated reference that exits the subtree but points back to `customers` from the parent tree
- The connection is `AGGREGATION_ONLY_MATCH` because we need to access the aggregated `name` property from the child, but we also have a `HAS` filter to prune the current level to only include the rows where there is a match to the subtree.
- The aggregation would group by the `key` property of `customers` and pass `name` through via `ANY_VALUE`, so if another operator happens after `result_2` that steps down a level, it can reference `name` via `BACK(1)`
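
To make the `AGGREGATION_ONLY_MATCH` case concrete, here is a minimal Python sketch of the aggregation step described above. It is not PyDough code and does not reflect how the Hybrid Tree or relational conversion is actually implemented. The function name `aggregate_with_passthrough`, the column `total_price`, and the choice of `SUM` as the real aggregation are all assumptions made for illustration. The input rows stand in for the already-joined data of the subtree, so customers without any matching order are simply absent, which is what the `HAS` filter requires.

```python
def aggregate_with_passthrough(rows, key_cols, sum_cols, passthrough_cols):
    """Group `rows` by `key_cols`, SUM the `sum_cols`, and carry
    `passthrough_cols` along with ANY_VALUE semantics (here: the first
    value seen in each group). Hypothetical helper for illustration only."""
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in groups:
            out = {c: row[c] for c in key_cols}
            # ANY_VALUE: every row in a group carries the same parent-level
            # value, so keeping the first one is sufficient.
            out.update({c: row[c] for c in passthrough_cols})
            out.update({c: 0 for c in sum_cols})
            groups[key] = out
        for c in sum_cols:
            groups[key][c] += row[c]
    return list(groups.values())


# Hypothetical joined rows: one row per (customer, selected order) pair.
# A customer with no selected orders (e.g. key=2) never appears, so no
# separate join back to the customers table is needed to apply HAS.
joined_rows = [
    {"key": 1, "name": "Alice", "total_price": 10.0},
    {"key": 1, "name": "Alice", "total_price": 5.0},
    {"key": 3, "name": "Carol", "total_price": 7.5},
]

result = aggregate_with_passthrough(
    joined_rows,
    key_cols=["key"],
    sum_cols=["total_price"],
    passthrough_cols=["name"],
)
```

After the aggregation there is again one row per customer `key`, and `name` is still available to later operators even though it was never a grouping key.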