Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve de-correlation to remove duplicate subtrees where possible #270

Open
knassre-bodo opened this issue Feb 18, 2025 · 0 comments
Open
Labels
effort - high major issue that will require multiple steps or complex design enhancement New feature or request optimization Improving the speed/quality of PyDough's outputs

Comments

@knassre-bodo
Copy link
Contributor

Followup to #141, see the sections regarding de-correlation handling for SINGULAR_ONLY_MATCH and AGGREGATION_ONLY_MATCH. These patterns can be optimized since the RHS of the join is the only one needed since it has all of the data from the LHS, and we don't need to keep every record from the LHS.

Suppose the structure of the Hybrid Tree before de-correlation is the following, that tree 3 is derived between operation G and H, and that operation T contains a correlated reference to a term foo defined in operation F (aka so F contains a term CORREL(BACK(1).foo):

Tree 1 Level 1 [A → B → C]
│ │
│ └─ Tree 2 Level 1 [M → N]
│    │
│    Tree 2 Level 2 [O → P]
│
Tree 1 Level 2 [D → E → **F**]
│  
Tree 1 Level 3 [G → H → I]
│ │
│ ├─ Tree 3 Level 1 [Q → R]
│ │  │
│ │  Tree 3 Level 2 [S → **T**]
│ │
│ └─ Tree 4 Level 1 [U → V]
│    │
│    Tree 4 Level 2 [W → X]
│
Tree 1 Level 4 [J → K → L]

Currently, after de-correlation, this becomes the following, where the correlated reference inside T now points to a term defined in F' (so it can be accessed via BACK(3).foo instead of CORREL(BACK(1).foo)):

Tree 1 Level 1 [A → B → C]
│ │
│ └─ Tree 2 Level 1 [M → N]
│    │
│    Tree 2 Level 2 [O → P]
│
Tree 1 Level 2 [D → E → F]
│  
Tree 1 Level 3 [G → H → I]
│ │
│ ├─ Tree 3 Level 1 [A' → B' → C']
│ │  │ │
│ │  │ └─ Tree 5 Level 1 [M' → N']
│ │  │    │
│ │  │    Tree 5 Level 2 [O' → P']
│ │  │
│ │  Tree 3 Level 2 [D' → E' → **F'**]
│ │  │
│ │  Tree 3 Level 3 [G]
│ │  │
│ │  Tree 3 Level 4 [Q → R]
│ │  │
│ │  Tree 3 Level 5 [S → **T**]
│ │
│ └─ Tree 4 Level 1 [U → V]
│    │
│    Tree 4 Level 2 [W → X]
│
Tree 1 Level 4 [J → K → L]

However, if the connection from tree 1 to tree 3 was originally of the SINGULAR_ONLY_MATCH and AGGREGATION_ONLY_MATCH pattern, we don't need to keep the operators A-G. So we could rearrange the tree to the following form where operation **Z** is a special access to the child that contains tree 3, which brings all of its data into context but with level 3 considered current data, data from levels 1-2 considered back references, and data from levels 4-5 considered data from child references. Any subsequent back references from tree 1 levels 3' & 4 can be rerouted to access terms from level 3'. This special form requires modifications to the Hybrid Tree IR & relational conversion, including a new operation class.

Tree 1 Level 3' [**Z** → H → I]
│ │
│ ├─ Tree 3 Level 1' [A' → B' → C']
│ │  │ │
│ │  │ └─ Tree 5 Level 1 [M' → N']
│ │  │    │
│ │  │    Tree 5 Level 2 [O' → P']
│ │  │
│ │  Tree 3 Level 2 [D' → E' → **F'**]
│ │  │
│ │  Tree 3 Level 3 [G]
│ │  │
│ │  Tree 3 Level 4 [Q → R]
│ │  │
│ │  Tree 3 Level 5 [S → **T**]
│ │
│ └─ Tree 4 Level 1 [U → V]
│    │
│    Tree 4 Level 2 [W → X]
│
Tree 1 Level 4 [J → K → L]

If the access was in the AGGREGATION_ONLY_MATCH an extra wrinkle is added: all of the data needs to be aggregated from the perspective of Tree 3 Level 3 since levels 4-5 (aka levels 1-2 in the original Tree 3) result in a plural cardinality with regard to Tree 3 Level 3, which means a change in cardinality with respect to Tree 1 Level 3'. The data should be aggregated by the uniqueness keys of Tree 3 levels 1-3. All data referenced from levels 1-3 needs to be passed-through via dummy aggregations (e.g. ANY_VALUE) if it is not a grouping key. This may require a new operation to place at the bottom of Tree 3. Even if this results in an aggregation with a lot of rows, a lot of keys, and a lot of dummy ANY_VALUE, that is still likely preferable over the duplicate subtree & join in virtually all circumstances, especially the extreme ones that can arise from de-correlation.

Note: this entire pattern only arises when the correlated subtree subtree simultaneously has HAS called on it while data is accessed from it, such as the examples below:

# Example 1: finding European customers whose comment's include their nation name,
# and accesses their country's name (SINGULAR_ONLY_MATCH)
european_country = nation.WHERE((region.name == "EUROPE") & CONTAINS(BACK(1).comment, LOWER(name)))
result_1 = customers.WHERE(HAS(european_country))(name, nation_name=european_country.name)

# Example 2: finding the average package order price per-customer of their orders
# where the order price is at least 1% of their current account balance, but only from
# customers who made at least one such order (AGGREGATION_ONLY_MATCH)
selected_orders = orders.WHERE(total_price >= (BACK(1).acctbal * 0.01))
result_2 = customers.WHERE(HAS(selected_orders))(name, selected_orders=AVG(orders_95.total_price))

In the first example:

  • customers is the equivalent of Tree 1 Operation G, and the subsequent WHERE and calc are rest of tree 1 level 3
  • european_country is the equivalent of tree 3, so the BACK(1).comment is the correlated reference that exits the subtree but points back to customers from parent tree
  • The access from Tree 1 to Tree 3 is SINGULAR_ONLY_MATCH because we need to access the name property from the child, but we also have a HAS filter to prune the current level to only include the rows where there is a match to the subtree.

In the second example:

  • customers is the equivalent of Tree 1 Operation G, and the subsequent WHERE and calc are rest of tree 1 level 3
  • selected_orders is the equivalent of tree 3, so the BACK(1).acctbal is the correlated reference that exits the subtree but points back to customers from parent tree
  • The access from Tree 1 to Tree 3 is AGGREGATION_ONLY_MATCH because we need to access the aggregated name property from the child, but we also have a HAS filter to prune the current level to only include the rows where there is a match to the subtree.
  • In the decorrelated version, would need to group the subtree by the key property of customers, and pass name through via ANY_VALUE so if another operator happens after result_2 that steps down a level, it can reference name via BACK(1)
@knassre-bodo knassre-bodo added effort - high major issue that will require multiple steps or complex design enhancement New feature or request optimization Improving the speed/quality of PyDough's outputs labels Feb 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
effort - high major issue that will require multiple steps or complex design enhancement New feature or request optimization Improving the speed/quality of PyDough's outputs
Projects
None yet
Development

No branches or pull requests

1 participant