Review TreeNode Refactor #1

berkaysynnada · 2024-02-27T14:56:15Z

Here is an explanation of the changes I made and their reasons:

visit and rewrite were skipping the f_up (post_visit) of the node if f_down (pre_visit) results with "jump". This behavior is inconsistent with the post_visit "jump"s. In two passes traversals, "jump" actually does jump to the next f_down if the operation is f_up, or jump to the next f_up if the operation is f_down (bypassing the nodes children). It is important to establish the same behavior.
pop_enter_mark method is adapted to updated behavior of jump.
TODO in LogicalPlan's map_children is resolved.
There is no need to transform method. Old behavior of transform has been changed while its name is preserved. Since it is a simplification and ease-of-use PR rather than adding new features, we should add it with a separate PR when it is really needed. I also think that having the assumption of first top-down and then bottom-up may also be misleading for a general purpose API.
There is no need for handle_transform_recursion anymore.
I have added more comments and done some renaming that I thought were more explanatory.

peter-toth · 2024-02-27T15:15:59Z

Thanks for the fix @berkaysynnada, the changes look good/acceptable to me but let me check it thoroughly tomorrow and come back to you.

datafusion/expr/src/tree_node/plan.rs

peter-toth · 2024-02-27T17:27:13Z

datafusion/expr/src/tree_node/plan.rs

+                self.expressions(),
+                new_children.into_iter().map(|child| child.data).collect(),
+            )
+            .map(Transformed::yes)


Why do you want to propogate up the transformed flag based on the detection result (old_children.into_iter().zip(new_children.iter()).any(|(c1, c2)| c1 != &c2.data)) and not based on if the transformation closure (f) applied on any children reported (propagated up)transformed true state?

I mean I left similar TODOs in DynTreeNode and ConcreteTreeNode where the same conflic between the "detected" and "reported" transformed state can happen.

In my PR I always propagated up the trasnformed state from the closures despite the detection could be different. We can agree on to propagate the detected but then IMO we should change this in DynTreeNode and ConcreteTreeNode as well.

Relevant topic in the main PR: https://github.com/apache/arrow-datafusion/pull/8891/files/e1a499ab51aea3623d1c2754f16fb70e6176d119#r1504355259

I initially thought there may exist some rules such that it transforms the tree in a few iterations, but at the final stage, the final data is the same as the initial data. Therefore, it would be better to check the data's itself rather than trusting the transformed flag. However, now I think that the equality check at each step could be expensive, and I think we need to rely on transformed flag. Otherwise it would have no meaning to carry.

I am converting it to propagate the transformed information up if any children is transformed.

Honestly, my take on this general, user provided transformed flag in the transformation closure results is that it is pointless. In the current code on the main branch it doesn't work at all and so it isn't used anywhere for anything. This tells me that in most of the cases noone misses it. (You know I wanted to remove it entirelly first: apache#8835.)

Now the above doesn't mean that I can't imagine a case where we want to check if changes were made to the tree during a transformation. But I see other options without using the transformed flag.

If it is a tree with Arc "links" and enum nodes like LogicalPlan we can compare the old cloned top level node with the new top level node we got back from transform to detect changes. This comparison is cheap (and similar to the old_children.into_iter().zip(new_children.iter()).any(|(c1, c2)| c1 != &c2.data) above).

If it is a tree with Arc<dyn trait> nodes (Arc<DynTreeNode> and ConcreteTreeNode trees fall into this category) then we can compare the old cloned top level node data ptr addres with the new top level node data prt address we got back from transform to detect changes. This is similar to how with_new_children_if_necessary works.

If we have an Expr tree (with Box links) then it would be costly to clone the tree before the transformation so the only cheap way to detect if changes were made is to ask the API user to provide the information if they changed a node. But instead of requesting the transformed flag in all transformation closures from the API users, I would rather offer an new "transform with payload" like API (Transform with payload apache/datafusion#8664) to let the user propagate arbitrary state during transformation. The transformed flag in the Transformed struct is nothing more than a special payload in that regard.

The options you mentioned make sense, but instead of going through so much effort, we should use the already existing transform. I can't see any advantage of removing Transform either. I also use it a lot in the projection optimizer rule I'm currently working on. As far as I remember, there were also some ideas: apache#8835 (comment).

I have sent a new commit related with these. Could you please take a look? I think you are okey with propagating the transformed result according to the children flags.

Yes, I'm ok with keeping the flag, especially if you are going to use it in rules. But I have some concerns with that commit: #1 (comment)

datafusion/common/src/tree_node.rs

peter-toth · 2024-02-27T17:31:20Z

datafusion/common/src/tree_node.rs

    /// ```
    ///
    /// See [`TreeNodeRecursion`] for more details on how the traversal can be controlled.
    ///
-    /// If `f_down` or `f_up` returns [`Err`], recursion is stopped immediately.


Do we need to remove this? I really need this and I would rather keep this here than adding it back in the next PR.

If you don't like that we reuse the name transform then we can rename it like transform_down_up.

Instead of introducing a new API, could we consider using transform_down followed by transform_up? I'm not trying to make things difficult for you, but I'm just curious if we could achieve the same result by adding an extra line of code where needed.

No, that's not the same. I will need a short form of rewrite() (top-down and bottom-up closures provided for one combined traversal). This is an example how I would like to use such an API:
apache#8891 (comment)

I got your point. Not related with the topic, but I guess contains_pattern in your example deep dives on each node until the leaves, so adding such an f_down may get worse the performance actually.

However, in general there may exist such use cases where f_down can speed up the f_up phase. I'm reverting this change. Can I just ask from you to modify the transform adapting the same behavior with updated visit and rewrite? I have updated the test results, but cannot adapt the transform implementation yet.

I think calculating the pattern bitset will be very cheap. (BTW, I want to do that lazyly.) Also, I think in many cases several subtrees remain unchanged during transformations and the patterns of those subtrees don't need to be recalculated, which means that multiple subsequent transformations can take advantage of some patterns calculated during the first transformation.

I'm reverting this change. Can I just ask from you to modify the transform adapting the same behavior with updated visit and rewrite?

Thanks. Sure, I will do that. (Most likely only tomorrow...)

berkaysynnada

I am sending a commit for the mentioned parts. Thanks for your cooperation :)

datafusion/expr/src/tree_node/plan.rs

berkaysynnada · 2024-02-28T08:33:57Z

datafusion/expr/src/tree_node/plan.rs

+                self.expressions(),
+                new_children.into_iter().map(|child| child.data).collect(),
+            )
+            .map(Transformed::yes)


I initially thought there may exist some rules such that it transforms the tree in a few iterations, but at the final stage, the final data is the same as the initial data. Therefore, it would be better to check the data's itself rather than trusting the transformed flag. However, now I think that the equality check at each step could be expensive, and I think we need to rely on transformed flag. Otherwise it would have no meaning to carry.

I am converting it to propagate the transformed information up if any children is transformed.

datafusion/common/src/tree_node.rs

berkaysynnada · 2024-02-28T08:46:40Z

datafusion/common/src/tree_node.rs

    /// ```
    ///
    /// See [`TreeNodeRecursion`] for more details on how the traversal can be controlled.
    ///
-    /// If `f_down` or `f_up` returns [`Err`], recursion is stopped immediately.


Instead of introducing a new API, could we consider using transform_down followed by transform_up? I'm not trying to make things difficult for you, but I'm just curious if we could achieve the same result by adding an extra line of code where needed.

peter-toth · 2024-02-28T10:45:23Z

I am sending a commit for the mentioned parts. Thanks for your cooperation :)

Sure, thanks for the fixes!

peter-toth · 2024-02-28T13:04:26Z

datafusion/common/src/tree_node.rs

+            let new_children = children.into_iter().map_until_stop_and_collect(f)?;
+            // Propagate up `new_children.transformed` and `new_children.tnr`
+            // along with the node containing transformed children.
+            if new_children.transformed {


Isn't this very risky? You now are relying on if the transformation closures filled the transformed flag correctly. What if they don't? So far noone used that flag for anything and noone checked if the flag is correcty filled. If a closure changes a child node but reports transformed=false (incorrectly) then the node's child will not be changed.
And such incorrect closures can exists not only in Apache DataFusion code, but anything that depends on DataFusion...

I think it would be safer to always call let t2 = self.with_new_arc_children(arc_self, t.data)?;. What I'm undecided about is if we should return:

Ok(Transformed::new(t2.data, t.transformed, t.tnr))

or Ok(Transformed::new(t2.data, t2.transformed, t.tnr))

or Ok(Transformed::new(t2.data, t.transformed || t2.transformed, t.tnr))

I can understand your concern, but we are providing an API and I think we should assume it is used correctly. If we provide a good documentation, I don't think there will be any problems. It is clear and not hard to understand, so I don't think someone would return modified data as transformed::no. Carrying this information along would be a burden if we're not going to use it.

Maybe we can change the default value of the methods that create new objects to transformed::yes, or we can close the direct access to data, and force users to modify it through a method (set_data()), and when this method is run at least once, we can automatically set the transformed to yes.

Alright, let's run the tests and see if the current closures use the flag correctly.

BTW, why is ConcreteTreeNode different? As far as I see you didn't apply the if new_children.transformed check there..

tnr needs to come from new_children, returning it from self.with_new_arc_children(arc_self, new_children.data) will not work.

How about:

if new_children.transformed { new_children.map_data(|data| { let arc_self = Arc::clone(&self); self.with_new_arc_children(arc_self, data).map(|t| t.data) }) } else { Ok(Transformed::no(self)) }

BTW, why is ConcreteTreeNode different? As far as I see you didn't apply the if new_children.transformed check there..

You are right. I thought with_new_children already does that propagation from children, but it only checks plans, not the data. I am updating it.

tnr needs to come from new_children, returning it from self.with_new_arc_children(arc_self, new_children.data) will not work.

if new_children.transformed is true, I believe there is no possibility of that self.with_new_arc_children(arc_self, new_children.data) returns transformed::no. Therefore I thought the existing implementation is safe.

tnr needs to come from new_children, returning it from self.with_new_arc_children(arc_self, new_children.data) will not work.

if new_children.transformed is true, I believe there is no possibility of that self.with_new_arc_children(arc_self, new_children.data) returns transformed::no. Therefore I thought the existing implementation is safe.

I'm not talking about .transformed but about the .tnr field of the result. self.with_new_arc_children(arc_self, new_children.data) always returns TreeNodeRecursion::Continue so we need to return .trn from new_children.

berkaysynnada · 2024-02-29T08:37:08Z

I believe this PR is ready after the transform behavior is updated according to tests. I will add more comments to transform with an example scenario of its usage. It seems easy to understand and use to us because we have been dealing with it for a while, but it may be a difficult API to understand and use for someone using TreeNode for the first time.

peter-toth · 2024-02-29T08:48:24Z

Looks good to me now. Let me merge this PR and fix the transform behaviour on the main PR.

I see that you renamed transform to transform_down_up_with_control in dbe24db. Is it a better name for it? I'm ok with that, but I don't get the with_control part.

berkaysynnada · 2024-02-29T10:19:09Z

Looks good to me now. Let me merge this PR and fix the transform behaviour on the main PR.

I see that you renamed transform to transform_down_up_with_control in dbe24db. Is it a better name for it? I'm ok with that, but I don't get the with_control part.

I tried to emphasize the difference of calling transform_down then transform_up with that. Renamed it transform_down_up only now. Maybe you can show up with better ones. It is an API that should be used with caution and I would like to emphasize that it is quite different from the old transform.

The only TODO is converting transform and related macros to apply f_up on the jumped node also.

peter-toth · 2024-02-29T10:32:39Z

Ok, thanks! I will merge this PR later today and fix the TODO.

* refactor `TreeNode::rewrite()` * use handle_tree_recursion in `Expr` * use macro for transform recursions * fix api * minor fixes * fix * don't trust `t.transformed` coming from transformation closures, keep the old way of detecting if changes were made * rephrase todo comment, always propagate up `t.transformed` from the transformation closure, fix projection pushdown closure * Fix `TreeNodeRecursion` docs * extend Skip (Prune) functionality to Jump as it is defined in https://synnada.notion.site/synnada/TreeNode-Design-Proposal-bceac27d18504a2085145550e267c4c1 * fix Jump and add tests * jump test fixes * fix clippy * unify "transform" traversals using macros, fix "visit" traversal jumps, add visit jump tests, ensure consistent naming `f` instead of `op`, `f_down` instead of `pre_visit` and `f_up` instead of `post_visit` * fix macro rewrite * minor fixes * minor fix * refactor tests * add transform tests * add apply, transform_down and transform_up tests * refactor tests * test jump on both a and e nodes in both top-down and bottom-up traversals * better transform/rewrite tests * minor fix * simplify tests * add stop tests, reorganize tests * fix previous merges and remove leftover file * Review TreeNode Refactor (#1) * Minor changes * Jump doesn't ignore f_up * update test * Update rewriter * LogicalPlan visit update and propagate from children flags * Update tree_node.rs * Update map_children's --------- Co-authored-by: Mustafa Akur <mustafa.akur@synnada.ai> * fix * minor fixes * fix f_up call when f_down returns jump * simplify code * minor fix * revert unnecessary changes * fix `DynTreeNode` and `ConcreteTreeNode` `transformed` and `tnr` propagation * introduce TransformedResult helper * fix docs * restore transform as alias to trassform_up * restore transform as alias to trassform_up 2 * Simplifications and comment improvements (#2) --------- Co-authored-by: Berkay Şahin <124376117+berkaysynnada@users.noreply.github.com> Co-authored-by: Mustafa Akur <mustafa.akur@synnada.ai> Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>

berkaysynnada and others added 4 commits February 26, 2024 15:17

Minor changes

f079fce

Jump doesn't ignore f_up

988dd28

update test

db9664f

Update rewriter

56e72d8

github-actions bot added logical-expr optimizer labels Feb 27, 2024

peter-toth reviewed Feb 27, 2024

View reviewed changes

datafusion/expr/src/tree_node/plan.rs Outdated Show resolved Hide resolved

peter-toth reviewed Feb 27, 2024

View reviewed changes

datafusion/common/src/tree_node.rs Show resolved Hide resolved

peter-toth reviewed Feb 27, 2024

View reviewed changes

peter-toth mentioned this pull request Feb 27, 2024

Consolidate TreeNode transform and rewrite APIs apache/datafusion#8891

Merged

berkaysynnada commented Feb 28, 2024

View reviewed changes

berkaysynnada added 2 commits February 28, 2024 14:23

LogicalPlan visit update and propagate from children flags

84c34b5

Update tree_node.rs

3841067

peter-toth reviewed Feb 28, 2024

View reviewed changes

Update map_children's

dbe24db

berkaysynnada force-pushed the refactor-treenode-minor branch from cb7f965 to dbe24db Compare February 29, 2024 08:32

Add example

5d52a44

peter-toth merged commit 271a0fd into peter-toth:refactor-treenode-rewrite Feb 29, 2024
1 check passed

peter-toth mentioned this pull request Mar 21, 2024

fix(9678): short circuiting prevented population of visited stack, for common subexpr elimination optimization apache/datafusion#9685

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Review TreeNode Refactor #1

Review TreeNode Refactor #1

berkaysynnada commented Feb 27, 2024

peter-toth commented Feb 27, 2024

peter-toth Feb 27, 2024

peter-toth Feb 27, 2024

berkaysynnada Feb 28, 2024

peter-toth Feb 28, 2024

berkaysynnada Feb 28, 2024 •

edited

Loading

peter-toth Feb 28, 2024 •

edited

Loading

peter-toth Feb 27, 2024

peter-toth Feb 28, 2024

berkaysynnada Feb 28, 2024

peter-toth Feb 28, 2024

berkaysynnada Feb 28, 2024

peter-toth Feb 28, 2024 •

edited

Loading

berkaysynnada left a comment

berkaysynnada Feb 28, 2024

berkaysynnada Feb 28, 2024

peter-toth commented Feb 28, 2024

peter-toth Feb 28, 2024 •

edited

Loading

berkaysynnada Feb 28, 2024

peter-toth Feb 28, 2024 •

edited

Loading

peter-toth Feb 28, 2024

berkaysynnada Feb 29, 2024

berkaysynnada Feb 29, 2024

peter-toth Feb 29, 2024

berkaysynnada commented Feb 29, 2024

peter-toth commented Feb 29, 2024 •

edited

Loading

berkaysynnada commented Feb 29, 2024

peter-toth commented Feb 29, 2024

Review TreeNode Refactor #1

Review TreeNode Refactor #1

Conversation

berkaysynnada commented Feb 27, 2024

peter-toth commented Feb 27, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

berkaysynnada Feb 28, 2024 • edited Loading

Choose a reason for hiding this comment

peter-toth Feb 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

peter-toth Feb 28, 2024 • edited Loading

Choose a reason for hiding this comment

berkaysynnada left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

peter-toth commented Feb 28, 2024

peter-toth Feb 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

peter-toth Feb 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

berkaysynnada commented Feb 29, 2024

peter-toth commented Feb 29, 2024 • edited Loading

berkaysynnada commented Feb 29, 2024

peter-toth commented Feb 29, 2024

berkaysynnada Feb 28, 2024 •

edited

Loading

peter-toth Feb 28, 2024 •

edited

Loading

peter-toth Feb 28, 2024 •

edited

Loading

peter-toth Feb 28, 2024 •

edited

Loading

peter-toth Feb 28, 2024 •

edited

Loading

peter-toth commented Feb 29, 2024 •

edited

Loading