Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pr/9813 experimentation #1

Conversation

mustafasrepo
Copy link

@mustafasrepo mustafasrepo commented Mar 27, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

@suremarc I reviewed your PR9812. I devised an alternative solution, which I think more general and support multi children. If you are Ok with these changes, we can continue with this solution in the upstream.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Copy link

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @mustafasrepo and @suremarc -- I am a little confused about what is going on here

@@ -167,6 +168,11 @@ impl PhysicalExpr for CastExpr {
fn get_ordering(&self, children: &[SortProperties]) -> SortProperties {
children[0]
}

/// If it is known that [`CastExpr`] have an ordering. Its input should have same ordering.
fn get_children_ordering(&self, ordering: SortOptions) -> Vec<SortProperties> {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to add an implementation for function calls that used the declared monotonicity of the function too

https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.monotonicity

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an example for this use case.

@@ -226,6 +227,15 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + PartialEq<dyn Any> {
fn get_ordering(&self, _children: &[SortProperties]) -> SortProperties {
SortProperties::Unordered
}

/// The order information of the children expressions based the ordering of the expression.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused about the need for this API -- why isn't the calculation of orderings from children enough?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this implementation is the reverse of the deduction from children orderings. The problem in the original issue is that, when table has ordering: [a ASC, b ASC, c ASC]. Assume we apply following predicate a=round(c) in the filter. In this case, after filtering we know that a and round(c) would be same from now on. This enables us to deduce round(c) is itself ordered. Which in turn, enables us to deduce c would be ordered.

In this case constraint (e.g ordering of the expression round(c)) comes from the analysis. This newly introduced API enables us to deduce given that output ordering, its children (e.g c) would have corresponding ordering.

With this analysis ORDER BY c ASC wouldn't add new SortExec to the plan. Since it is known that after filter, this requirement is already satisfied. Unfortunately I couldn't come up with solution without introducing a new API.

Copy link
Collaborator

@suremarc suremarc Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem correct to me to say that round(c) being ordered implies c is ordered. For runs that have the same value for round(c), the ordering of c can be arbitrary. For example, if the data is ordered by c and we project round(c), followed by a SortPreservingMergeStream on round(c), we have no guarantees that after merging the data will still be ordered by c, even though it was originally.

It seems only true if the data is sorted by [round(c), c], as you need c to be the tiebreaker. But for [round(c), b, c] it will not be true that the output is ordered by c unless you know b is constant when round(c) is. So it seems to me one needs to take into account that round(c) and c are next to each other in the sort order.

@@ -167,6 +168,11 @@ impl PhysicalExpr for CastExpr {
fn get_ordering(&self, children: &[SortProperties]) -> SortProperties {
children[0]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if a cast really preserves the ordering? I would have thought that casting numbers to strings would change the ordering for example

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually you are right, we should add a check something like whether cast is between similar types

];
// If round(c) is a monotonic expression of c, then `order by c` is satisfied given `round(c)=a` (where `a` is leading ordering).
for (expr, (expected_expr, expected_options)) in test_exprs {
properties.add_equal_conditions(&expr, &col("a", schema.as_ref())?);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to do properties.clone() here? Otherwise the test cases might affect one another since we're mutating the properties.

@@ -226,6 +227,15 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + PartialEq<dyn Any> {
fn get_ordering(&self, _children: &[SortProperties]) -> SortProperties {
SortProperties::Unordered
}

/// The order information of the children expressions based the ordering of the expression.
Copy link
Collaborator

@suremarc suremarc Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem correct to me to say that round(c) being ordered implies c is ordered. For runs that have the same value for round(c), the ordering of c can be arbitrary. For example, if the data is ordered by c and we project round(c), followed by a SortPreservingMergeStream on round(c), we have no guarantees that after merging the data will still be ordered by c, even though it was originally.

It seems only true if the data is sorted by [round(c), c], as you need c to be the tiebreaker. But for [round(c), b, c] it will not be true that the output is ordered by c unless you know b is constant when round(c) is. So it seems to me one needs to take into account that round(c) and c are next to each other in the sort order.

@mustafasrepo
Copy link
Author

It doesn't seem correct to me to say that round(c) being ordered implies c is ordered. For runs that have the same value for round(c), the ordering of c can be arbitrary. For example, if the data is ordered by c and we project round(c), followed by a SortPreservingMergeStream on round(c), we have no guarantees that after merging the data will still be ordered by c, even though it was originally.

You are right. I think, with this finding it better to move on with your original PR9813

@mustafasrepo mustafasrepo marked this pull request as draft March 28, 2024 06:21
matthewmturner pushed a commit that referenced this pull request Apr 25, 2024
…aTypes) (apache#8985)

* ScalarValue return types from argument values

* change file name

* try using ?Sized

* use Ok

* move method default impl outside trait

* Use type trait for ExprSchemable

* fix nit

* Proposed Return Type from Expr suggestions (#1)

* Improve return_type_from_args

* Rework example

* Update datafusion/core/tests/user_defined/user_defined_scalar_functions.rs

---------

Co-authored-by: Junhao Liu <junhaoliu2023@gmail.com>

* Apply suggestions from code review

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

* Fix tests + clippy

* rework types to use dyn trait

* fmt

* docs

* Apply suggestions from code review

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>

* Add docs explaining what happens when both `return_type` and `return_type_from_exprs` are called

* clippy

* fix doc -- comedy of errors

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
matthewmturner pushed a commit that referenced this pull request Apr 25, 2024
* refactor `TreeNode::rewrite()`

* use handle_tree_recursion in `Expr`

* use macro for transform recursions

* fix api

* minor fixes

* fix

* don't trust `t.transformed` coming from transformation closures, keep the old way of detecting if changes were made

* rephrase todo comment, always propagate up `t.transformed` from the transformation closure, fix projection pushdown closure

* Fix `TreeNodeRecursion` docs

* extend Skip (Prune) functionality to Jump as it is defined in https://synnada.notion.site/synnada/TreeNode-Design-Proposal-bceac27d18504a2085145550e267c4c1

* fix Jump and add tests

* jump test fixes

* fix clippy

* unify "transform" traversals using macros, fix "visit" traversal jumps, add visit jump tests, ensure consistent naming `f` instead of `op`, `f_down` instead of `pre_visit` and `f_up` instead of `post_visit`

* fix macro rewrite

* minor fixes

* minor fix

* refactor tests

* add transform tests

* add apply, transform_down and transform_up tests

* refactor tests

* test jump on both a and e nodes in both top-down and bottom-up traversals

* better transform/rewrite tests

* minor fix

* simplify tests

* add stop tests, reorganize tests

* fix previous merges and remove leftover file

* Review TreeNode Refactor (#1)

* Minor changes

* Jump doesn't ignore f_up

* update test

* Update rewriter

* LogicalPlan visit update and propagate from children flags

* Update tree_node.rs

* Update map_children's

---------

Co-authored-by: Mustafa Akur <mustafa.akur@synnada.ai>

* fix

* minor fixes

* fix f_up call when f_down returns jump

* simplify code

* minor fix

* revert unnecessary changes

* fix `DynTreeNode` and `ConcreteTreeNode` `transformed` and `tnr` propagation

* introduce TransformedResult helper

* fix docs

* restore transform as alias to trassform_up

* restore transform as alias to trassform_up 2

* Simplifications and comment improvements (#2)

---------

Co-authored-by: Berkay Şahin <124376117+berkaysynnada@users.noreply.github.com>
Co-authored-by: Mustafa Akur <mustafa.akur@synnada.ai>
Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label May 28, 2024
@mustafasrepo
Copy link
Author

Thanks @suremarc for collaboration, closing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants