Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rebase_expr should probably use transform_down and not transform_up #8280

Closed
gruuya opened this issue Nov 20, 2023 · 1 comment · Fixed by #8890
Closed

rebase_expr should probably use transform_down and not transform_up #8280

gruuya opened this issue Nov 20, 2023 · 1 comment · Fixed by #8890
Labels
bug Something isn't working

Comments

@gruuya
Copy link
Contributor

gruuya commented Nov 20, 2023

Describe the bug

While working with first_value aggregation function I noticed what seems to be a minute, albeit general, deficiency in the rebase_expr function, which in the case below leads to an error.

Namely, when re-basing an expression against a list of provided base expressions transformations are done in a post-order fashion:
https://github.com/apache/arrow-datafusion/blob/2156dde54623d26635d4388d161d94ac79918cdc/datafusion/sql/src/utils.rs#L64-L69

The issue below arises when this list of base expressions contains overlapping elements, e.g. the first base expression (group by clause) is a sub-expression of the second base expression (the first_value aggregation). Since the transformation is done inside-out, once the first sub-expression is columnized, the original expression gets changed and so it can't be re-based anymore (even when the original root expression is also an available base expression as is the case below).

However, even regardless of that error, I think it could be argued that rebase_expr should perform pre-order traversal in general:

  1. rebase_expr should strive to re-base on the outermost available base expression in order to minimize (however negligible) the computation that needs to be done on top of it. The only reservation I have here is if somehow there is some logic that relies on repeated inside-out columnization which can only be achieved via transform_up (this doesn't seem to be the case judging by the tests).
  2. While transform_up transforms the current sub-expression only after visiting it (and it's children), transform_down transforms it before traversing any child expressions. Since the transformation involved is columnization, which effectively prunes that branch of the expression tree, this leads to strictly less-or-equal number of node visits than transform_up.

To Reproduce

❯ create table t as values ('a', 1, 'a1'), ('a', 2, 'a2'), ('b', 1, 'b1'), ('b', 2, 'b2');
0 rows in set. Query took 0.023 seconds.

❯ select first_value(column3 order by ascii(column1) * (column2 % 2)) from t group by ascii(column1);
Error during planning: Projection references non-aggregate values: Expression t.column3 could not be resolved from available columns: ascii(t.column1), FIRST_VALUE(t.column3) ORDER BY [ascii(t.column1) * t.column2 % Int64(2) ASC NULLS LAST]

Expected behavior

❯ create table t as values ('a', 1, 'a1'), ('a', 2, 'a2'), ('b', 1, 'b1'), ('b', 2, 'b2');
0 rows in set. Query took 0.019 seconds.

❯ select first_value(column3 order by ascii(column1) * (column2 % 2)) from t group by ascii(column1);
+------------------------------------------------------------------------------------------+
| FIRST_VALUE(t.column3) ORDER BY [ascii(t.column1) * t.column2 % Int64(2) ASC NULLS LAST] |
+------------------------------------------------------------------------------------------+
| a2                                                                                       |
| b2                                                                                       |
+------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.034 seconds.

Additional context

No response

@gruuya gruuya added the bug Something isn't working label Nov 20, 2023
@wizardxz
Copy link
Contributor

I found similar issue,
❯ SELECT a + 1 AS d, a + 1 + b AS c FROM (SELECT 1 AS a, 2 AS b) GROUP BY a + 1, a + 1 + b;
Error during planning: Projection references non-aggregate values: Expression b could not be resolved from available columns: a + Int64(1), a + Int64(1) + b

because we are transform_up in rebase_expr()
a+1+b is split to a+1 and b, then we found b is not in group list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants