Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When DataFusion optimizer rewrites an expression, it should not change the expressions "name" / "identity" / "alias" #3794

Closed
alamb opened this issue Oct 11, 2022 · 3 comments
Labels
optimizer Optimizer rules

Comments

@alamb
Copy link
Contributor

alamb commented Oct 11, 2022

Background

Every expression in DataFusion has a name, which is used as the column name. For example, in this example the output contains a single column with the name "COUNT(aggregate_test_100.c9)":

select count(c9) from aggregate_test_100;
+------------------------------+
| COUNT(aggregate_test_100.c9) |
+------------------------------+
| 100                          |
+------------------------------+

These names are used to refer to the columns in both subqueries as well as internally from one stage of the LogicalPlan to another. For example

❯ select "COUNT(aggregate_test_100.c9)" + 1 from (select count(c9) from aggregate_test_100) as sq;
+--------------------------------------------+
| sq.COUNT(aggregate_test_100.c9) + Int64(1) |
+--------------------------------------------+
| 101                                        |
+--------------------------------------------+
1 row in set. Query took 0.005 seconds.

Implication

DataFusion contains an extensive set of OptimizerRules that may rewrite the plan and/or its expressions so they execute more quickly. However, the optimizer rules must so

Because DataFusion identifies columns using a string name, it means it is critical that the names of expressions are not changed by the optimizer when it rewrites expressions. This is typically accomplished by renaming a rewritten expression by adding an alias.

Here is a simple example of such a rewrite. The expression 1 + 2 can be internally simplified to 3 but must still be displayed the same as 1 + 2:

select 1 + 2;
+---------------------+
| Int64(1) + Int64(2) |
+---------------------+
| 3                   |
+---------------------+

Looking at the EXPLAIN output we can see that the optimizer has effectively rewritten 1 + 2 into effectively 3 as "1 + 2":

❯ explain select 1 + 2;
+---------------+-------------------------------------------------+
| plan_type     | plan                                            |
+---------------+-------------------------------------------------+
| logical_plan  | Projection: Int64(3) AS Int64(1) + Int64(2)     |
|               |   EmptyRelation                                 |
| physical_plan | ProjectionExec: expr=[3 as Int64(1) + Int64(2)] |
|               |   EmptyExec: produce_one_row=true               |
|               |                                                 |
+---------------+-------------------------------------------------+
2 rows in set. Query took 0.001 seconds.

If the expression name is not preserved, bugs such as #3704 and #3555 occur where the expected columns can not be found.

Problems

There are at least three places that we have rediscovered the issue of names changing when rewriting expressions such as #3704, #3555 and https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/simplify_expressions.rs#L316

I tried to consolidate this learning into #3727 but , as @liukun4515 has pointed out #3727 (comment) , there are likely several other places that have similar bugs lurking

@andygrove
Copy link
Member

There are currently two ways to create a "name" for an expression in the logical plan.

    /// Returns the name of this expression as it should appear in a schema. This name
    /// will not include any CAST expressions.
    pub fn name(&self) -> Result<String> {
        create_name(self)
    }

    /// Returns a full and complete string representation of this expression.
    pub fn canonical_name(&self) -> String {
        format!("{}", self)
    }

I have seen that we sometimes use name when we should use canonical_name instead.

@andygrove
Copy link
Member

I have stolen the great content in this PR and included it in #3788

@alamb
Copy link
Contributor Author

alamb commented May 14, 2024

I believe this has been resolved for quite some time and there is an automated check that optimizer rules don't change the schema during rewrite

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules
Projects
None yet
Development

No branches or pull requests

2 participants