Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid changing expression names during constant folding #1319

Merged
merged 17 commits into from
Nov 22, 2021
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 29 additions & 2 deletions datafusion/src/optimizer/constant_folding.rs
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,10 @@ impl OptimizerRule for ConstantFolding {
.expressions()
.into_iter()
.map(|e| {
// We need to keep original expression name, if any.
// Constant folding should not change expression name.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

let name = &e.name(plan.schema());

// TODO iterate until no changes are made
// during rewrite (evaluating constants can
// enable new simplifications and
Expand All @@ -101,7 +105,30 @@ impl OptimizerRule for ConstantFolding {
// fold constants and then simplify
.rewrite(&mut const_evaluator)?
.rewrite(&mut simplifier)?;
Ok(new_e)

let new_name = &new_e.name(plan.schema());

// Some plans will be candidates in projection pushdown rule to
// trim expressions based on expression names. We need to keep
// expression name for them.
let is_plan_for_projection_pushdown = matches!(
plan,
LogicalPlan::Window { .. }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only those?

What about SELECT 1+1.

Currently this outputs:

❯ SELECT 1+1
;
+----------+
| Int64(2) |
+----------+
| 2        |
+----------+
1 row in set. Query took 0.001 seconds.

I would assume it should keep the Int64(1) + Int64(1) here instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, for Project, it will create many (looks redundant) aliases. Some looks okay but some looks really weird, e.g. some failed tests:

Projection: #test.a, #test.d, NOT #test.b AS test.b = Boolean(false)
  ...
Projection: Int32(0) AS CAST(Utf8(\"0\") AS Int32)
  ...

We have a lot tests that would be failed due to that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried following the model in https://github.com/apache/arrow-datafusion/pull/1315/files#diff-1d33be1a7e8231e53102eab8112e30aa89d8f5cb8c21cd25bcfbce3050cdb433R110 ? I think that calls columnize_expr among perhaps some other differences.

Basically I think the code needs to do something like walk over the field names in the output schema and if they names of the rewritten exprs weren't the same add an alias;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I agree with @Dandandan that this should apply to all nodes, not just a few special cased ones)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically I think the code needs to do something like walk over the field names in the output schema and if they names of the rewritten exprs weren't the same add an alias;

This sounds promising. No, I've not tried columnize_expr. Let me revise this and see if it works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for trying @viirya -- I'll see if I can find some time this weekend to mess around with it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb . I'll keep trying on this too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, for Project, it will create many (looks redundant) aliases. Some looks okay but some looks really weird, e.g. some failed tests:

@viirya the example you gave here looks like correct behavior to me, are you concerned with lots of updates on the tests? or are there other unwanted side effect of this approach?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I'm simply unsure if such changes are okay here as it looks like most queries will be affected (not about its results but the cosmetic one).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it looks good for you, I will update all the tests.

| LogicalPlan::Aggregate { .. }
| LogicalPlan::Union { .. }
);

if let (Ok(expr_name), Ok(new_expr_name)) = (name, new_name) {
if expr_name != new_expr_name
&& is_plan_for_projection_pushdown
{
Ok(new_e.alias(expr_name))
} else {
Ok(new_e)
}
} else {
Ok(new_e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry we may be silently ignoring some real issues in the future.

However, I tried checking expr_name and new_expr_name for errors and I got a bunch of errors like

---- execution::context::tests::window_partition_by stdout ----
Error: Internal("Create name does not support sort expression")
thread 'execution::context::tests::window_partition_by' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `0`: the test returned a termination value with a non-zero status code (1) which indicates a failure', /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/test/src/lib.rs:194:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:517:5
   1: core::panicking::panic_fmt
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:101:14
   2: core::panicking::assert_failed_inner
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:177:23
   3: core::panicking::assert_failed
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:140:5
   4: test::assert_test_result
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/test/src/lib.rs:194:5
   5: datafusion::execution::context::tests::window_partition_by::{{closure}}
             at ./src/execution/context.rs:1771:11
   6: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

So I suppose this is as good as we are going to do for now

}
})
.collect::<Result<Vec<_>>>()?;

Expand Down Expand Up @@ -733,7 +760,7 @@ mod tests {
.build()?;

let expected = "\
Aggregate: groupBy=[[#test.a, #test.c]], aggr=[[MAX(#test.b), MIN(#test.b)]]\
Aggregate: groupBy=[[#test.a, #test.c]], aggr=[[MAX(#test.b) AS MAX(test.b = Boolean(true)), MIN(#test.b)]]\
\n Projection: #test.a, #test.c, #test.b\
\n TableScan: test projection=None";

Expand Down
18 changes: 13 additions & 5 deletions datafusion/src/physical_plan/planner.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1342,6 +1342,17 @@ impl DefaultPhysicalPlanner {
}
}

fn remove_aliases<'a>(&self, expr: &'a Expr) -> Result<(String, &'a Expr)> {
let (name, e) = match expr {
Expr::Alias(sub_expr, alias) => match &**sub_expr {
Expr::Alias(nested_expr, _) => self.remove_aliases(&*nested_expr)?,
_ => (alias.clone(), sub_expr.as_ref()),
},
_ => (physical_name(expr)?, expr),
};
Ok((name, e))
}

/// Create an aggregate expression from a logical expression or an alias
pub fn create_aggregate_expr(
&self,
Expand All @@ -1350,11 +1361,8 @@ impl DefaultPhysicalPlanner {
physical_input_schema: &Schema,
ctx_state: &ExecutionContextState,
) -> Result<Arc<dyn AggregateExpr>> {
// unpack aliased logical expressions, e.g. "sum(col) as total"
let (name, e) = match e {
Expr::Alias(sub_expr, alias) => (alias.clone(), sub_expr.as_ref()),
_ => (physical_name(e)?, e),
};
// unpack (nested) aliased logical expressions, e.g. "sum(col) as total"
let (name, e) = self.remove_aliases(e)?;

self.create_aggregate_expr_with_name(
e,
Expand Down
16 changes: 16 additions & 0 deletions datafusion/tests/sql.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1282,6 +1282,22 @@ async fn csv_query_approx_count() -> Result<()> {
Ok(())
}

#[tokio::test]
async fn query_count_without_from() -> Result<()> {
let mut ctx = ExecutionContext::new();
let sql = "SELECT count(1 + 1)";
let actual = execute_to_batches(&mut ctx, sql).await;
let expected = vec![
"+----------------------------+",
"| COUNT(Int64(1) + Int64(1)) |",
"+----------------------------+",
"| 1 |",
"+----------------------------+",
];
assert_batches_eq!(expected, &actual);
Ok(())
}

#[tokio::test]
async fn csv_query_array_agg() -> Result<()> {
let mut ctx = ExecutionContext::new();
Expand Down