Fix `AggregateStatistics` optimization so it doesn't change output type #2674

alamb · 2022-06-01T13:36:42Z

Which issue does this PR close?

Closes #2673

Rationale for this change

See #2673 -- the optimizer pass is changing the output type of the plan.

What changes are included in this PR?

Fix bug
New regression test
Give some constants related to COUNT(*) expansion symbolic names to improve readability

Are there any user-facing changes?

Fewer bugs

Does this PR break compatibility with Ballista?

No

alamb · 2022-06-01T13:54:41Z

datafusion/core/src/physical_optimizer/aggregate_statistics.rs

@@ -37,6 +38,9 @@ use crate::error::Result;
 #[derive(Default)]
 pub struct AggregateStatistics {}

+/// The name of the column corresponding to [`COUNT_STAR_EXPANSION`]
+const COUNT_STAR_NAME: &str = "COUNT(UInt8(1))";


This constant was hard coded in a few places, and I think the symbolic name makes it easier to understand what the code is doing

alamb · 2022-06-01T13:55:10Z

datafusion/core/src/physical_optimizer/aggregate_statistics.rs

@@ -148,10 +152,10 @@ fn take_optimizable_table_count(
                .as_any()
                .downcast_ref::<expressions::Literal>()
            {
-                if lit_expr.value() == &ScalarValue::UInt8(Some(1)) {
+                if lit_expr.value() == &COUNT_STAR_EXPANSION {


There was an implicit coupling between the SQL planner and this file, which I have now made explicit with a named constant
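As a rough illustration of that coupling, here is a minimal, dependency-free sketch. The `ScalarValue` enum below is a hypothetical, trimmed-down stand-in for DataFusion's type, and `expand_count_star` and `is_count_star_arg` are made-up names for the planner and optimizer sides; only `COUNT_STAR_EXPANSION` and its `UInt8(1)` value come from the diff.

```rust
// Hypothetical, trimmed-down stand-in for DataFusion's `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue {
    UInt8(Option<u8>),
    Int64(Option<i64>),
}

/// The literal that `COUNT(*)` is expanded to. Both sides share this one definition.
pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::UInt8(Some(1));

/// Planner side (sketch): the argument produced for `COUNT(*)`.
pub fn expand_count_star() -> ScalarValue {
    COUNT_STAR_EXPANSION.clone()
}

/// Optimizer side (sketch): recognize a literal argument that came from `COUNT(*)`.
pub fn is_count_star_arg(arg: &ScalarValue) -> bool {
    arg == &COUNT_STAR_EXPANSION
}
```

With a single constant, changing the expansion in one place cannot silently break the check in the other.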

alamb · 2022-06-01T13:55:54Z

datafusion/core/src/physical_optimizer/aggregate_statistics.rs

+
+        // Validate that the optimized plan returns the exact same
+        // answer (both schema and data) as the original plan
+        let task_ctx = session_ctx.task_ctx();


This test would have caught this issue when it was introduced in #2636
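The idea behind the check can be sketched without DataFusion at all. Everything below (`DataType`, `Field`, `Batch`, `check_same_answer`, `count_batch`) is a hypothetical stand-in for the real Arrow and DataFusion types, assuming a single-column result; it only shows why comparing schema as well as data catches a type change that a data-only comparison would miss.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    UInt64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

/// Hypothetical single-column batch: a schema plus the column's values.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub schema: Vec<Field>,
    pub values: Vec<i64>,
}

/// Returns an error describing the first difference in schema or data.
pub fn check_same_answer(original: &[Batch], optimized: &[Batch]) -> Result<(), String> {
    if original.len() != optimized.len() {
        return Err(format!(
            "batch count differs: {} vs {}",
            original.len(),
            optimized.len()
        ));
    }
    for (i, (a, b)) in original.iter().zip(optimized.iter()).enumerate() {
        if a.schema != b.schema {
            return Err(format!("batch {}: schema differs: {:?} vs {:?}", i, a.schema, b.schema));
        }
        if a.values != b.values {
            return Err(format!("batch {}: data differs: {:?} vs {:?}", i, a.values, b.values));
        }
    }
    Ok(())
}

/// Builds a one-row batch holding a count, using the column name from the diff.
pub fn count_batch(data_type: DataType, count: i64) -> Batch {
    Batch {
        schema: vec![Field {
            name: "COUNT(UInt8(1))".to_string(),
            data_type,
        }],
        values: vec![count],
    }
}
```

In this sketch, two batches with the same count but different declared types are reported as a schema mismatch.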

alamb · 2022-06-01T13:56:17Z

datafusion/core/src/physical_optimizer/aggregate_statistics.rs

        Ok(())
    }

+    /// Normalize record batches for comparison:
+    /// 1. Sets nullable to `true`
+    fn normalize(batches: Vec<RecordBatch>) -> Vec<RecordBatch> {


This is stupid but necessary to pass the tests

alamb · 2022-06-01T13:56:56Z

datafusion/core/src/physical_optimizer/aggregate_statistics.rs

        };
-        Arc::new(Count::new(expr, "my_count_alias", DataType::UInt64))
+
+        Arc::new(Count::new(expr, name, DataType::Int64))


Now that the schema is checked, we can't use some arbitrary column name; we need to use the actual name the plan would use

alamb · 2022-06-01T13:57:33Z

datafusion/sql/src/planner.rs

            AggregateFunction::Count => function
                .args
                .into_iter()
                .map(|a| match a {
                    FunctionArg::Unnamed(FunctionArgExpr::Expr(SQLExpr::Value(
                        Value::Number(_, _),
-                    ))) => Ok(lit(1_u8)),
-                    FunctionArg::Unnamed(FunctionArgExpr::Wildcard) => Ok(lit(1_u8)),
+                    ))) => Ok(Expr::Literal(COUNT_STAR_EXPANSION.clone())),


This is a readability improvement: naming the constant makes what is happening more explicit

alamb · 2022-06-01T13:59:31Z

datafusion/core/src/physical_optimizer/aggregate_statistics.rs

                    return Some((
-                        ScalarValue::UInt64(Some(num_rows as u64)),
-                        "COUNT(UInt8(1))",
+                        ScalarValue::Int64(Some(num_rows as i64)),


The change from UInt64 to Int64 here and a few lines below is the actual bug fix / change of behavior -- the rest of this PR is testing / readability improvements
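To make the nature of the fix concrete, here is a small self-contained sketch. It assumes the COUNT aggregate declares `Int64` as its output type (as the updated `Count::new(.., DataType::Int64)` call in the tests suggests); the enums and the functions `count_from_stats_before`, `count_from_stats_after`, and `preserves_output_type` are hypothetical stand-ins, not DataFusion APIs.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Int64,
    UInt64,
}

// Hypothetical, trimmed-down stand-in for DataFusion's `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue {
    Int64(Option<i64>),
    UInt64(Option<u64>),
}

impl ScalarValue {
    pub fn data_type(&self) -> DataType {
        match self {
            ScalarValue::Int64(_) => DataType::Int64,
            ScalarValue::UInt64(_) => DataType::UInt64,
        }
    }
}

/// Assumption: the type the COUNT aggregate declares for its output.
pub const COUNT_OUTPUT_TYPE: DataType = DataType::Int64;

/// Before the fix (sketch): the row count from statistics was emitted as UInt64.
pub fn count_from_stats_before(num_rows: usize) -> ScalarValue {
    ScalarValue::UInt64(Some(num_rows as u64))
}

/// After the fix (sketch): the row count is emitted as Int64, mirroring the diff.
pub fn count_from_stats_after(num_rows: usize) -> ScalarValue {
    ScalarValue::Int64(Some(num_rows as i64))
}

/// True when the literal substituted by the optimizer keeps the declared type.
pub fn preserves_output_type(replacement: &ScalarValue) -> bool {
    replacement.data_type() == COUNT_OUTPUT_TYPE
}
```

Under that assumption, only the post-fix literal leaves the plan's output type unchanged.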

tustvold

Looks good to me, minor comment about test readability.

I also wonder whether the fact that it isn't always non-nullable is actually a bug in the aggregate function: can count(..) ever return NULL? I thought it just skipped nulls.

tustvold · 2022-06-02T11:48:14Z

datafusion/core/src/physical_optimizer/aggregate_statistics.rs

        let conf = session_ctx.copied_config();
-        let optimized = AggregateStatistics::new().optimize(Arc::new(plan), &conf)?;
+        let plan = Arc::new(plan) as _;
+        let optimized = AggregateStatistics::new().optimize(Arc::clone(&plan), &conf)?;

        let (col, count) = match nulls {


I was very confused by what this parameter controls; should it not be something like column: Option<&str> instead?

I believe it is really controlling count(*) vs COUNT(col) -- I consolidated the differences in eb14658 into a TestAggregate struct and I think it is much more understandable now

tustvold · 2022-06-02T11:51:49Z

datafusion/core/src/physical_optimizer/aggregate_statistics.rs

+        // answer (both schema and data) as the original plan
+        let task_ctx = session_ctx.task_ctx();
+        let plan_result = common::collect(plan.execute(0, task_ctx)?).await?;
+        assert_eq!(normalize(result), normalize(plan_result));


A couple of lines above there is

assert_eq!(result[0].schema(), Arc::new(Schema::new(vec![col])));

This would suggest to me that the result has a single column and a single field. Perhaps we could just do something like

    let expected_a_schema = ..;
    let expected_b_schema = ..;
    for (a, b) in result.iter().zip(plan_result) {
        assert_eq!(a.column(0), b.column(0));
        assert_eq!(a.schema(), expected_a_schema);
        assert_eq!(b.schema(), expected_b_schema);
    }

I think the normalization logic is a little bit hard to follow...

I removed the normalization in 171c899 and I think it is much simpler to follow now

alamb · 2022-06-02T19:10:59Z

datafusion/core/src/physical_optimizer/aggregate_statistics.rs

-        };
-        Arc::new(Count::new(expr, "my_count_alias", DataType::UInt64))
+    /// Describe the type of aggregate being tested
+    enum TestAggregate {


This now parameterizes the difference between different tests into an explicit enum rather than implicit assumptions. I think it makes the tests easier to follow
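A minimal sketch of that pattern follows. The variant names, the `"COUNT(a)"` column name, and the `column_name` and `expected_count` methods are assumptions for illustration and may not match the merged code; `COUNT_STAR_NAME` and its value are taken from the diff. The point is that each test case carries its own expectations instead of relying on a boolean flag.

```rust
/// Column name produced for `COUNT(*)`, as named in the diff.
pub const COUNT_STAR_NAME: &str = "COUNT(UInt8(1))";

/// Describes the type of aggregate being tested (hypothetical variants).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TestAggregate {
    /// `COUNT(*)`: counts every row.
    CountStar,
    /// `COUNT(a)`: counts only rows where column `a` is not null.
    ColumnA,
}

impl TestAggregate {
    /// Output column name the plan is expected to produce.
    pub fn column_name(self) -> &'static str {
        match self {
            TestAggregate::CountStar => COUNT_STAR_NAME,
            TestAggregate::ColumnA => "COUNT(a)",
        }
    }

    /// Expected count for the given contents of column `a`.
    pub fn expected_count(self, column_a: &[Option<i32>]) -> i64 {
        match self {
            TestAggregate::CountStar => column_a.len() as i64,
            TestAggregate::ColumnA => {
                column_a.iter().filter(|v| v.is_some()).count() as i64
            }
        }
    }
}
```

A test can then loop over both variants and compare the optimized plan's output against `column_name()` and `expected_count()`.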

codecov-commenter · 2022-06-02T19:35:41Z

Codecov Report

Merging #2674 (fde3cc4) into master (f547262) will increase coverage by 0.01%.
The diff coverage is 98.59%.

@@            Coverage Diff             @@
##           master    #2674      +/-   ##
==========================================
+ Coverage   84.69%   84.70%   +0.01%     
==========================================
  Files         267      267              
  Lines       47004    47036      +32     
==========================================
+ Hits        39810    39843      +33     
+ Misses       7194     7193       -1

Impacted Files	Coverage Δ
datafusion/expr/src/utils.rs	`91.86% <ø> (ø)`
datafusion/sql/src/planner.rs	`81.56% <66.66%> (-0.04%)`	⬇️
...ore/src/physical_optimizer/aggregate_statistics.rs	`100.00% <100.00%> (ø)`
datafusion/core/tests/custom_sources.rs	`83.90% <100.00%> (ø)`
datafusion/common/src/scalar.rs	`74.94% <0.00%> (+0.11%)`	⬆️
datafusion/core/src/physical_plan/metrics/value.rs	`87.43% <0.00%> (+0.50%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Fix AggregateStatistics optimization so it doens't change output type

0663979

github-actions bot added the core (Core DataFusion crate) and datafusion (Changes in the datafusion crate) labels Jun 1, 2022

alamb added 2 commits June 1, 2022 09:37

fix test

b82c30e

Give some constants symbolic names to improve readability

85e54f2

github-actions bot added the logical-expr (Logical plan and expressions) and sql (SQL Planner) labels Jun 1, 2022

alamb marked this pull request as ready for review June 1, 2022 13:57

alamb requested review from jimexist, andygrove and tustvold June 1, 2022 13:58

alamb mentioned this pull request Jun 2, 2022

Proposal: remove automated ballista CI checks from DataFusion #2679

Closed

tustvold changed the title ~~Fix AggregateStatistics optimization so it doens't change output type~~ Fix AggregateStatistics optimization so it doesn't change output type Jun 2, 2022

tustvold approved these changes Jun 2, 2022

alamb added 4 commits June 2, 2022 14:07

Merge remote-tracking branch 'apache/master' into alamb/fix_optimizat…

5948c0e

…ion_type

Consolidate expected differences in COUNT(*) and COUNT(a) in tests

eb14658

Simplify how the verification is done

171c899

fmt

fde3cc4

alamb merged commit 8ddd99c into apache:master Jun 2, 2022

alamb deleted the alamb/fix_optimization_type branch June 2, 2022 21:18
