-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix AggregateStatistics
optimization so it doesn't change output type
#2674
Conversation
@@ -37,6 +38,9 @@ use crate::error::Result; | |||
#[derive(Default)] | |||
pub struct AggregateStatistics {} | |||
|
|||
/// The name of the column corresponding to [`COUNT_STAR_EXPANSION`] | |||
const COUNT_STAR_NAME: &str = "COUNT(UInt8(1))"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constant was hard coded in a few places and I think this symbolic name helps understand what it is doing
@@ -148,10 +152,10 @@ fn take_optimizable_table_count( | |||
.as_any() | |||
.downcast_ref::<expressions::Literal>() | |||
{ | |||
if lit_expr.value() == &ScalarValue::UInt8(Some(1)) { | |||
if lit_expr.value() == &COUNT_STAR_EXPANSION { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was an implicit coupling between the SQL planner and this file, which I have now made explicit with a named constant
|
||
// Validate that the optimized plan returns the exact same | ||
// answer (both schema and data) as the original plan | ||
let task_ctx = session_ctx.task_ctx(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test would have caught this issue when it was introduced in #2636
Ok(()) | ||
} | ||
|
||
/// Normalize record batches for comparison: | ||
/// 1. Sets nullable to `true` | ||
fn normalize(batches: Vec<RecordBatch>) -> Vec<RecordBatch> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is stupid but necessary to pass the tests
}; | ||
Arc::new(Count::new(expr, "my_count_alias", DataType::UInt64)) | ||
|
||
Arc::new(Count::new(expr, name, DataType::Int64)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that the schema is checked, we can't use some arbitrary column name, we need to use the actual name the plan would
AggregateFunction::Count => function | ||
.args | ||
.into_iter() | ||
.map(|a| match a { | ||
FunctionArg::Unnamed(FunctionArgExpr::Expr(SQLExpr::Value( | ||
Value::Number(_, _), | ||
))) => Ok(lit(1_u8)), | ||
FunctionArg::Unnamed(FunctionArgExpr::Wildcard) => Ok(lit(1_u8)), | ||
))) => Ok(Expr::Literal(COUNT_STAR_EXPANSION.clone())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a readability improvement to name a constant to make what is happening more explicit
return Some(( | ||
ScalarValue::UInt64(Some(num_rows as u64)), | ||
"COUNT(UInt8(1))", | ||
ScalarValue::Int64(Some(num_rows as i64)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change from UInt64
to Int64
here and a few lines below is the actual bug fix / change of behavior -- the rest of this PR is testing / readability improvements
AggregateStatistics
optimization so it doens't change output typeAggregateStatistics
optimization so it doesn't change output type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, minor comment about test readability.
I also wonder if the fact it isn't always not nullable is actually a bug in the aggregate function, can count(..)
ever return NULL? I thought it just skipped nulls?
let conf = session_ctx.copied_config(); | ||
let optimized = AggregateStatistics::new().optimize(Arc::new(plan), &conf)?; | ||
let plan = Arc::new(plan) as _; | ||
let optimized = AggregateStatistics::new().optimize(Arc::clone(&plan), &conf)?; | ||
|
||
let (col, count) = match nulls { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was very confused by what this parameter controls, should it not be something like column: Option<&str>
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it is really controlling count(*)
vs COUNT(col)
-- I consolidated the differences in eb14658 into a TestAggregate
struct and I think it is much more understandable now
// answer (both schema and data) as the original plan | ||
let task_ctx = session_ctx.task_ctx(); | ||
let plan_result = common::collect(plan.execute(0, task_ctx)?).await?; | ||
assert_eq!(normalize(result), normalize(plan_result)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of lines above there is
assert_eq!(result[0].schema(), Arc::new(Schema::new(vec![col])));
This would suggest to me that the result has a single column and a single field. Perhaps we could just do something like
let expected_a_schema = ..;
let expected_b_schema = ..;
for (a, b) in result.iter().zip(plan_result) {
assert_eq!(a.column(0), b.column(0);
assert_eq!(a.schema(), expected_a_schema);
assert_eq!(b.schema(), expected_b_schema);
}
I think the normalization logic is a little bit hard to follow...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the normalization in 171c899 and I think it is much simpler to follow now
}; | ||
Arc::new(Count::new(expr, "my_count_alias", DataType::UInt64)) | ||
/// Describe the type of aggregate being tested | ||
enum TestAggregate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This now parameterizes the difference between different tests into an explicit enum
rather than implicit assumptions. I think it makes the tests easier to follow
Codecov Report
@@ Coverage Diff @@
## master #2674 +/- ##
==========================================
+ Coverage 84.69% 84.70% +0.01%
==========================================
Files 267 267
Lines 47004 47036 +32
==========================================
+ Hits 39810 39843 +33
+ Misses 7194 7193 -1
Continue to review full report at Codecov.
|
Which issue does this PR close?
Closes #2673
Rationale for this change
See #2673 -- the optimizer pass is changing input types.
What changes are included in this PR?
COUNT(*)
expansion symbolic names to improve readabilityAre there any user-facing changes?
less bugs
Does this PR break compatibility with Ballista?
No