Add sum statistics and PhysicalExpr::column_statistics #13736
Is there any combined script to run all the linting checks at once? I don't want to burn all your CI credits!
Could I please grab another CI approval for this? I think I've run everything locally now.
Thank you! (Though we are likely spoiled, as the Apache Software Foundation has lots of credits. Thank you GitHub!) The scripts in https://github.com/apache/datafusion/tree/main/ci/scripts can be used to run the tests locally.
Very cool. Thank you @gatesn -- this is a really neat idea.
Sum statistics are not available in Parquet, but I can easily see other file formats providing them, so I think this is a good addition to DataFusion.
I really like that this PR integrates nicely with the existing AggregateStatistics pass / value_from_statistics pass.
There are a few related pieces of functionality:
- The work that @suremarc is looking into for Statistics (that has the potential to change Precision -- RFC: Add Precision:AtLeast and Precision::AtMost for more Statistics … precision #13293 (comment))
- The effect of making ColumnStatistics even larger (each ScalarValue is already quite large I think, so adding another potential field may make statistics management even worse). Again, maybe this will be handled by the revamp that @suremarc is looking into.
In terms of testing, I think we should create an "end to end" type test that shows registering a TableProvider that can provide sum statistics that the optimizer uses to optimize away the actual aggregates.
I will go spend some time trying to write up what I think the consensus for Statistics is.
        return None;
    }

    if let Precision::Exact(num_rows) = &statistics_args.statistics.num_rows {
this is a very neat idea
@@ -149,6 +151,11 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + DynEq + DynHash {
    fn get_properties(&self, _children: &[ExprProperties]) -> Result<ExprProperties> {
        Ok(ExprProperties::new_unknown())
    }

    /// Return the column statistics of this expression given the statistics of the input
This API seems somewhat overlapping with PhysicalExpr::evaluate_bounds.
I wonder if there is any way to combine the ideas with the Statistics changes @suremarc is looking into in #13293 (comment) 🤔
I wrote up a unified proposal here:
- Introduce a way to represent constrained statistics / bounds on values in Statistics #8078 (comment)
Feedback very welcome
(BTW thank you for this well structured, well documented PR -- it was very easy to read and understand)
Aha, I might know one of those file formats! Here's the ClickBench diff (give or take quite a bit of noise): https://github.com/spiraldb/vortex/actions/runs/12297901942#summary-34320120073 It is easy to see that Q2 and Q29 drop to constant time, and I think a few of the other queries benefit from AVG aggregations.
I had actually considered other optimizer rules that might enable this, for example rewriting
We ran into similar performance issues in Vortex in fact. We've found that actually having a
(This reminds me, a sum statistic also doubles as a true/false count for boolean arrays. I imagine there are more optimizations in DataFusion that could benefit from a pre-computed true/false count.)
Is having custom statistics something that DataFusion might support? For example, I could declare a custom statistic along with a custom optimizer rule that makes use of it. I can also see the opposite argument that if a statistic is in any way useful, then DataFusion should add support for it internally, and therefore it doesn't need extensible stats.
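The remark above that a sum statistic doubles as a true/false count for boolean arrays can be sketched as follows. This is only an illustration: `BoolColumnStats` and its fields are hypothetical names, not DataFusion or Vortex types, and the sum is assumed to count `true` as 1 and `false` as 0.

```rust
/// Hypothetical statistics for one boolean column; not a DataFusion type.
#[derive(Debug, Clone, Copy)]
struct BoolColumnStats {
    num_rows: u64,
    null_count: u64,
    /// Sum with `true` counted as 1 and `false` as 0; `None` if unknown.
    sum: Option<u64>,
}

impl BoolColumnStats {
    /// The sum of a 0/1 column is exactly the number of `true` values.
    fn true_count(&self) -> Option<u64> {
        self.sum
    }

    /// Every non-null value that is not `true` must be `false`.
    fn false_count(&self) -> Option<u64> {
        let non_null = self.num_rows.checked_sub(self.null_count)?;
        non_null.checked_sub(self.sum?)
    }
}

fn main() {
    let stats = BoolColumnStats { num_rows: 10, null_count: 2, sum: Some(5) };
    println!("true: {:?}, false: {:?}", stats.true_count(), stats.false_count());
}
```

Both counts come back as `None` when the sum is unknown or the inputs are inconsistent, so a caller can fall back to scanning the data.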
I will add one of these!
Do you consider this to be blocking for this PR? Or is expanding the size of ColumnStatistics acceptable in the short-term?
Extending statistics to support user defined data seems very reasonable to me. A good test in my mind, to avoid APIs that can't actually be used in the real world, is to try and make some sort of example showing how someone would actually use it (e.g. maybe pass the custom statistics into a user defined function that can take advantage of it somehow?)
I don't consider it blocking per se, especially if we are (finally) going to get the project to revamp Statistics moving again. I would like to get some consensus on what we want to do with Statistics / range / interval evaluation on statistics so that we don't end up with multiple incompatible, partially overlapping features. Thank you again.
I have been thinking a lot about this PR and I don't want to let it die because we are stuck trying to figure out a broader statistics question. I would like to find an incremental way forward. Here is my proposal:
I recommend:
Apologies, I got drawn into some other things over the break. Time to get this moving again! Sounds like a plan. Here's the first PR: https://github.com/apache/datafusion/pull/14074/files
Which issue does this PR close?
Fixes #992
Rationale for this change
Some statistics can propagate through expressions, such as min, max and sum.
In this particular case, I was looking at the ClickBench Q29 and realized we had no way to report sum statistics to DataFusion (which would also help for avg).
Q29 looks like this btw:
And with correctly reported sum statistics, both Q2 and Q29 collapse down to O(1).
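To make the rationale above concrete, here is a minimal sketch of how an exact sum can propagate through an expression of the form `col + c` and answer SUM or AVG without touching any rows. It is not the code in this PR: `Precision` below is a simplified stand-in for DataFusion's `Precision<T>` (two variants only, fixed to `i64`), and both helper functions are hypothetical names chosen for illustration.

```rust
/// Simplified stand-in for DataFusion's `Precision<T>`; only the two
/// variants this sketch needs, fixed to `i64`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision {
    Exact(i64),
    Absent,
}

/// Hypothetical helper: the sum of `col + c`, given the sum of `col` and the
/// number of non-null rows. Each non-null row contributes one extra `c`, so
/// sum(col + c) = sum(col) + c * non_null_rows.
fn sum_of_col_plus_literal(sum: Precision, c: i64, non_null_rows: Precision) -> Precision {
    match (sum, non_null_rows) {
        (Precision::Exact(s), Precision::Exact(n)) => c
            .checked_mul(n)
            .and_then(|delta| s.checked_add(delta))
            // On overflow the statistic is dropped rather than reported wrong.
            .map_or(Precision::Absent, Precision::Exact),
        _ => Precision::Absent,
    }
}

/// Hypothetical helper: AVG(col) answered from statistics alone.
fn avg_from_stats(sum: Precision, non_null_rows: Precision) -> Option<f64> {
    match (sum, non_null_rows) {
        (Precision::Exact(s), Precision::Exact(n)) if n > 0 => Some(s as f64 / n as f64),
        _ => None,
    }
}

fn main() {
    let sum = Precision::Exact(1_500_000);
    let rows = Precision::Exact(1_000);
    println!("{:?}", sum_of_col_plus_literal(sum, 3, rows));
    println!("{:?}", avg_from_stats(sum, rows));
    // Overflow: the result degrades to `Absent`.
    println!("{:?}", sum_of_col_plus_literal(Precision::Exact(i64::MAX), 1, Precision::Exact(1)));
}
```

The cost of each helper is independent of the number of rows, which is the sense in which such aggregates become O(1) once the source reports an exact sum and row count.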
What changes are included in this PR?
PhysicalExpr has a new defaulted trait function column_statistics that takes a Statistics and returns statistics for the columnar result of the expression. (Unlike the linked issue, which proposes returning a full Statistics object.)
Further, this PR adds a sum statistic to demonstrate the value of propagation (that turns into Precision::Absent on overflow).
Are these changes tested?
Are there any user-facing changes?