A framework for expression boundary analysis (and statistics) #3898
Comments
I love this idea @isidentical -- thank you for filing it -- I am still not quite sure about how well keeping I also wonder what "apply"ing boundaries to a |
Thanks @alamb (and for all your feedback during the initial design) ❤️
It also bothers me a bit, so maybe we can iterate on it to see if there is a simpler solution that can help us to solve the following problem without having
let expr = parse("a >= 20");
let mut context = Context { column_boundaries: [Boundary {min: 1, max: 100}] }; // this can be constructed from statistics as well
let boundaries = expr.analyze(&mut context);
assert!(context.column_boundaries[0].min == 20);
If we want the condition above to succeed (considering we now know that The most simple solution that I can think of is actually checking whether This is where
If left were a simple column, it would look the same:
And if any of the expressions in the next conjunction (or actually anything that shares the same context) references
Solely for |
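For reference, the `a >= 20` scenario from this comment can be written out as a minimal, self-contained sketch. The `Boundary` and `Context` shapes follow the snippet above; the `GtEq` struct and the body of `analyze` are assumptions made for illustration only, not the implementation from #3868 or #3912.

```rust
/// Closed range of values a column may take (shape taken from the snippet above).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Boundary {
    min: i64,
    max: i64,
}

/// Knowledge about every column, shared by all expressions analyzed with it.
#[derive(Debug, Clone, PartialEq)]
struct Context {
    column_boundaries: Vec<Boundary>,
}

/// Hypothetical expression node for `column >= literal`.
struct GtEq {
    column: usize,
    literal: i64,
}

impl GtEq {
    /// Returns the (min, max) of the boolean result and, as a side effect,
    /// narrows the column's range in the shared context.
    fn analyze(&self, context: &mut Context) -> (bool, bool) {
        let col = &mut context.column_boundaries[self.column];
        // min: true only if every value passes; max: true if any value can pass.
        let result = (col.min >= self.literal, col.max >= self.literal);
        if result.1 {
            // Rows that satisfy the predicate cannot be below the literal.
            col.min = col.min.max(self.literal);
        }
        // If the predicate can never hold, the context is left untouched here;
        // a real implementation would need to decide how to represent that.
        result
    }
}
```

With a context holding `Boundary { min: 1, max: 100 }` for column 0, analyzing `GtEq { column: 0, literal: 20 }` yields `(false, true)` and leaves the column's `min` at 20, which is the condition the assertion above asks for.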
One thing I am not super sure of is whether we want to keep it inside the |
One benefit of keeping it inside As long as they have reasonable default implementations (that return unknown ExprBounds) I think it would be fine |
A general overview of what happened during the discussions (in regards to the expression analysis part) and the meetup:
Although there isn't any concrete work, this was also part of the proposed 'future' section and we might be able to turn it into a reality once we have the general system ready (this is what Spark did IIRC, they initially implemented with basic ranges and then moved over with histograms when available).
This can be supported with the existing framework, but actually needs an implementation 😇 Currently for binary expressions, we have a code path that is taken when we know either side has a scalar boundary (e.g.
I'd be fine with this though just from a purely aesthetic point of view it looks a bit hard to parse 😄 I'm happy to be convinced though. Let's discuss it in the code review again! |
If I didn't miss anything, that should be all and we should be ready to at least put the foundational work in. @alamb does it make sense to revive my existing PR #3912 (fix conflicts, add a bit more documentation in the code, etc.) and continue working on this iteratively for the next steps (like other expression types)? |
I think this makes sense. If possible, I would recommend trying to get the basic API sketched out / commented well and a few examples -- keeping the PR small is the key I think to getting as many eyes on it as possible. Thank you for driving this forward |
As we have the foundations now, we should be able to close this. New tickets regarding filter selectivity analysis and the overall expression analysis will be tracked in the meta issue of #3929. |
Thank you for driving this forward @isidentical 🚀 |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
With the addition of the ExprBounds API in #3868, we should be able to transform it into a toolkit that optimizations like predicate pruning (or range-based pruning) can leverage.
Describe the solution you'd like
A common entry point for each physical expression that can take an analysis context as its input and return the boundaries for that expression. Each boundary corresponds to the range of values an expression can evaluate into.
ctx.cols[a].min | ctx.cols[a].max
bounds(a).min > 5 (bool) | bounds(a).max > 5 (bool)
bounds(Q1).min + bounds(Q2).min | bounds(Q1).max + bounds(Q2).max
bounds(Q1 + Q2).max | bounds(Q1 + Q2).max
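The min/max rules listed above can be sketched in a few lines of Rust. This is only an illustration under assumptions: `Bounds`, `column_bounds`, `gt_bounds` and `add_bounds` are hypothetical names, columns are plain `i64` ranges, and arithmetic overflow is ignored.

```rust
/// Range of values an expression can evaluate to (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounds<T> {
    min: T,
    max: T,
}

/// A column reference reads its bounds straight from the context
/// (here simply a slice indexed by column position).
fn column_bounds(ctx: &[Bounds<i64>], index: usize) -> Bounds<i64> {
    ctx[index]
}

/// `input > literal`: the result is boolean, so its bounds are booleans too.
/// min is true only if every value passes, max is true if any value can pass.
fn gt_bounds(input: Bounds<i64>, literal: i64) -> Bounds<bool> {
    Bounds {
        min: input.min > literal,
        max: input.max > literal,
    }
}

/// `left + right`: add the lower bounds and the upper bounds pairwise.
/// Overflow is not handled in this sketch.
fn add_bounds(left: Bounds<i64>, right: Bounds<i64>) -> Bounds<i64> {
    Bounds {
        min: left.min + right.min,
        max: left.max + right.max,
    }
}
```

For a column `a` in `[1, 100]`, `gt_bounds(a, 5)` gives `min: false, max: true` (a partial overlap), and adding a second column in `[10, 20]` gives `[11, 120]`.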
Each expression also leaves behind some partial metadata on what is happening inside its own context. Consider an analysis context which is basically the following:
This is essential for cases like the one above: even though the boundaries are the same (the result is either false or true, since there is only a partial overlap), we learn a lot about what 'a' is after we process this expression. Each expression can either share all of the context with its children (e.g. in the case of AND, they all share a single context, so all the following expressions know that a, for example, can only be greater than 6 at this point) or fork the context and then merge the forks using its own logic (OR is the most basic example, where you have to take the union of the ranges since it could be either of them).
Having a simple API (like the following) as well as a shared context seems to fulfil the majority of the requirements we have for filter selectivity analysis. Another place where the context could add a bit more flexibility is histograms (if we ever support them, which Ballista might be heading towards), where the context would also know the distribution of column a, so ColumnExpr can simply learn it from the analysis context.
Proposed APIs
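As a rough illustration of the context sharing described above (not the API proposed in this issue), the sketch below shows AND passing one context through both children and OR forking it and merging the forks by taking the union of the ranges. All names (`Range`, `Context`, `Expr`, `narrow`) are hypothetical, and columns are assumed to hold integers so that `a > 6` can be tightened to `a >= 7`.

```rust
/// Closed integer range known for a column (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Range {
    min: i64,
    max: i64,
}

/// Analysis context: one range per column.
#[derive(Debug, Clone, PartialEq)]
struct Context {
    columns: Vec<Range>,
}

/// Tiny predicate language, just enough to show AND and OR.
enum Expr {
    Gt(usize, i64), // column > literal
    Lt(usize, i64), // column < literal
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

/// Narrows `ctx` to what must hold for rows that satisfy `expr`.
fn narrow(expr: &Expr, ctx: &mut Context) {
    match expr {
        Expr::Gt(col, lit) => {
            let range = &mut ctx.columns[*col];
            range.min = range.min.max(lit.saturating_add(1));
        }
        Expr::Lt(col, lit) => {
            let range = &mut ctx.columns[*col];
            range.max = range.max.min(lit.saturating_sub(1));
        }
        // AND: both children share the same context, so the right side
        // already sees everything the left side learned.
        Expr::And(left, right) => {
            narrow(left, ctx);
            narrow(right, ctx);
        }
        // OR: fork the context per child, then merge by taking the smallest
        // range covering both forks (an over-approximation of the union).
        Expr::Or(left, right) => {
            let mut left_ctx = ctx.clone();
            let mut right_ctx = ctx.clone();
            narrow(left, &mut left_ctx);
            narrow(right, &mut right_ctx);
            for (i, merged) in ctx.columns.iter_mut().enumerate() {
                merged.min = left_ctx.columns[i].min.min(right_ctx.columns[i].min);
                merged.max = left_ctx.columns[i].max.max(right_ctx.columns[i].max);
            }
        }
    }
}
```

Starting from `a` in `[1, 100]`, `a > 6 AND a < 50` narrows the shared context to `[7, 49]`, while `(a > 10 AND a < 20) OR (a > 30 AND a < 40)` merges the forks `[11, 19]` and `[31, 39]` into `[11, 39]`. A single range cannot express the hole between the two forks, which is one place where histograms or multi-range boundaries could do better.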
Additional context
Originating from #3868 (which implements a more rudimentary version of the proposed framework), with discussion points from @alamb on the possible unification/consolidation of pruning logic (as well as a gateway to more optimizations).
Potential features include:
a > b
)