
A framework for expression boundary analysis (and statistics) #3898

Closed
Tracked by #3929
isidentical opened this issue Oct 20, 2022 · 9 comments
Labels
enhancement New feature or request

Comments

@isidentical
Contributor

isidentical commented Oct 20, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
With the addition of the ExprBounds API in #3868, we should be able to turn it into a toolkit that optimizations like predicate pruning (or range-based pruning) can leverage.

Describe the solution you'd like
A common entry point for each physical expression that takes an analysis context as input and returns the boundaries for that expression. Each boundary corresponds to the range of values the expression can evaluate to.

Expression   | min                             | max
a            | ctx.cols[a].min                 | ctx.cols[a].max
5            | 5                               | 5
a > 5        | bounds(a).min > 5 (bool)        | bounds(a).max > 5 (bool)
Q1 + Q2      | bounds(Q1).min + bounds(Q2).min | bounds(Q1).max + bounds(Q2).max
max(Q1 + Q2) | bounds(Q1 + Q2).max             | bounds(Q1 + Q2).max
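To make the composition rule concrete, here is a minimal, hypothetical sketch of the interval arithmetic in the table above, using plain (min, max) tuples instead of the actual boundary types (the function names are invented for illustration):

// Boundaries modeled as plain (min, max) tuples, purely for illustration.
type Bounds = (f64, f64);

// Q1 + Q2: add the minimums and add the maximums, as in the table above.
fn add_bounds(q1: Bounds, q2: Bounds) -> Bounds {
    (q1.0 + q2.0, q1.1 + q2.1)
}

// a > 5: compare each endpoint against the literal to get a boolean range.
fn gt_literal_bounds(a: Bounds, lit: f64) -> (bool, bool) {
    (a.0 > lit, a.1 > lit)
}

fn main() {
    let a: Bounds = (1.0, 10.0);
    assert_eq!(add_bounds(a, (5.0, 5.0)), (6.0, 15.0));
    // a in [1, 10] compared against 5: the result can be false or true.
    assert_eq!(gt_literal_bounds(a, 5.0), (false, true));
}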

Each expression also leaves behind some partial metadata about what happens inside its own context. Consider an analysis context that looks roughly like the following:

Expression      | Context Before | Context After      | Expr Boundaries
a               | {a: [1, 10]}   | {a: [1, 10]}       | [1, 10]
a > 5           | {a: [1, 10]}   | {a: [6, 10]}       | [false, true]
a > 5 AND a < 7 | {a: [1, 10]}   | {a: [6]}           | [false, true]
a > 8 OR a = 1  | {a: [1, 10]}   | {a: [1] U [9, 10]} | [false, true]

This is essential for cases like the ones above: even though the boundaries are the same (the result is either false or true, since there is only a partial overlap), we learn a lot about what 'a' can be after processing the expression. Each expression can either share its entire context with its children (e.g. in the case of AND, both sides share a single context, so every subsequent expression knows that a, for example, can only be greater than 6 at that point), or fork the context and merge the results back together with its own logic (OR is the most basic example, where you have to take the union of the ranges since either branch could apply).
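As a heavily simplified, hypothetical sketch of that sharing/forking behaviour (Bounds, Context, analyze_and, and analyze_or are stand-ins invented here, not part of the proposal; the OR merge also collapses the result into a single covering range rather than the exact union shown in the table):

#[derive(Clone, Debug, PartialEq)]
struct Bounds {
    min: i64,
    max: i64,
}

#[derive(Clone, Debug, PartialEq)]
struct Context {
    // One known range per column (None = nothing known), mirroring the
    // proposed AnalysisContext::column_boundaries.
    column_boundaries: Vec<Option<Bounds>>,
}

// AND: both sides see (and refine) the very same context, so whatever the
// left side learns about `a` is already visible while analyzing the right side.
fn analyze_and(ctx: &mut Context, left: impl Fn(&mut Context), right: impl Fn(&mut Context)) {
    left(ctx);
    right(ctx);
}

// OR: each side works on its own fork of the context, and the forks are then
// merged. Here the merge widens each column to the smallest single range that
// covers both branches; the table above keeps the exact union ([1] U [9, 10]).
fn analyze_or(ctx: &mut Context, left: impl Fn(&mut Context), right: impl Fn(&mut Context)) {
    let mut left_ctx = ctx.clone();
    let mut right_ctx = ctx.clone();
    left(&mut left_ctx);
    right(&mut right_ctx);

    for i in 0..ctx.column_boundaries.len() {
        ctx.column_boundaries[i] =
            match (&left_ctx.column_boundaries[i], &right_ctx.column_boundaries[i]) {
                (Some(l), Some(r)) => Some(Bounds {
                    min: l.min.min(r.min),
                    max: l.max.max(r.max),
                }),
                _ => None,
            };
    }
}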

Having a simple API (like the one below), together with a shared context, seems to fulfil the majority of the requirements we have for filter selectivity analysis. Another place where the context could add a bit more flexibility is histograms (if we ever support them, which Ballista might be heading toward), where the context would also know the distribution of column a, etc., so ColumnExpr could simply learn it from the analysis context.

Proposed APIs

pub trait PhysicalExpr: Send + Sync + Display + Debug {
    [...]

    /// Return the boundaries of this expression. This method (and all the
    /// related APIs) are experimental and subject to change.
    fn analyze(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries> {
        None
    }

    /// Apply the given boundaries to this expression (and its child expressions,
    /// if they are relevant to the boundaries).
    fn apply(&self, context: &mut AnalysisContext, boundaries: &ExprBoundaries) {}
}

/// Represents boundaries of an expression, if it were to be evaluated.
#[derive(Clone, Debug, PartialEq)]
pub struct ExprBoundaries {
    /// The result type of the expression (should be same as expr.data_type())
    pub data_type: DataType,
    /// Minimum value this expression's result can have.
    pub min_value: ScalarValue,
    /// Maximum value this expression's result can have.
    pub max_value: ScalarValue,
    /// Maximum number of distinct values this expression can produce, if known.
    pub distinct_count: Option<usize>,
    /// Row selectivity of this expression if it were used as a predicate, as a
    /// value between 0 and 1.
    pub selectivity: Option<f64>,
}

/// A context for collecting and aggregating known boundaries of an expression
/// tree.
pub struct AnalysisContext {
    /// A list of known column boundaries, ordered by column index
    /// in the current schema.
    pub column_boundaries: Vec<Option<ExprBoundaries>>,
}
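
For a sense of how a caller might drive this, here is a rough sketch that assumes the trait and structs above are in scope; estimate_filter_selectivity and its signature are invented for illustration and are not part of the proposal:

use std::sync::Arc;

// Illustrative only: assumes PhysicalExpr, AnalysisContext, and ExprBoundaries
// from the proposal above are in scope.
fn estimate_filter_selectivity(
    predicate: Arc<dyn PhysicalExpr>,
    known_columns: Vec<Option<ExprBoundaries>>,
) -> Option<f64> {
    // Seed the context with whatever we already know about the input columns
    // (e.g. min/max pulled from table statistics).
    let mut context = AnalysisContext {
        column_boundaries: known_columns,
    };

    // Walk the expression tree; expressions without an implementation fall
    // back to the default and return None.
    let boundaries = predicate.analyze(&mut context)?;

    // After analyze(), context.column_boundaries also holds any narrowed
    // per-column ranges, and the predicate's own boundaries carry the
    // estimated selectivity.
    boundaries.selectivity
}

Something like a FilterExec could then feed statistics-derived column boundaries into such a helper to refine its output row-count estimate.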

Additional context
Originating from #3868 (which implements a more rudimentary version of the proposed framework), with discussion points from @alamb on the possible unification/consolidation of pruning logic (as well as a gateway to more optimizations).

Potential features include:

  • Keeping min/max exact rather than estimated (selectivity can be an estimate, but we should be able to determine min/max quite easily if we see a predicate like a > b)
@alamb
Contributor

alamb commented Oct 20, 2022

I love this idea @isidentical -- thank you for filing it -- I am still not quite sure about how well keeping column_boundaries would work in practice, but I think I would just have to see how it works in practice to find out.

I also wonder what "apply"ing boundaries to a PhysicalExpr means -- does it mean the output of the PhysicalExpr might change when run against some data? Does it mean the PhysicalExpr might change? Or is it solely to update the AnalysisContext?

@isidentical
Contributor Author

isidentical commented Oct 20, 2022

I love this idea @isidentical -- thank you for filing it --

Thanks @alamb (and for all your feedback during the initial design) ❤️

I am still not quite sure about how well keeping column_boundaries would work in practice, but I think I would just have to see how it works in practice to find out.

It also bothers me a bit, so maybe we can iterate on it and see whether there is a simpler solution that solves the following problem without apply() and column_boundaries.

let expr = parse("a >= 20");
let mut context = Context { column_boundaries: [Boundary {min: 1, max: 100}] }; // this can be constructed from statistics as well
let boundaries = expr.analyze(&mut context);
assert!(context.column_boundaries[0].min == 20);

If we want the condition above to succeed (considering we now know that a's minimum value is 20, due to a >= 20), we need a way of translating that information down to the column level (so that, if the next expression [in the same context] is a < 30, we can narrow it even further, etc.).

The simplest solution I can think of is checking whether the left side is a column and, if so, changing its boundaries directly inside the filter selectivity analysis (context.boundaries[left.index] = Boundary {min: 20, max: left.max}). But what can we do if the expression looks something like a + 1 >= 20?

This is where apply() comes into play. When the analysis for a + 1 >= 20 is complete, we go through the following steps:

  • (a + 1 >= 20's analyze) left.apply(Boundary {min: 20, max: left.max})
    • (a + 1's apply) left.apply(Boundary {min: boundaries.min - 1, max: boundaries.max - 1})
      • (a's apply) context.columns[self.index] = boundaries

If the left side were a simple column, it would look much the same:

  • (a >= 20's analyze) left.apply(Boundary {min: 20, max: left.max})
    • (a's apply) context.columns[self.index] = boundaries

And if any expression in the next conjunction (or really anything that shares the same context) references a, it will simply use the updated boundaries without any extra work.
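
Here is a hypothetical, heavily simplified sketch of that cascade for a + 1 >= 20; the Analyzable trait, Column, and AddLiteral types are stand-ins invented for illustration, while the real proposal would operate on PhysicalExpr, ScalarValue, and the ExprBoundaries struct above:

#[derive(Clone, Copy, Debug, PartialEq)]
struct Bounds {
    min: i64,
    max: i64,
}

struct Context {
    columns: Vec<Bounds>,
}

trait Analyzable {
    fn apply(&self, ctx: &mut Context, bounds: Bounds);
}

/// A bare column reference, e.g. `a`.
struct Column {
    index: usize,
}

impl Analyzable for Column {
    fn apply(&self, ctx: &mut Context, bounds: Bounds) {
        // Leaf case: the narrowed range lands directly in the shared context.
        ctx.columns[self.index] = bounds;
    }
}

/// `child + literal`, e.g. `a + 1`.
struct AddLiteral {
    child: Box<dyn Analyzable>,
    literal: i64,
}

impl Analyzable for AddLiteral {
    fn apply(&self, ctx: &mut Context, bounds: Bounds) {
        // Invert the `+ literal` before pushing the range further down:
        // if `a + 1` is in [20, 101], then `a` is in [19, 100].
        self.child.apply(
            ctx,
            Bounds {
                min: bounds.min - self.literal,
                max: bounds.max - self.literal,
            },
        );
    }
}

fn main() {
    // a is initially known to be in [1, 100].
    let mut ctx = Context {
        columns: vec![Bounds { min: 1, max: 100 }],
    };
    let expr = AddLiteral {
        child: Box::new(Column { index: 0 }),
        literal: 1,
    };

    // The comparison's analyze() step would call this once it derives that
    // `a + 1` must be at least 20 for the predicate to hold.
    expr.apply(&mut ctx, Bounds { min: 20, max: 101 });
    assert_eq!(ctx.columns[0], Bounds { min: 19, max: 100 });
}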

I also wonder what "apply"ing boundaries to a PhysicalExpr means -- does it mean the output of the PhysicalExpr might change when run against some data? Does it mean the PhysicalExpr might change? Or is it solely to update the AnalysisContext?

Solely for AnalysisContext.

@isidentical
Contributor Author

One thing I am not super sure of is whether we want to keep this inside PhysicalExpr (as two new methods) or in a separate protocol (a new trait, PhysicalExprAnalyzer, which PhysicalExpr would return and which would carry the apply/update methods). I am thinking that if it is only two methods, they might as well live inside PhysicalExpr, which would also remove a level of indirection when we are actually doing the analysis.

@alamb
Contributor

alamb commented Oct 20, 2022

One benefit of keeping it inside PhysicalExpr is that anyone who has an extension for PhysicalExpr can also provide their own expression analysis.

As long as they have reasonable default implementations (that return unknown ExprBounds), I think it would be fine.

@isidentical
Contributor Author

A general overview of what came out of the discussions (regarding the expression analysis part) and the meetup:

  • Better value distributions rather than just knowing the [min, max].

Although there isn't any concrete work yet, this was also part of the proposed 'future' section, and we might be able to turn it into a reality once we have the general system ready (this is what Spark did, IIRC: it initially implemented basic ranges and then moved over to histograms when they became available).

  • Support for comparing two expressions (a > b) with different ranges.

This can be supported by the existing framework, but it still needs an implementation 😇 Currently, for binary expressions, we have a code path that is taken when either side has a scalar boundary (e.g. a=[20, 20]; b=[10, 30]), but we can introduce a second one very easily depending on how it should work 👀 It requires further research, but I don't think it needs to block this ticket (it can actually be a part of it); see the sketch after this list.

  • Not passing a mutable handle for the context but rather giving it up completely.

I'd be fine with this, though from a purely aesthetic point of view it looks a bit hard to parse 😄 I'm happy to be convinced, though. Let's discuss it again in the code review!
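
Regarding the a > b point above, one possible shape for that second code path, sketched with hypothetical stand-in types (this is not what #3912 implements):

#[derive(Clone, Copy, Debug, PartialEq)]
struct Bounds {
    min: i64,
    max: i64,
}

/// Returns the (min, max) of the boolean result of `a > b` when both sides
/// are ranges rather than scalars.
fn gt_bounds(a: Bounds, b: Bounds) -> (bool, bool) {
    if a.min > b.max {
        // every possible value of a is greater than every possible value of b
        (true, true)
    } else if a.max <= b.min {
        // no value of a can exceed any value of b
        (false, false)
    } else {
        // the ranges overlap, so the result could go either way
        (false, true)
    }
}

fn main() {
    assert_eq!(
        gt_bounds(Bounds { min: 20, max: 20 }, Bounds { min: 10, max: 30 }),
        (false, true)
    );
    assert_eq!(
        gt_bounds(Bounds { min: 40, max: 50 }, Bounds { min: 10, max: 30 }),
        (true, true)
    );
}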

@isidentical
Contributor Author

If I didn't miss anything, that should be all, and we should be ready to at least put the foundational work in. @alamb, does it make sense to revive my existing PR #3912 (fix conflicts, add a bit more documentation to the code, etc.) and continue working on this iteratively for the next steps (like other expression types)?

@alamb
Contributor

alamb commented Oct 31, 2022

@alamb does it make sense to revive my existing PR #3912 (fix conflicts, add a bit more documentation in the code, etc.)

I think this makes sense. If possible, I would recommend trying to get the basic API sketched out / well commented, with a few examples -- keeping the PR small is, I think, the key to getting as many eyes on it as possible.

Thank you for driving this forward

@isidentical
Contributor Author

Now that we have the foundations, we should be able to close this. New tickets regarding filter selectivity analysis and overall expression analysis will be tracked in the meta issue #3929.

@alamb
Contributor

alamb commented Nov 10, 2022

Thank you for driving this forward @isidentical 🚀
