
A framework for expression boundary analysis (and statistics) #3898

Closed
Tracked by #3929
isidentical opened this issue Oct 20, 2022 · 9 comments
Labels
enhancement New feature or request

Comments

@isidentical
Contributor

isidentical commented Oct 20, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
With the addition of the ExprBounds API in #3868, we should be able to turn it into a toolkit that optimizations like predicate pruning (or range-based pruning) can leverage.

Describe the solution you'd like
A common entry point for each physical expression that takes an analysis context as input and returns the boundaries for that expression. Each boundary corresponds to the range of values the expression can evaluate to.

Expression   | min                             | max
a            | ctx.cols[a].min                 | ctx.cols[a].max
5            | 5                               | 5
a > 5        | bounds(a).min > 5 (bool)        | bounds(a).max > 5 (bool)
Q1 + Q2      | bounds(Q1).min + bounds(Q2).min | bounds(Q1).max + bounds(Q2).max
max(Q1 + Q2) | bounds(Q1 + Q2).max             | bounds(Q1 + Q2).max
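To make the composition rule concrete, here is a minimal, hypothetical sketch of the interval arithmetic in the table above, using plain (min, max) tuples instead of the actual boundary types (the function names are invented for illustration):

// Boundaries modeled as plain (min, max) tuples, purely for illustration.
type Bounds = (f64, f64);

// Q1 + Q2: add the minimums and add the maximums, as in the table above.
fn add_bounds(q1: Bounds, q2: Bounds) -> Bounds {
    (q1.0 + q2.0, q1.1 + q2.1)
}

// a > 5: compare each endpoint against the literal to get a boolean range.
fn gt_literal_bounds(a: Bounds, lit: f64) -> (bool, bool) {
    (a.0 > lit, a.1 > lit)
}

fn main() {
    let a: Bounds = (1.0, 10.0);
    assert_eq!(add_bounds(a, (5.0, 5.0)), (6.0, 15.0));
    // a in [1, 10] compared against 5: the result can be false or true.
    assert_eq!(gt_literal_bounds(a, 5.0), (false, true));
}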

Each expression also leaves behind some partial metadata about what happens inside its own context. Consider an analysis context that looks roughly like the following:

Expression      | Context Before | Context After      | Expr Boundaries
a               | {a: [1, 10]}   | {a: [1, 10]}       | [1, 10]
a > 5           | {a: [1, 10]}   | {a: [6, 10]}       | [false, true]
a > 5 AND a < 7 | {a: [1, 10]}   | {a: [6]}           | [false, true]
a > 8 OR a = 1  | {a: [1, 10]}   | {a: [1] U [9, 10]} | [false, true]

This is essential for cases like the ones above: even though the boundaries are the same (the result is either false or true, since there is only a partial overlap), we learn a lot about what 'a' can be after processing the expression. Each expression can either share its entire context with its children (e.g. in the case of AND, both sides share a single context, so every subsequent expression knows that a, for example, can only be greater than 6 at that point), or fork the context and merge the results back together with its own logic (OR is the most basic example, where you have to take the union of the ranges since either branch could apply).
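As a heavily simplified, hypothetical sketch of that sharing/forking behaviour (Bounds, Context, analyze_and, and analyze_or are stand-ins invented here, not part of the proposal; the OR merge also collapses the result into a single covering range rather than the exact union shown in the table):

#[derive(Clone, Debug, PartialEq)]
struct Bounds {
    min: i64,
    max: i64,
}

#[derive(Clone, Debug, PartialEq)]
struct Context {
    // One known range per column (None = nothing known), mirroring the
    // proposed AnalysisContext::column_boundaries.
    column_boundaries: Vec<Option<Bounds>>,
}

// AND: both sides see (and refine) the very same context, so whatever the
// left side learns about `a` is already visible while analyzing the right side.
fn analyze_and(ctx: &mut Context, left: impl Fn(&mut Context), right: impl Fn(&mut Context)) {
    left(ctx);
    right(ctx);
}

// OR: each side works on its own fork of the context, and the forks are then
// merged. Here the merge widens each column to the smallest single range that
// covers both branches; the table above keeps the exact union ([1] U [9, 10]).
fn analyze_or(ctx: &mut Context, left: impl Fn(&mut Context), right: impl Fn(&mut Context)) {
    let mut left_ctx = ctx.clone();
    let mut right_ctx = ctx.clone();
    left(&mut left_ctx);
    right(&mut right_ctx);

    for i in 0..ctx.column_boundaries.len() {
        ctx.column_boundaries[i] =
            match (&left_ctx.column_boundaries[i], &right_ctx.column_boundaries[i]) {
                (Some(l), Some(r)) => Some(Bounds {
                    min: l.min.min(r.min),
                    max: l.max.max(r.max),
                }),
                _ => None,
            };
    }
}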

Having a simple API (like the one below), together with a shared context, seems to fulfil the majority of the requirements we have for filter selectivity analysis. Another place where the context could add a bit more flexibility is histograms (if we ever support them, which Ballista might be heading toward), where the context would also know the distribution of column a, etc., so ColumnExpr could simply learn it from the analysis context.

Proposed APIs

pub trait PhysicalExpr: Send + Sync + Display + Debug {
    [...]

    /// Return the boundaries of this expression. This method (and all the
    /// related APIs) are experimental and subject to change.
    fn analyze(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries> {
        None
    }

    /// Apply the given boundaries to this expression (and its child expressions,
    /// if they are relevant to the boundaries).
    fn apply(&self, context: &mut AnalysisContext, boundaries: &ExprBoundaries) {}
}

/// Represents boundaries of an expression, if it were to be evaluated.
#[derive(Clone, Debug, PartialEq)]
pub struct ExprBoundaries {
    /// The result type of the expression (should be same as expr.data_type())
    pub data_type: DataType,
    /// Minimum value this expression's result can have.
    pub min_value: ScalarValue,
    /// Maximum value this expression's result can have.
    pub max_value: ScalarValue,
    /// Maximum number of distinct values this expression can produce, if known.
    pub distinct_count: Option<usize>,
    /// Row selectivity of this expression if it were used as a predicate, as a
    /// value between 0 and 1.
    pub selectivity: Option<f64>,
}

/// A context for collecting and aggregating known boundaries of an expression
/// tree.
pub struct AnalysisContext {
    /// A list of known column boundaries, ordered by column index
    /// in the current schema.
    pub column_boundaries: Vec<Option<ExprBoundaries>>,
}
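
For a sense of how a caller might drive this, here is a rough sketch that assumes the trait and structs above are in scope; estimate_filter_selectivity and its signature are invented for illustration and are not part of the proposal:

use std::sync::Arc;

// Illustrative only: assumes PhysicalExpr, AnalysisContext, and ExprBoundaries
// from the proposal above are in scope.
fn estimate_filter_selectivity(
    predicate: Arc<dyn PhysicalExpr>,
    known_columns: Vec<Option<ExprBoundaries>>,
) -> Option<f64> {
    // Seed the context with whatever we already know about the input columns
    // (e.g. min/max pulled from table statistics).
    let mut context = AnalysisContext {
        column_boundaries: known_columns,
    };

    // Walk the expression tree; expressions without an implementation fall
    // back to the default and return None.
    let boundaries = predicate.analyze(&mut context)?;

    // After analyze(), context.column_boundaries also holds any narrowed
    // per-column ranges, and the predicate's own boundaries carry the
    // estimated selectivity.
    boundaries.selectivity
}

Something like a FilterExec could then feed statistics-derived column boundaries into such a helper to refine its output row-count estimate.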

Additional context
Originating from #3868 (which implements a more rudimentary version of the proposed framework), with discussion points from @alamb on the possible unification/consolidation of pruning logic (as well as a gateway to more optimizations).

Potential features include:

  • Keeping min/max exact rather than estimated (selectivity can be an estimate, but we should be able to determine min/max quite easily if we see a predicate like a > b)
@alamb
Contributor

alamb commented Oct 20, 2022

I love this idea @isidentical -- thank you for filing it -- I am still not quite sure about how well keeping column_boundaries would work in practice, but I think I would just have to see how it works in practice to find out.

I also wonder what "apply"ing boundaries to a PhysicalExpr means -- does it mean the output of the PhysicalExpr might change when run against some data? Does it mean the PhysicalExpr might change? Or is it solely to update the AnalysisContext?

@isidentical
Contributor Author

isidentical commented Oct 20, 2022

I love this idea @isidentical -- thank you for filing it --

Thanks @alamb (and for all your feedback during the initial design) ❤️

I am still not quite sure about how well keeping column_boundaries would work in practice, but I think I would just have to see how it works in practice to find out.

It also bothers me a bit, so maybe we can iterate on it and see whether there is a simpler solution that solves the following problem without apply() and column_boundaries.

let expr = parse("a >= 20");
let mut context = Context { column_boundaries: [Boundary {min: 1, max: 100}] }; // this can be constructed from statistics as well
let boundaries = expr.analyze(&mut context);
assert!(context.column_boundaries[0].min == 20);

If we want the condition above to succeed (considering we now know that a's minimum value is 20, due to a >= 20), we need a way of translating that information down to the column level (so that, if the next expression [in the same context] is a < 30, we can narrow it even further, etc.).

The simplest solution I can think of is checking whether the left side is a column and, if so, changing its boundaries directly inside the filter selectivity analysis (context.boundaries[left.index] = Boundary {min: 20, max: left.max}). But what can we do if the expression looks something like a + 1 >= 20?

This is where apply() comes into play. When the analysis for a + 1 >= 20 is complete, we go through the following steps:

  • (a + 1 >= 20's analyze) left.apply(Boundary {min: 20, max: left.max})
    • (a + 1's apply) left.apply(Boundary {min: boundaries.min - 1, max: boundaries.max - 1})
      • (a's apply) context.columns[self.index] = boundaries

If the left side were a simple column, it would look much the same:

  • (a >= 20's analyze) left.apply(Boundary {min: 20, max: left.max})
    • (a's apply) context.columns[self.index] = boundaries

And if any expression in the next conjunction (or really anything that shares the same context) references a, it will simply use the updated boundaries without any extra work.
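
Here is a hypothetical, heavily simplified sketch of that cascade for a + 1 >= 20; the Analyzable trait, Column, and AddLiteral types are stand-ins invented for illustration, while the real proposal would operate on PhysicalExpr, ScalarValue, and the ExprBoundaries struct above:

#[derive(Clone, Copy, Debug, PartialEq)]
struct Bounds {
    min: i64,
    max: i64,
}

struct Context {
    columns: Vec<Bounds>,
}

trait Analyzable {
    fn apply(&self, ctx: &mut Context, bounds: Bounds);
}

/// A bare column reference, e.g. `a`.
struct Column {
    index: usize,
}

impl Analyzable for Column {
    fn apply(&self, ctx: &mut Context, bounds: Bounds) {
        // Leaf case: the narrowed range lands directly in the shared context.
        ctx.columns[self.index] = bounds;
    }
}

/// `child + literal`, e.g. `a + 1`.
struct AddLiteral {
    child: Box<dyn Analyzable>,
    literal: i64,
}

impl Analyzable for AddLiteral {
    fn apply(&self, ctx: &mut Context, bounds: Bounds) {
        // Invert the `+ literal` before pushing the range further down:
        // if `a + 1` is in [20, 101], then `a` is in [19, 100].
        self.child.apply(
            ctx,
            Bounds {
                min: bounds.min - self.literal,
                max: bounds.max - self.literal,
            },
        );
    }
}

fn main() {
    // a is initially known to be in [1, 100].
    let mut ctx = Context {
        columns: vec![Bounds { min: 1, max: 100 }],
    };
    let expr = AddLiteral {
        child: Box::new(Column { index: 0 }),
        literal: 1,
    };

    // The comparison's analyze() step would call this once it derives that
    // `a + 1` must be at least 20 for the predicate to hold.
    expr.apply(&mut ctx, Bounds { min: 20, max: 101 });
    assert_eq!(ctx.columns[0], Bounds { min: 19, max: 100 });
}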

I also wonder what "apply"ing boundaries to a PhysicalExpr means -- does it mean the output of the PhysicalExpr might change when run against some data? Does it mean the PhysicalExpr might change? Or is it solely to update the AnalysisContext?

Solely for AnalysisContext.

@isidentical
Contributor Author

One thing I am not super sure of is whether we want to keep this inside PhysicalExpr (as two new methods) or in a separate protocol (a new trait, PhysicalExprAnalyzer, which PhysicalExpr would return and which would carry the apply/update methods). I am thinking that if it is only two methods, they might as well live inside PhysicalExpr, which would also remove a level of indirection when we are actually doing the analysis.

@alamb
Contributor

alamb commented Oct 20, 2022

One benefit of keeping it inside PhysicalExpr is that anyone who has an extension for PhysicalExpr can also provide their own expression analysis.

As long as they have reasonable default implementations (that return unknown ExprBounds), I think it would be fine.

@isidentical
Contributor Author

A general overview of what came out of the discussions (regarding the expression analysis part) and the meetup:

  • Better value distributions rather than just knowing the [min, max].

Although there isn't any concrete work yet, this was also part of the proposed 'future' section, and we might be able to turn it into a reality once we have the general system ready (this is what Spark did, IIRC: it initially implemented basic ranges and then moved over to histograms when they became available).

  • Support for comparing two expressions (a > b) with different ranges.

This can be supported by the existing framework, but it still needs an implementation 😇 Currently, for binary expressions, we have a code path that is taken when either side has a scalar boundary (e.g. a=[20, 20]; b=[10, 30]), but we can introduce a second one very easily depending on how it should work 👀 It requires further research, but I don't think it needs to block this ticket (it can actually be a part of it); see the sketch after this list.

  • Not passing a mutable handle for the context but rather giving it up completely.

I'd be fine with this, though from a purely aesthetic point of view it looks a bit hard to parse 😄 I'm happy to be convinced, though. Let's discuss it again in the code review!
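
Regarding the a > b point above, one possible shape for that second code path, sketched with hypothetical stand-in types (this is not what #3912 implements):

#[derive(Clone, Copy, Debug, PartialEq)]
struct Bounds {
    min: i64,
    max: i64,
}

/// Returns the (min, max) of the boolean result of `a > b` when both sides
/// are ranges rather than scalars.
fn gt_bounds(a: Bounds, b: Bounds) -> (bool, bool) {
    if a.min > b.max {
        // every possible value of a is greater than every possible value of b
        (true, true)
    } else if a.max <= b.min {
        // no value of a can exceed any value of b
        (false, false)
    } else {
        // the ranges overlap, so the result could go either way
        (false, true)
    }
}

fn main() {
    assert_eq!(
        gt_bounds(Bounds { min: 20, max: 20 }, Bounds { min: 10, max: 30 }),
        (false, true)
    );
    assert_eq!(
        gt_bounds(Bounds { min: 40, max: 50 }, Bounds { min: 10, max: 30 }),
        (true, true)
    );
}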

@isidentical
Contributor Author

If I didn't miss anything, that should be all, and we should be ready to at least put the foundational work in. @alamb, does it make sense to revive my existing PR #3912 (fix conflicts, add a bit more documentation to the code, etc.) and continue working on this iteratively for the next steps (like other expression types)?

@alamb
Contributor

alamb commented Oct 31, 2022

@alamb does it make sense to revive my existing PR #3912 (fix conflicts, add a bit more documentation in the code, etc.)

I think this makes sense. If possible, I would recommend trying to get the basic API sketched out / well commented, with a few examples -- keeping the PR small is, I think, the key to getting as many eyes on it as possible.

Thank you for driving this forward

@isidentical
Contributor Author

Now that we have the foundations, we should be able to close this. New tickets regarding filter selectivity analysis and overall expression analysis will be tracked in the meta issue #3929.

@alamb
Contributor

alamb commented Nov 10, 2022

Thank you for driving this forward @isidentical 🚀
