Support general pruning based on <col> = 'const' in PruningPredicate
#8376
Is your feature request related to a problem or challenge?
In IOx in certain cases we know that a "container" (a parquet file, or a set of record batches) has only a single value for some computed quantity (in our case hash(column) % N for some constant N, like 100).
For this case, we want to be able to quickly determine, given an arbitrary predicate, whether that container could not possibly contain the value.
So for example, if we know the container has hash(column) % 100 = 27, given a predicate that includes an expression like column = 'foo', we can compute the quantity hash('foo') % 100 and, if it is not 27, we can skip the entire container.
We could implement this directly in our codebase, and we may do so temporarily. However, I think the use case is common enough that we would like to improve the support upstream in DataFusion so others can both benefit and help optimize for it.
For example, applying BloomFilters to prune out parquet row groups (added in #7821 by @hengfeiyang) has the same pattern.
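The hash-bucket check described above can be sketched roughly as follows. This is only an illustration using the standard library: the names bucket_of, Container and can_skip are hypothetical, and DefaultHasher stands in for whatever hash function the real partitioning uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Number of hash buckets (the constant N from the description).
const N: u64 = 100;

/// Hypothetical helper: the bucket a value falls into, i.e. hash(value) % N.
/// DefaultHasher is an assumption; the real system may use a different hash.
fn bucket_of(value: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish() % N
}

/// Hypothetical container metadata: every row in the container
/// satisfies hash(column) % N == bucket.
struct Container {
    bucket: u64,
}

/// Returns true if a `column = literal` predicate can never match
/// any row in this container, so the container can be skipped.
fn can_skip(container: &Container, literal: &str) -> bool {
    bucket_of(literal) != container.bucket
}
```

A container whose bucket equals bucket_of("foo") must still be read for column = 'foo', while a container with any other bucket can be skipped without looking at its data.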
Since we also have other information such as min/max and null counts for certain columns that we prune using PruningPredicate, having this ability be part of PruningPredicate is compelling.
Describe the solution you'd like
DataFusion's PruningPredicate can already use information on ranges (min/max values). I would like to extend its capabilities to incorporate knowledge about certain specific values to take advantage of column = <constant> predicates.
I propose we extend PruningStatistics with some way to pass on knowledge about the contents of data structures like Bloom Filters. For example:
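One possible shape for such an extension is sketched below. It is not DataFusion's actual trait: ContainsStatistics, DistinctValues and the string-based signature are hypothetical, and the three-valued result (only this value, definitely absent, unknown) is an assumption chosen so that both the equality and inequality examples in this issue make sense.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the proposed extension; not DataFusion's real trait.
trait ContainsStatistics {
    /// Number of containers described by these statistics.
    fn num_containers(&self) -> usize;

    /// One entry per container (assumed semantics):
    /// - Some(true):  the column is known to hold only `value`
    /// - Some(false): the column is known not to hold `value`
    /// - None:        nothing is known either way
    fn contains(&self, column: &str, value: &str) -> Vec<Option<bool>>;
}

/// Toy statistics: for one column, the exact set of distinct values
/// in each container, when that set is known.
struct DistinctValues {
    column: String,
    per_container: Vec<Option<HashSet<String>>>,
}

impl ContainsStatistics for DistinctValues {
    fn num_containers(&self) -> usize {
        self.per_container.len()
    }

    fn contains(&self, column: &str, value: &str) -> Vec<Option<bool>> {
        // No information about any other column.
        if column != self.column {
            return vec![None; self.per_container.len()];
        }
        self.per_container
            .iter()
            .map(|known| {
                let set = known.as_ref()?; // unknown contents => None
                if !set.contains(value) {
                    Some(false)
                } else if set.len() == 1 {
                    Some(true)
                } else {
                    None // value is present alongside others
                }
            })
            .collect()
    }
}
```

Under these assumed semantics a Bloom filter backed implementation would only ever answer Some(false) or None, since a Bloom filter can prove absence but not presence.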
We could then implement the bloom filter pruning in DataFusion with this API, as well as use the same thing for our downstream use case.
Example for equality predicate col = 'foo'
The PruningPredicate would call the proposed contains API and could prune all containers that returned false.
Example for inequality predicate col != 'foo'
The PruningPredicate would call the proposed contains API and could prune all containers that returned true.
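The two pruning rules above can be sketched as a small decision function. The names Op and keep_containers are hypothetical, and the sketch assumes a per-container answer where Some(true) means the container holds only the value, Some(false) means the value is definitely absent, and None means unknown.

```rust
/// The two predicate shapes discussed in the examples.
#[derive(Clone, Copy)]
enum Op {
    Eq,    // col = 'foo'
    NotEq, // col != 'foo'
}

/// Returns, per container, whether it must still be read (true)
/// or can be pruned (false). `answers` holds one assumed `contains`
/// result per container.
fn keep_containers(op: Op, answers: &[Option<bool>]) -> Vec<bool> {
    answers
        .iter()
        .map(|answer| match (op, *answer) {
            // col = 'foo' cannot match if 'foo' is definitely absent.
            (Op::Eq, Some(false)) => false,
            // col != 'foo' cannot match if every row is 'foo'.
            (Op::NotEq, Some(true)) => false,
            // Anything else, including unknown, must be kept.
            _ => true,
        })
        .collect()
}
```

An unknown answer always keeps the container, so pruning stays conservative.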
Note I don't think the contains API could be used for other inequality predicates like col < 'foo', for example. The existing min/max statistics would have to be used.
Describe alternatives you've considered
I also thought about trying to rewrite equality predicates to take advantage of the existing min/max statistics (which can represent the case where a column has only a single value). However, that API doesn't allow for information like Bloom filters, which can only say whether a value may be present or is definitely absent.
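To illustrate this alternative and its limit, here is a minimal sketch of an equality check against min/max statistics. MinMax and may_contain are hypothetical names, not DataFusion APIs.

```rust
/// Hypothetical per-container min/max statistics for one string column.
struct MinMax<'a> {
    min: &'a str,
    max: &'a str,
}

/// `col = value` can only match if min <= value <= max.
/// A true result means "maybe present", not "present".
fn may_contain(stats: &MinMax, value: &str) -> bool {
    stats.min <= value && value <= stats.max
}
```

When min == max the check is exact, which covers the single-value case. For a wide range such as min = "a", max = "z", the check returns true for 'foo' whether or not 'foo' is actually stored, and there is no way to express a Bloom filter's "definitely absent" answer for a value that falls inside the range.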
Additional context
Here is the code that does bloom filtering: https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L203-L252