Can't filter rowgroup for parquet prune for some data type #2962

liukun4515 · 2022-07-25T13:29:54Z

Describe the bug
RowGroupPruningStatistics uses the column statistics to prune row groups when reading a Parquet file.

The logic at
https://github.com/apache/arrow-datafusion/blob/f386f7a7344d54455fe04d92248e373fac990e6d/datafusion/core/src/physical_plan/file_format/parquet.rs#L392
extracts the min and max values for a column.

But this logic has a bug for some data types.

In Parquet, a decimal value can be stored as INT32, INT64, or BINARY, but the logic above does not produce an ArrayRef of the right type for such columns.
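
To make the type problem concrete, here is a minimal sketch of how a decimal min or max could be recovered from each physical representation. It is not the DataFusion code and not the fix from the linked PRs. `StatValue` and `unscaled_decimal` are hypothetical names introduced only for this illustration. The sketch assumes the byte-array form holds the unscaled value as big-endian two's complement, as the Parquet format describes for DECIMAL.

```rust
/// Hypothetical stand-in for a min/max statistic value read from a Parquet
/// row group. This is NOT a type from the parquet crate.
enum StatValue {
    Int32(i32),
    Int64(i64),
    /// Unscaled decimal stored as big-endian two's complement bytes.
    Bytes(Vec<u8>),
}

/// Convert a statistic value to the unscaled 128-bit integer of a decimal.
/// Returns None if the byte form is empty or wider than 16 bytes.
fn unscaled_decimal(value: &StatValue) -> Option<i128> {
    match value {
        StatValue::Int32(v) => Some(i128::from(*v)),
        StatValue::Int64(v) => Some(i128::from(*v)),
        StatValue::Bytes(bytes) => {
            if bytes.is_empty() || bytes.len() > 16 {
                return None;
            }
            // Sign-extend: pad with 0xFF for negative values, 0x00 otherwise.
            let fill = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
            let mut buf = [fill; 16];
            buf[16 - bytes.len()..].copy_from_slice(bytes);
            Some(i128::from_be_bytes(buf))
        }
    }
}
```

With every representation mapped to the same unscaled integer, the min and max can be compared against a decimal literal of the same precision and scale regardless of how the file stored the column.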


mingmwang · 2022-07-28T07:29:06Z

@alamb
Regarding Parquet row group pruning: the current pruning logic covers statistics-based pruning, which is common to any columnar storage format that provides statistics, so it can be reused. But the Parquet format also has its own kinds of pruning, such as dictionary pruning and bloom filter pruning, and neither of those is implemented yet. Maybe those two types of pruning should be part of the parquet arrow project. Also, in the current Parquet reader implementation, I cannot find a method that reads the dictionary page so that it can be used to construct a set for filtering.
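
As a rough illustration of the difference between the two kinds of pruning for an equality predicate, consider the sketch below. `RowGroupSummary` and `may_contain` are hypothetical names, not part of DataFusion or the parquet crate. The sketch also assumes the dictionary values have already been read into a set, which is exactly the step the current reader does not appear to expose.

```rust
use std::collections::HashSet;

/// Hypothetical per-row-group summary for one integer column.
/// Field names are illustrative only.
struct RowGroupSummary {
    min: i64,
    max: i64,
    /// Distinct values from the dictionary page, present only if the whole
    /// column chunk is dictionary-encoded (an assumption of this sketch).
    dictionary: Option<HashSet<i64>>,
}

/// Returns false only when the row group provably has no row with
/// `col = target`, so it can be skipped.
fn may_contain(rg: &RowGroupSummary, target: i64) -> bool {
    // Statistics pruning: the value lies outside the [min, max] range.
    if target < rg.min || target > rg.max {
        return false;
    }
    // Dictionary pruning: the value is in range but not in the dictionary.
    match &rg.dictionary {
        Some(dict) => dict.contains(&target),
        None => true, // no dictionary available, so stay conservative
    }
}
```

A value that falls between min and max but is absent from the dictionary is the case that statistics alone cannot prune.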

alamb · 2022-07-28T10:46:06Z

> Maybe those two types of pruning should be part of the parquet arrow project.

I suspect additional filter pushdown will require changes in the parquet reader first and then in datafusion.

I think there is work underway by @Ted-Jiang @liukun4515 @thinkharderdev and @tustvold to implement "Page Pruning", which I think may be what you are referring to here (it allows the parquet reader to skip materializing/decoding positions based on evaluating the predicates). The work is partially described in apache/arrow-rs#1191.

In terms of using parquet bloom filters, I suspect that would also need work in both parquet and datafusion, and I don't know of any efforts underway to do so. @shanisolomon added initial support to expose the bloom filter metadata in apache/arrow-rs#1309 and follow-on PRs, but I believe they then implemented the Bloom Filtering in a closed source project (cc @zeevm who might know more).

liukun4515 added the bug label Jul 25, 2022

This was referenced Jul 25, 2022

test: add test for decimal and pruning for decimal column #2960

Merged

fix: support decimal statistic for row group prune #2966

Merged

test: add file/SQL level test for pruning parquet row group with decimal data type. #2977

Merged

alamb closed this as completed in #2966 Jul 27, 2022
