Implement parquet page-level skipping with column index, using min/max stats #847
Comments
I've followed the Spark implementation for a while, and I think this would be a great feature to have.

Also, pyspark 3.2 supports the Parquet column index.

I believe through the hard work of @Ted-Jiang @thinkharderdev @tustvold and others, this feature is nearing fruition: apache/arrow-rs#1191 and apache/arrow-rs#2270

Working on it. 😊

I believe this has been implemented by #3780; feel free to reopen if I have missed anything.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
(This is a summarized version of some comments in a discussion thread between @sunchao, @jorgecarleitao, and @nju_yaho (not sure if that is the correct GitHub handle).)
When reading data from Parquet files, the more data that can be ruled out immediately, without decompressing it, the faster the query will run.
@sunchao pointed out that the structure of Parquet also allows page-level skipping with the column index, using min/max stats, which is pretty effective when the data is sorted. Sorting matters because otherwise a data page could contain random data spread across a big range [min, max], and predicates such as `col < 42` won't be very effective. There is a good blog post about this feature: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
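The pruning decision described above can be sketched as follows. This is an illustrative example, not the parquet crate's API: `PageStats` and `can_skip_page` are hypothetical names, and the example shows why sorted data prunes well while a single wide-range page cannot be skipped.

```rust
/// Hypothetical per-page statistics, as a ColumnIndex would record them.
struct PageStats {
    min: i64,
    max: i64,
}

/// For the predicate `col < threshold`, a page can be skipped only when
/// even its smallest value fails the predicate, i.e. min >= threshold.
fn can_skip_page(stats: &PageStats, threshold: i64) -> bool {
    stats.min >= threshold
}

fn main() {
    // Sorted data: pages cover narrow, increasing ranges, so `col < 42`
    // cleanly prunes every page whose min is already past the threshold.
    let sorted_pages = [
        PageStats { min: 0, max: 40 },
        PageStats { min: 41, max: 80 },
        PageStats { min: 81, max: 120 },
    ];
    let kept: Vec<bool> = sorted_pages
        .iter()
        .map(|p| !can_skip_page(p, 42))
        .collect();
    assert_eq!(kept, vec![true, true, false]);

    // Unsorted data: one page spanning [0, 120] can never be skipped,
    // even if only a handful of its rows actually satisfy the predicate.
    let wide_page = PageStats { min: 0, max: 120 };
    assert!(!can_skip_page(&wide_page, 42));
    println!("ok");
}
```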
Note that page-level min/max statistics are a relatively new feature. We only know of parquet-mr and impala having implemented them. Spark also recently added support in apache/spark#32753. The page indexes are stored in the column chunk metadata: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L798
Describe the solution you'd like
To achieve this, the latest Parquet version has introduced two kinds of indexes: `ColumnIndex` and `OffsetIndex`. Both are stored in the file footer, one pair for each `ColumnChunk` of each `RowGroup`. The `ColumnIndex` records the `min` and `max` for each page, while the `OffsetIndex` records the row ranges and file offset range for each page. For example, a predicate evaluated against Column A's index can be used to filter the pages that must be read for Column B.
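A rough sketch of that A-filters-B flow, with illustrative structs rather than the parquet crate's actual API (`ColumnIndexEntry`, `page_row_ranges`, and the page boundaries are all hypothetical): column A's min/max entries select surviving row ranges, and column B then reads only the pages whose row ranges intersect them.

```rust
/// Hypothetical ColumnIndex entry: min/max for one page.
struct ColumnIndexEntry {
    min: i64,
    max: i64,
}

/// Derive each page's half-open row range [start, end) from an
/// OffsetIndex-style list of first-row indexes per page.
fn page_row_ranges(first_rows: &[i64], num_rows: i64) -> Vec<(i64, i64)> {
    first_rows
        .iter()
        .enumerate()
        .map(|(i, &start)| (start, first_rows.get(i + 1).copied().unwrap_or(num_rows)))
        .collect()
}

/// Do two half-open ranges overlap?
fn overlaps(a: (i64, i64), b: (i64, i64)) -> bool {
    a.0 < b.1 && b.0 < a.1
}

fn main() {
    let num_rows = 300;

    // Column A: sorted values, three pages of 100 rows each.
    let a_index = [
        ColumnIndexEntry { min: 0, max: 99 },
        ColumnIndexEntry { min: 100, max: 199 },
        ColumnIndexEntry { min: 200, max: 299 },
    ];
    let a_rows = page_row_ranges(&[0, 100, 200], num_rows);

    // Predicate `A < 42`: keep only A pages whose [min, max] can match,
    // and collect their row ranges.
    let surviving: Vec<(i64, i64)> = a_index
        .iter()
        .zip(&a_rows)
        .filter(|(e, _)| e.min < 42)
        .map(|(_, &r)| r)
        .collect();
    assert_eq!(surviving, vec![(0, 100)]);

    // Column B has different page boundaries; read only the B pages whose
    // row range intersects a surviving range.
    let b_rows = page_row_ranges(&[0, 80, 160, 240], num_rows);
    let b_pages_to_read: Vec<usize> = b_rows
        .iter()
        .enumerate()
        .filter(|(_, &r)| surviving.iter().any(|&s| overlaps(s, r)))
        .map(|(i, _)| i)
        .collect();
    // Rows [0, 100) span B's pages [0, 80) and [80, 160).
    assert_eq!(b_pages_to_read, vec![0, 1]);
    println!("ok");
}
```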
In the case of Column B above, you also need to use the row ranges when scanning a page, and skip the rows that are not within the range. With multiple predicates on different columns, you'd also need to compute row range intersections or unions.
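Combining per-column selections could look like the sketch below (illustrative helpers, not any existing API): AND'ed predicates intersect the surviving row ranges, OR'ed predicates union them. Ranges are half-open `[start, end)` and assumed sorted and non-overlapping within each list.

```rust
/// Intersect two lists of half-open row ranges (AND of predicates).
fn intersect(a: &[(i64, i64)], b: &[(i64, i64)]) -> Vec<(i64, i64)> {
    let mut out = Vec::new();
    for &(s1, e1) in a {
        for &(s2, e2) in b {
            let (s, e) = (s1.max(s2), e1.min(e2));
            if s < e {
                out.push((s, e));
            }
        }
    }
    out
}

/// Union two lists of half-open row ranges (OR of predicates),
/// merging touching or overlapping ranges.
fn union(a: &[(i64, i64)], b: &[(i64, i64)]) -> Vec<(i64, i64)> {
    let mut all: Vec<(i64, i64)> = a.iter().chain(b).copied().collect();
    all.sort();
    let mut out: Vec<(i64, i64)> = Vec::new();
    for (s, e) in all {
        match out.last_mut() {
            Some(last) if s <= last.1 => last.1 = last.1.max(e),
            _ => out.push((s, e)),
        }
    }
    out
}

fn main() {
    // Rows surviving a predicate on column A, and on column B.
    let a = [(0, 100), (200, 250)];
    let b = [(50, 220)];
    assert_eq!(intersect(&a, &b), vec![(50, 100), (200, 220)]);
    assert_eq!(union(&a, &b), vec![(0, 250)]);
    println!("ok");
}
```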
Describe alternatives you've considered
TBD
Additional context
Add any other context or screenshots about the feature request here.