Parquet ArrayReader should allow reading a subset of row groups #158

alamb · 2021-04-26T12:43:23Z

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11016

The Parquet ArrayReader currently only supports reading an entire file from start to finish; it cannot selectively read a subset of row groups. This prevents us from parallelizing work across threads when processing a single Parquet file.

alamb · 2021-04-26T12:43:25Z

Comment from Andy Grove(andygrove) @ 2020-12-23T16:28:31.604+0000:

[~nevi_me] [~sunchao] Do either of you know if this would be a lot of work to implement or not? If you have any pointers for someone working on this ticket it would be appreciated.

Comment from Neville Dipale(nevi_me) @ 2020-12-24T19:37:44.895+0000:

I'm not yet sure what it would take to read the same file across threads. Would that mean sharing the file handle? That might not work on all OSes.

From working with the arrow -> parquet IO, I think that if the batch size is aligned with the row group size, it might be easier to read partial data from one or more row groups.

I'll have a look in the coming days and update this ticket with more info. It will likely be a case of exposing some method to do partial reads.

Comment from Chao Sun(csun) @ 2020-12-29T01:44:48.654+0000:

Sorry for the late reply. Yes, I think it should be possible. On the file reader side we can pass in a (start, end) pair alongside the file handle to indicate that we only want to read a segment of the file. Then, after parsing the file metadata, we can check all the row groups in the file, determine which row group(s) overlap with the segment, and select only those.

You can probably check the relevant code in Spark (https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L105) and Parquet (https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1223) for reference.
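The segment-based selection described above can be sketched roughly as follows. This is only an illustration, not the parquet crate's API: `RowGroupSpan`, `overlapping_row_groups`, and `row_groups_by_midpoint` are hypothetical names, and the sketch assumes each row group's starting byte offset and total compressed size are already known from the parsed file metadata.

```rust
/// Hypothetical, simplified stand-in for per-row-group metadata.
/// (Assumption: offset and size were taken from the parsed footer.)
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupSpan {
    /// Byte offset in the file where the row group's data starts.
    pub offset: u64,
    /// Total compressed size of the row group in bytes.
    pub compressed_size: u64,
}

/// Indices of row groups whose byte range overlaps the half-open
/// segment [start, end).
pub fn overlapping_row_groups(groups: &[RowGroupSpan], start: u64, end: u64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, g)| g.offset < end && g.offset + g.compressed_size > start)
        .map(|(i, _)| i)
        .collect()
}

/// Variant: select a row group only if its midpoint lies in [start, end),
/// so that adjacent segments never select the same row group twice.
pub fn row_groups_by_midpoint(groups: &[RowGroupSpan], start: u64, end: u64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, g)| {
            let mid = g.offset + g.compressed_size / 2;
            start <= mid && mid < end
        })
        .map(|(i, _)| i)
        .collect()
}
```

With a plain overlap test, a row group that straddles a segment boundary is selected by both neighbouring segments, so two workers would read it. The midpoint variant assigns each row group to exactly one segment, which is usually what you want when splitting one file across threads.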

I'm not sure about the file handle sharing issue [~nevi_me] mentioned, though. I thought we used to clone the file handle so that it could be shared, but I haven't looked at the code base for some time :(
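On the file handle question, one way to sidestep sharing entirely is for each thread to open its own handle. The sketch below uses only the Rust standard library and is not how the parquet crate is implemented; `read_range` and `read_ranges_parallel` are hypothetical helper names. Note that a handle produced by `File::try_clone` shares its cursor with the original, so seeks on one affect the other, whereas independently opened handles do not interfere.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;
use std::thread;

/// Read `len` bytes starting at byte offset `start`, using a handle
/// opened just for this call (independent handle, independent cursor).
fn read_range(path: &Path, start: u64, len: usize) -> io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(start))?;
    let mut buf = vec![0u8; len];
    file.read_exact(&mut buf)?;
    Ok(buf)
}

/// Read several (start, len) byte ranges of one file in parallel,
/// one scoped thread per range. Results keep the order of `ranges`.
pub fn read_ranges_parallel(path: &Path, ranges: &[(u64, usize)]) -> io::Result<Vec<Vec<u8>>> {
    thread::scope(|s| {
        // Spawn every reader first, then join them in order.
        let handles: Vec<_> = ranges
            .iter()
            .map(|&(start, len)| s.spawn(move || read_range(path, start, len)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("reader thread panicked"))
            .collect()
    })
}
```

In a row-group setting, each (start, len) pair would correspond to the byte span of one or more selected row groups. Opening the file once per thread costs extra open calls, which is the trade-off against sharing a single handle.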

alamb added the "arrow" label (Changes to the arrow crate) Apr 26, 2021

jorgecarleitao added the "enhancement" (Any new improvement worthy of an entry in the changelog) and "parquet" (Changes to the parquet crate) labels and removed the "arrow" (Changes to the arrow crate) label Apr 27, 2021

This was referenced Mar 3, 2022

Replace filter_row_groups with ReadOptions in parquet SerializedFileReader #1389

Merged

Avoid repeated open for one single file and simplify object reader API on the sync part apache/datafusion#1905

Closed

alamb closed this as completed in #1389 Mar 6, 2022
