Parquet ArrayReader should allow reading a subset of row groups #158

Closed
alamb opened this issue Apr 26, 2021 · 1 comment · Fixed by #1389
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11016

Parquet ArrayReader currently only supports reading an entire file from start to finish and does not allow selectively reading a subset of row groups. This prevents us from parallelizing work across threads when processing a single parquet file.
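To illustrate the kind of parallelism this would unlock, here is a minimal sketch using only the Rust standard library. It assumes row groups can be decoded independently once a reader accepts a subset of row group indices. `read_row_group` and `read_in_parallel` are hypothetical names made up for this example, not functions from the parquet crate.

```rust
use std::thread;

/// Hypothetical stand-in for decoding a single row group.
/// It only returns a made-up row count so the sketch is runnable.
fn read_row_group(index: usize) -> usize {
    (index + 1) * 10
}

/// Split the row group indices `0..num_row_groups` into contiguous chunks,
/// process each chunk on its own scoped thread, and return the total row count.
fn read_in_parallel(num_row_groups: usize, workers: usize) -> usize {
    let indices: Vec<usize> = (0..num_row_groups).collect();
    let workers = workers.max(1);
    // Ceiling division; `max(1)` keeps `chunks` valid when there are no row groups.
    let chunk_len = ((num_row_groups + workers - 1) / workers).max(1);

    thread::scope(|s| {
        let handles: Vec<_> = indices
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || chunk.iter().map(|&i| read_row_group(i)).sum::<usize>())
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}

fn main() {
    // Four row groups shared between two workers.
    println!("total rows: {}", read_in_parallel(4, 2));
}
```

In a real reader each worker would presumably need its own file handle or an otherwise thread-safe way to read bytes, which is the open question discussed in the comments below.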

@alamb alamb added the arrow Changes to the arrow crate label Apr 26, 2021
@alamb
Contributor Author

alamb commented Apr 26, 2021

Comment from Andy Grove(andygrove) @ 2020-12-23T16:28:31.604+0000:

[~nevi_me] [~sunchao] Do either of you know if this would be a lot of work to implement or not? If you have any pointers for someone working on this ticket it would be appreciated.

Comment from Neville Dipale(nevi_me) @ 2020-12-24T19:37:44.895+0000:

I'm not yet sure what it would take to read the same file across threads. Would that mean sharing the file handle? That might not work on all OSes.

From working with the arrow -> parquet IO, I think that if the batch size is aligned with the row group size, it might be easier to read partial data from one or more groups.
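The alignment point can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions: `row_groups_for_batch` is a hypothetical helper, and the row counts are invented. It maps a batch of consecutive rows onto the row groups it touches.

```rust
/// For the batch covering rows `[batch_start, batch_start + batch_size)`,
/// return `(row_group_index, rows_taken_from_that_group)` pairs.
/// `row_counts[i]` is the number of rows in row group `i`.
fn row_groups_for_batch(
    row_counts: &[usize],
    batch_start: usize,
    batch_size: usize,
) -> Vec<(usize, usize)> {
    let batch_end = batch_start + batch_size;
    let mut touched = Vec::new();
    let mut group_start = 0;

    for (i, &n) in row_counts.iter().enumerate() {
        let group_end = group_start + n;
        // Intersect the batch's row range with this group's row range.
        let lo = batch_start.max(group_start);
        let hi = batch_end.min(group_end);
        if lo < hi {
            touched.push((i, hi - lo));
        }
        group_start = group_end;
    }
    touched
}

fn main() {
    let row_counts = [1000, 1000, 1000];
    // Aligned: the batch maps onto exactly one row group.
    println!("{:?}", row_groups_for_batch(&row_counts, 1000, 1000));
    // Misaligned: the batch straddles two row groups.
    println!("{:?}", row_groups_for_batch(&row_counts, 500, 1000));
}
```

When the batch size equals the row group size and batches start on a group boundary, every batch touches exactly one row group, so a partial read never has to stitch rows from two groups together.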

I'll have a look in the coming days and update this ticket with more info. It will likely be a case of exposing some method to do partial reads.

Comment from Chao Sun(csun) @ 2020-12-29T01:44:48.654+0000:

Sorry for the late reply. Yes, I think it should be possible. On the file reader side we can pass in a (start, end) range alongside the file handle to indicate that we only want to read a segment of the file. Then, after parsing the file metadata, we can check all the row groups in the file, determine which row group(s) overlap with the segment, and select only those.
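A rough sketch of that selection step, assuming each row group's byte range is known from the file metadata. `RowGroupSpan` and `select_row_groups` are hypothetical names used only for illustration and are not the parquet crate's types.

```rust
/// Hypothetical, simplified view of one row group's metadata:
/// just the byte range it occupies in the file.
#[derive(Debug, Clone, Copy)]
struct RowGroupSpan {
    start: u64, // offset of the row group's first byte
    len: u64,   // total size of the row group in bytes
}

/// Return the indices of all row groups whose byte range overlaps
/// the half-open file segment `[seg_start, seg_end)`.
fn select_row_groups(row_groups: &[RowGroupSpan], seg_start: u64, seg_end: u64) -> Vec<usize> {
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| {
            let rg_end = rg.start + rg.len;
            // Two half-open ranges overlap when each starts before the other ends.
            rg.start < seg_end && seg_start < rg_end
        })
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let row_groups = [
        RowGroupSpan { start: 4, len: 100 },   // bytes [4, 104)
        RowGroupSpan { start: 104, len: 100 }, // bytes [104, 204)
        RowGroupSpan { start: 204, len: 50 },  // bytes [204, 254)
    ];
    println!("{:?}", select_row_groups(&row_groups, 0, 150));
    println!("{:?}", select_row_groups(&row_groups, 150, 300));
}
```

With a plain overlap test, a row group that straddles a segment boundary is selected by both neighbouring segments (the second row group in the example above). A caller splitting one file across workers would need a tie-breaking rule, for example assigning each row group to the segment that contains its midpoint.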

You can probably check relevant code in [Spark|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L105] and [Parquet|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1223] for reference.

I'm not sure about the file handle sharing issue [~nevi_me] mentioned, though. I thought we used to clone the file handle so that it could be shared, but I haven't looked at the code base for some time :(

@jorgecarleitao jorgecarleitao added enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate and removed arrow Changes to the arrow crate labels Apr 27, 2021

2 participants