Replace filter_row_groups with ReadOptions in parquet SerializedFileReader #1389
ReadOptions with builder API, filter row groups that satisfy all filters, and enable filter row groups by range.
Codecov Report
Attention: Patch coverage is
Coverage diff, master vs. #1389:
Coverage: 83.03% to 83.12% (+0.09%)
Files: 181 to 181
Lines: 52936 to 53329 (+393)
Hits: 43955 to 44332 (+377)
Misses: 8981 to 8997 (+16)
Integration test failure seems unrelated.
Thanks @yjshen. Looks great to me!
        self
    }

    /// Add a range predicate on filtering row groups if their midpoints are within the range
nit: maybe indicate whether the start and end are inclusive or exclusive
Thanks, this is important. I've updated the doc.
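To illustrate why the bounds matter, here is a minimal sketch of a midpoint check. It assumes a half-open range [start, end); that convention and the helper name midpoint_in_range are assumptions for illustration and are not taken from the updated doc comment.

```rust
/// Hypothetical helper: returns true when `midpoint` lies in the
/// half-open byte range `[start, end)`. The half-open convention is
/// an assumption made for this sketch.
fn midpoint_in_range(midpoint: i64, start: i64, end: i64) -> bool {
    midpoint >= start && midpoint < end
}
```

Under this assumption, a midpoint that sits exactly on a boundary (say 100) belongs to [100, 200) and not to [0, 100), so two adjacent ranges never select the same row group.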
/// Get midpoint offset for a row group
fn get_midpoint_offset(meta: &RowGroupMetaData) -> i64 {
    let col = meta.column(0);
    let mut offset = col.data_page_offset();
For encrypted Parquet files we'll need to use file_offset, but it's fine for now since it's not supported anyways.
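The excerpt above shows only the first lines of get_midpoint_offset. As a rough sketch of how such a midpoint could be computed, the code below takes the earlier of the first column's dictionary page and data page offsets and adds half of the row group's compressed size. That completion is an assumption modeled on the parquet-mr approach the PR description refers to. The types ColumnChunk and RowGroup are hypothetical stand-ins, not the crate's RowGroupMetaData.

```rust
/// Hypothetical stand-in for column chunk metadata; it carries only
/// the fields this sketch needs.
struct ColumnChunk {
    data_page_offset: i64,
    dictionary_page_offset: Option<i64>,
}

/// Hypothetical stand-in for row group metadata.
struct RowGroup {
    first_column: ColumnChunk,
    compressed_size: i64,
}

/// Start of the row group: the earlier of the first column's
/// dictionary page offset (if present) and data page offset.
fn start_offset(col: &ColumnChunk) -> i64 {
    match col.dictionary_page_offset {
        Some(dict) if dict < col.data_page_offset => dict,
        _ => col.data_page_offset,
    }
}

/// Assumed midpoint: start offset plus half of the compressed size.
fn midpoint_offset(rg: &RowGroup) -> i64 {
    start_offset(&rg.first_column) + rg.compressed_size / 2
}
```

Because the sketch reads offsets from the first column chunk, it shares the limitation raised in the comment above: for encrypted files a different starting offset would be needed.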
LGTM
Title changed from "ReadOptions with builder API, filter row groups that satisfy all filters, and enable filter row groups by range." to "ReadOptions with builder API, for parquet filter row groups that satisfy all filters, and enable filter row groups by range."
Thanks @yjshen
Title changed from "ReadOptions with builder API, for parquet filter row groups that satisfy all filters, and enable filter row groups by range." to "Replace filter_row_groups with ReadOptions in parquet SerializedFileReader"
Which issue does this PR close?
Closes #158.
Rationale for this change
To support parallel parquet reading at row group level.
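As a sketch of how midpoint-based ranges could support this, the code below splits a file's byte length into contiguous ranges, one per worker, and selects for each range the row groups whose midpoint falls inside it. The function names, the even split, and the half-open [start, end) convention are assumptions for illustration; this is not the crate's API.

```rust
/// Split `[0, file_len)` into `n` contiguous half-open ranges.
/// The last range absorbs any remainder. Panics if `n` is not positive.
fn split_ranges(file_len: i64, n: i64) -> Vec<(i64, i64)> {
    assert!(n > 0, "need at least one range");
    let step = file_len / n;
    (0..n)
        .map(|i| {
            let start = i * step;
            let end = if i == n - 1 { file_len } else { start + step };
            (start, end)
        })
        .collect()
}

/// Indices of the row groups whose midpoint falls in `[start, end)`.
/// `midpoints[i]` is the midpoint byte offset of row group `i`.
fn select_row_groups(midpoints: &[i64], start: i64, end: i64) -> Vec<usize> {
    midpoints
        .iter()
        .enumerate()
        .filter(|&(_, &mid)| mid >= start && mid < end)
        .map(|(i, _)| i)
        .collect()
}
```

Since every midpoint lies in exactly one of the ranges, each row group is read by exactly one worker, even when a row group's bytes straddle a range boundary.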
What changes are included in this PR?
One extra parameter when filtering row groups using row group metadata.
The midpoint and range comparison follow the parquet-mr ParquetInputSplit semantics: a row group is selected for a range when its midpoint falls within that range.
Are there any user-facing changes?
Yes.