-
Notifications
You must be signed in to change notification settings - Fork 875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup reading page index (#4149) (#4090) #4151
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks cleaner.
Two nitpicks:
- docstring (see comment)
- can we have a regression test for Parquet Page Index Reader Assumes Consecutive Offsets #4149?
use thrift::protocol::{TCompactInputProtocol, TSerializable}; | ||
|
||
/// Computes the aggregate range of two optional ranges |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's an "aggregate range"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to clarify in 3809a4d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a very nice cleanup to me. Thank you @tustvold
I also verified the test coverage. Without the changes in this PR the test fails like this:
---- file::serialized_reader::tests::test_page_index_reader_out_of_order stdout ----
thread 'file::serialized_reader::tests::test_page_index_reader_out_of_order' panicked at 'called `Result::unwrap()` on an `Err` value: External(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })', parquet/src/file/serialized_reader.rs:1400:65
@@ -565,6 +566,13 @@ impl ColumnChunkMetaData { | |||
self.column_index_length | |||
} | |||
|
|||
/// Returns the range for the offset index if any | |||
pub(crate) fn column_index_range(&self) -> Option<Range<usize>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should make these functions pubic and deprecate the offset_index_offset()
and offset_index_length()
column_index_offset()
and column_index_length()
functions 🤔 `
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely something to consider as I iterate on these APIs
|
||
if let Some(fetch) = fetch { | ||
let bytes = input.get_bytes(fetch.clone()).await?; | ||
let bytes = |r: Range<usize>| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was confused at first as the bytes
closure shadowed the bytes
actual data
Maybe calling the closure something like get_bytes
or bytes_from_range
would be more readable 🤔
This same pattern appears several more times below in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get
looks better - thank you
Which issue does this PR close?
Closes #4149
Precursor to #4090
Rationale for this change
This performs some cleanup on the logic for reading the page index, in preparation for adding an async API. It also eliminates the assumption of contiguous ranges (#4149).
What changes are included in this PR?
Are there any user-facing changes?