Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup reading page index (#4149) (#4090) #4151

Merged
merged 4 commits into from
Apr 28, 2023

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #4149
Precursor to #4090

Rationale for this change

This performs some cleanup on the logic for reading the page index, in preparation for adding an async API. It also eliminates the assumption of contiguous ranges (#4149).

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 27, 2023
Copy link
Contributor

@crepererum crepererum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks cleaner.

Two nitpicks:

  1. docstring (see comment)
  2. can we have a regression test for Parquet Page Index Reader Assumes Consecutive Offsets #4149?

use thrift::protocol::{TCompactInputProtocol, TSerializable};

/// Computes the aggregate range of two optional ranges
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's an "aggregate range"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to clarify in 3809a4d

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a very nice cleanup to me. Thank you @tustvold

I also verified the test coverage. Without the changes in this PR the test fails like this:

---- file::serialized_reader::tests::test_page_index_reader_out_of_order stdout ----
thread 'file::serialized_reader::tests::test_page_index_reader_out_of_order' panicked at 'called `Result::unwrap()` on an `Err` value: External(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })', parquet/src/file/serialized_reader.rs:1400:65


@@ -565,6 +566,13 @@ impl ColumnChunkMetaData {
self.column_index_length
}

/// Returns the range for the offset index if any
pub(crate) fn column_index_range(&self) -> Option<Range<usize>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should make these functions pubic and deprecate the offset_index_offset() and offset_index_length() column_index_offset() and column_index_length() functions 🤔 `

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely something to consider as I iterate on these APIs


if let Some(fetch) = fetch {
let bytes = input.get_bytes(fetch.clone()).await?;
let bytes = |r: Range<usize>| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused at first as the bytes closure shadowed the bytes actual data

Maybe calling the closure something like get_bytes or bytes_from_range would be more readable 🤔

This same pattern appears several more times below in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get looks better - thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet Page Index Reader Assumes Consecutive Offsets
3 participants