-
Notifications
You must be signed in to change notification settings - Fork 807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect missing page indexes while reading Parquet metadata #6507
Conversation
@@ -236,3 +236,63 @@ pub fn parquet_column<'a>( | |||
.find(|x| parquet_schema.get_column_root_idx(*x) == root_idx)?; | |||
Some((parquet_idx, field)) | |||
} | |||
|
|||
#[cfg(test)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is copied from #6463
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etseidl -- this seems like an improvement to me
.err() | ||
.unwrap(); | ||
assert_eq!( | ||
err.to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this certainly is better. Let's see if anyone hits this / is confused
parquet/src/file/metadata/reader.rs
Outdated
@@ -456,7 +471,7 @@ impl ParquetMetaDataReader { | |||
} | |||
|
|||
// one-shot parse of footer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps we can update the docs to mention it updates self.metadata_size
🤔
Thanks @etseidl |
Which issue does this PR close?
Addresses part of #6464.
Rationale for this change
If reading metadata from an external cache that lacks the page indexes, but the footer metadata contains page index offsets valid for the original source, attempting to read the page indexes can cause a variety of errors.
What changes are included in this PR?
Modifies
ParqetMetaDataReader
to keep track of the size of the footer metadata, and when reading the page indexes detects if the ranges for the two structures overlap. Returns an EOF error if they do.Are there any user-facing changes?
No, changes are to internal APIs only.