# Move ParquetMetadata decoder state machine into ParquetMetadataPushDecoder #8340
I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅

Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.

Agreed. Looking forward to this one. I'm hoping for a much more flexible metadata parsing regime after the dust settles.

I just did a test merge of this branch with the head of my remodel branch and it went pretty smoothly. The few conflicts were easily resolved. 🚀
```rust
return Ok(DecodeResult::NeedsData(vec![file_len - 8..file_len]));
// ...
let footer_len = FOOTER_SIZE as u64;
loop {
    match std::mem::replace(&mut self.state, DecodeState::Intermediate) {
```
Here is the core state machine; to my mind it makes very clear what is happening.

I am quite pleased with how this decoder state machine is looking.
Ok, I am now pretty happy with this PR and how it looks. I broke it up into a few PRs to make reviews easier. You can see the results in this PR as the last commit. If/when those PRs are merged, I'll rebase this one and mark it as ready for review.
# Which issue does this PR close?

- Part of #8000
- Prep PR for #8340, to make it easier to review

Note: while this is a large (in line count) code change, it should be relatively easy to review as it is just moving code around.

# Rationale for this change

In #8340 I am trying to split the "IO" from the "where is the metadata in the file" from the "decode thrift into Rust structures" logic. The first part of this is simply to move the code that handles the "decode thrift into Rust structures" into its own module.

# What changes are included in this PR?

1. Move most of the "parse thrift bytes into rust structure" code from `parquet/src/file/metadata/mod.rs` to `parquet/src/file/metadata/parser.rs`

# Are these changes tested?

Yes, by CI.

# Are there any user-facing changes?

No, this is entirely internal reorganization.

Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>

---

# Which issue does this PR close?

- Part of #8000
- Prep PR for #8340, to make it easier to review

# Rationale for this change

In #8340 I am trying to split the "IO" from the "where is the metadata in the file" from the "decode thrift into Rust structures" logic. I want to make it as easy as possible to review, so I split it into pieces, but you can see #8340 for how it all fits together.

# What changes are included in this PR?

This PR cleans up the code that handles parsing the 8 byte parquet file footer, `FooterTail`, into its own module and constructor.

# Are these changes tested?

Yes, by CI.

# Are there any user-facing changes?

No, this is entirely internal reorganization and I left a `pub use`.

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
🤖: Benchmark completed
```diff
 metadata_size: Option<usize>,
 #[cfg(feature = "encryption")]
-file_decryption_properties: Option<FileDecryptionProperties>,
+file_decryption_properties: Option<std::sync::Arc<FileDecryptionProperties>>,
```
The `FileDecryptionProperties` is currently copied, which is unfortunate. As a follow-on PR, I plan to update the options elsewhere to use an `Arc<FileDecryptionProperties>` to avoid copies.
```rust
/// API for decoding metadata that may be encrypted
#[derive(Debug, Default)]
pub(crate) struct MetadataParser {
```
I am thinking we can eventually use this structure as the place to hang more detailed decoding instructions (like "only decode statistics for column A" and so on).
```rust
/// Create a decoder with the given `ParquetMetaData` already known.
///
/// This can be used to parse and populate the page index structures
```
I think this is now a nice API to load/decode PageIndexes -- provide an existing `ParquetMetaData` and then this decoder figures out what bytes are needed and parses them. If we ever want to extend `ParquetMetaData` to include, for example, BloomFilters, we could use the same basic idea.
```rust
// Get bounds needed for page indexes (if any are present in the file).
let Some(range) = self.range_for_page_index() else {
    return Ok(());
// ...
let Some(metadata) = self.metadata.take() else {
```
I had hoped we would be able to remove more of the logic from `ParquetMetadataReader`, but I couldn't figure out how to do so given the somewhat complex way it supports reading metadata even when the file length isn't known.

This PR is now ready for review.
Looks like the benchmark differences are noise. I have an idea to reduce some allocations though, which I will push up here.

I'm going to try merging this into my remodel branch and see what comes up.
Love where this is heading! 🚀
```rust
"Parquet file has an encrypted footer but the encryption feature is disabled"
))
} else {
    decode_metadata(buf)
```
This is the only problematic line for the merge. For my initial pass I replaced this call with the thrift decode, but on second thought I should just change the implementation of `decode_metadata` below to use the new structs.
I don't fully understand your description of the problem. Do you mean you inlined the contents of `decode_metadata` or something?

Is there anything I can do to make the pattern more amenable to the thrift-remodel branch?
Sorry, this was mostly a note to myself. When I did the merge I changed the `decode_metadata` call to:

```rust
let mut prot = ThriftSliceInputProtocol::new(buf);
ParquetMetaData::read_thrift(&mut prot)
```

Instead I should do the same in `parser::decode_metadata`. No changes on your end are necessary 😄
```diff
 /// Parses column orders from Thrift definition.
 /// If no column orders are defined, returns `None`.
-pub(crate) fn parse_column_orders(
+fn parse_column_orders(
```
This will go away, btw.
🤖: Benchmark completed
🤔 those benchmark results look really bad. I will investigate.
🤖: Benchmark completed
😅 that looks much better. Let's do this!

Thanks again @etseidl
# Which issue does this PR close?

- Split `ParquetMetadataReader` into IO/decoder state machine and thrift parsing #8439

# Rationale for this change

The current `ParquetMetadataDecoder` intermixes three things:

1. IO
2. Locating the metadata within the file
3. Decoding thrift into Rust structures
This makes it almost impossible to add features like "only decode a subset of the columns in the ColumnIndex" and other potentially advanced use cases.

Now that we have a "push" style API for metadata decoding that avoids IO, the next step is to extract the actual work into this API so that the existing `ParquetMetadataDecoder` just calls into the PushDecoder.
# What changes are included in this PR?

- Move the decoder state machine into the `parser` module

This almost certainly will conflict with @etseidl's plans in thrift-remodel.
# Are these changes tested?

Yes, by existing tests.

# Are there any user-facing changes?

Not really -- this is an internal change that will make it easier to add features like "only decode a subset of the columns in the ColumnIndex", for example.