Add options to control various aspects of Parquet metadata decoding #8763
base: main
Conversation
Here's an excerpt from a run of the new benchmark that shows the schema is actually skipped. This should get even faster with the metadata index (#8714)
    .with_column_index_policy(self.column_index)
    .with_metadata_options(self.metadata_options.clone());
At some point I could see moving the page index policy into the MetadataOptions and then deprecating a bunch of setters.
parquet/src/file/metadata/parser.rs
    // the credentials and keys needed to decrypt metadata
    file_decryption_properties: Option<Arc<FileDecryptionProperties>>,
    // metadata parsing options
    metadata_options: Option<MetadataOptions>,
Wondering if this should be Option<Arc<MetadataOptions>> everywhere.
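For illustration, a minimal generic sketch of what the Arc-wrapped form buys: the options value is built once and shared by pointer, so handing it to several components never clones the contents (the generic `T` is a placeholder, not the PR's actual field type):

```rust
use std::sync::Arc;

// Build the options once, then share the same allocation everywhere;
// cloning an `Arc` only bumps a reference count.
fn share<T>(options: T) -> (Option<Arc<T>>, Option<Arc<T>>) {
    let shared = Arc::new(options);
    (Some(Arc::clone(&shared)), Some(Arc::clone(&shared)))
}
```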
This may help with #5999
/// [`ParquetMetaDataPushDecoder`]: crate::file::metadata::ParquetMetaDataPushDecoder
#[derive(Default, Debug, Clone)]
pub struct MetadataOptions {
    schema_descr: Option<SchemaDescPtr>,
Does this mean (1) a user-provided schema, or (2) that only the columns in schema_descr have their (min, max, etc.) decoded?
It's (1). Say you have a large number of files that share the same schema; there's no need to decode the schema in every footer. Just grab the schema from the first file and use it for all the others.
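A rough sketch of that workflow, assuming the `MetadataOptions`, `with_schema`, and `with_metadata_options` names from this PR's diff (the module path for `MetadataOptions` and the exact argument types are assumptions):

```rust
use parquet::errors::Result;
use parquet::file::metadata::{MetadataOptions, ParquetMetaDataReader};
use parquet::schema::types::SchemaDescPtr;
use std::fs::File;

fn read_all_footers(paths: &[&str]) -> Result<()> {
    // Decode the first footer in full to obtain the shared schema
    let first = File::open(paths[0])?;
    let metadata = ParquetMetaDataReader::new().parse_and_finish(&first)?;
    let schema: SchemaDescPtr = metadata.file_metadata().schema_descr_ptr();

    // Reuse that schema for the remaining files so their footers skip schema decoding
    let options = MetadataOptions::default().with_schema(schema);
    for path in &paths[1..] {
        let file = File::open(path)?;
        let _metadata = ParquetMetaDataReader::new()
            // method from this PR's diff; the argument type is an assumption
            .with_metadata_options(options.clone())
            .parse_and_finish(&file)?;
        // ... prune/plan using `_metadata` ...
    }
    Ok(())
}
```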
Here is a ticket that explains the use case a bit more;
this API looks good to me (and actually closes an existing ticket)
(I didn't approve it b/c it is still marked as a draft)
Thanks @alamb. I'm still messing around with the API. I think I like hiding the new options object in the file reader APIs, and just exposing it for the metadata readers. The last wrinkle is figuring out a good way to share across the …
Ok, I think this is ready now. Right now I'm mildly against pulling in the page index policies. They are used at a higher level and I don't think it's worth the thrash to move them. Instead I want to focus on options that impact the …
Thank you @etseidl - this looks great to me
    supplied_schema: Option<SchemaRef>,
    /// Policy for reading offset and column indexes.
    pub(crate) page_index_policy: PageIndexPolicy,
    /// Options to control reading of Parquet metadata
I reviewed the ArrowReaderOptions and ArrowReaderMetadata structures and their use, and I agree this is the appropriate structure to add the metadata parsing options to.
Do you think it eventually makes sense to move the other fields from ArrowReaderOptions to ParquetMetaDataOptions? (e.g. supplied_schema)
I was thinking perhaps the page_index_policy, but the other things in ArrowReaderOptions are more Arrow-specific than Parquet-specific. That might get confusing.
/// [Parquet Spec]: https://github.com/apache/parquet-format#metadata
pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
-    decode_metadata(buf)
+    decode_metadata(buf, None)
I wonder if we should start directing people to the push metadata decoder (the metadata reader is getting pretty complicated...)
Yes, that would be nice. Maintaining two public APIs that do pretty much the same thing is a bit too much.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Thanks for the review @alamb. There's one last thing I have a question on that maybe you could opine on. 🙏
    /// Provide a schema to use when decoding the metadata.
    pub fn set_schema(&mut self, val: SchemaDescPtr) {
        self.schema_descr = Some(val);
    }

    /// Provide a schema to use when decoding the metadata. Returns `Self` for chaining.
    pub fn with_schema(mut self, val: SchemaDescPtr) -> Self {
        self.schema_descr = Some(val);
        self
    }
}
I'm not sure how much I like having a setter and chaining mutator for the same object. It's easier to use the setter from higher level objects like ArrowReaderOptions, but nicer to have the more builder-like form when directly constructing a ParquetMetaDataOptions to pass to the metadata decoders.
I'm going to look into a macro to cut down on the visual bloat (if not actual code bloat).
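For what it's worth, a hypothetical sketch of the kind of declarative macro that could generate both forms from one declaration (not the PR's code; all names here are illustrative):

```rust
// Generates a `set_*` setter and a `with_*` chaining builder for one optional field.
macro_rules! option_accessors {
    ($(#[$doc:meta])* $set_name:ident, $with_name:ident, $field:ident: $ty:ty) => {
        $(#[$doc])*
        pub fn $set_name(&mut self, val: $ty) {
            self.$field = Some(val);
        }

        $(#[$doc])*
        pub fn $with_name(mut self, val: $ty) -> Self {
            self.$field = Some(val);
            self
        }
    };
}

// Hypothetical usage inside `impl MetadataOptions { ... }`:
// option_accessors!(
//     /// Provide a schema to use when decoding the metadata.
//     set_schema, with_schema, schema_descr: SchemaDescPtr
// );
```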
> I'm going to look into a macro to cut down on the visual bloat (if not actual code bloat).
I personally think it is ok to have two sets of APIs for setting things -- while it is visual bloat as you say I think the methods are so simple it is fairly easy (if repetitive) to understand
While a macro might be clever I worry it would make the code that much harder to understand 🤷 -- probably a matter of opinion
Which issue does this PR close?
`SchemaDescriptorPtr` across `ParquetMetadata` objects #5999

Rationale for this change
This is a first attempt at an object to help control the parsing of the Parquet metadata.
What changes are included in this PR?
Adds a new `MetadataOptions` struct, and plumbs it down into the Thrift decoder code. The only option for now is to pass in a schema, which then causes the decoder to skip decoding the schema contained in the footer.

Also adds to the metadata bench to demonstrate the time savings from reusing the schema.
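As a rough illustration, a minimal sketch of constructing the new option (`with_schema` is from the diff above; the `parquet::file::metadata` module path is an assumption):

```rust
use parquet::file::metadata::MetadataOptions;
use parquet::schema::types::SchemaDescPtr;

// Build options that carry an already-decoded schema; when passed to the
// metadata decoders, the footer's schema elements are not re-decoded and the
// provided `SchemaDescPtr` is used instead.
fn skip_schema_decode(shared_schema: SchemaDescPtr) -> MetadataOptions {
    MetadataOptions::default().with_schema(shared_schema)
}
```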
Are these changes tested?
Yes, adds a new test.
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.