Skip to content

Conversation

@etseidl
Copy link
Contributor

@etseidl etseidl commented Oct 31, 2025

Which issue does this PR close?

Rationale for this change

This is a first attempt at an object to help control the parsing of the Parquet metadata.

What changes are included in this PR?

Adds a new MetadataOptions struct, and plumbs it down into the Thrift decoder code. The only option for now is to pass in a schema, which then causes the decoder to skip decoding the schema contained in the footer.

Also adds to the metadata bench to demonstrate the time savings from reusing the schema.

Are these changes tested?

Yes, adds a new test.

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 31, 2025
@etseidl
Copy link
Contributor Author

etseidl commented Oct 31, 2025

Here's an excerpt from a run of the new benchmark that shows the schema is actually skipped.

decode parquet metadata time:   [14.401 µs 14.436 µs 14.475 µs]
decode metadata with schema
                        time:   [7.1264 µs 7.1461 µs 7.1677 µs]
decode parquet metadata (wide)
                        time:   [48.440 ms 48.828 ms 49.445 ms]
decode metadata (wide) with schema
                        time:   [43.212 ms 43.452 ms 43.793 ms]

This should get even faster with the metadata index (#8714)

Comment on lines +370 to +371
.with_column_index_policy(self.column_index)
.with_metadata_options(self.metadata_options.clone());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point I could see moving the page index policy into the MetadataOptions and then deprecating a bunch of setters.

@etseidl etseidl changed the title Metadata options Add options to control various aspects of Parquet metadata decoding Oct 31, 2025
// the credentials and keys needed to decrypt metadata
file_decryption_properties: Option<Arc<FileDecryptionProperties>>,
// metadata parsing options
metadata_options: Option<MetadataOptions>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if this should be Option<Arc<MetadataOptions>> everywhere.

@etseidl
Copy link
Contributor Author

etseidl commented Nov 1, 2025

This may help with #5999

/// [`ParquetMetaDataPushDecoder`]: crate::file::metadata::ParquetMetaDataPushDecoder
#[derive(Default, Debug, Clone)]
pub struct MetadataOptions {
schema_descr: Option<SchemaDescPtr>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this means (1) User provided schema or (2) only (min, max, etc) columns in schema_descr be decoded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's (1). Say you have a large number of files that share the same schema, there's no need to decode them all. Just grab the schema from the first file and use it for all the others.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a ticket that explains the use case a bit more;

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this API looks good to me (and actually closes an existing ticket)

/// [`ParquetMetaDataPushDecoder`]: crate::file::metadata::ParquetMetaDataPushDecoder
#[derive(Default, Debug, Clone)]
pub struct MetadataOptions {
schema_descr: Option<SchemaDescPtr>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a ticket that explains the use case a bit more;

@alamb
Copy link
Contributor

alamb commented Nov 5, 2025

(I didn't approve it b/c it is still marked as a draft)

@etseidl
Copy link
Contributor Author

etseidl commented Nov 5, 2025

(I didn't approve it b/c it is still marked as a draft)

Thanks @alamb. I'm still messing around with the API. I think I like hiding the new options object in the file reader APIs, and just exposing it for the metadata readers. The last wrinkle is figuring out a good way to share across the ParquetMetaDataReader, ParquetMetaDataPushDecoder, and MetadataParser, with an eye towards sucking the page index policy stuff into the new options object.

@etseidl etseidl marked this pull request as ready for review November 5, 2025 21:35
@etseidl
Copy link
Contributor Author

etseidl commented Nov 5, 2025

Ok, I think this is ready now.

Right now I'm mildly against pulling in the page index policies. They are used at a higher level and I don't think it's worth the thrash to move them. Instead I want to focus on options that impact the FileMetaData parsing (skip stats, transform page encoding stats, etc), and then work in the metadata index to further accelerate the decoding.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl - this looks great to me

supplied_schema: Option<SchemaRef>,
/// Policy for reading offset and column indexes.
pub(crate) page_index_policy: PageIndexPolicy,
/// Options to control reading of Parquet metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the ArrowReaderOptions and ArrowReaderMetadata structures and their use, and I agree this is the appropriate structure to add metadata parsing to.

Do you think it eventually makes sense to move the other fields from ArrowReaderOptions to ParquetMetaDataOptions? (e.g. supplied_schema)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking perhaps the page_index_policy, but the other things in ArrowReaderOptions are more Arrow specific rather than Parquet. That might get confusing.

/// [Parquet Spec]: https://github.com/apache/parquet-format#metadata
pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
decode_metadata(buf)
decode_metadata(buf, None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should start directing people to the push metadata decoder (the metadata reader is getting pretty complicated...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would be nice. Maintaining two public APIs that do pretty much the same thing is a bit too much.

Copy link
Contributor Author

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @alamb. There's one last thing I have a question on that maybe you could opine on. 🙏

Comment on lines +46 to +56
/// Provide a schema to use when decoding the metadata.
pub fn set_schema(&mut self, val: SchemaDescPtr) {
self.schema_descr = Some(val);
}

/// Provide a schema to use when decoding the metadata. Returns `Self` for chaining.
pub fn with_schema(mut self, val: SchemaDescPtr) -> Self {
self.schema_descr = Some(val);
self
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how much I like having a setter and chaining mutator for the same object. It's easier to use the setter from higher level objects like ArrowReaderOptions, but nicer to have the more builder-like form when directly constructing a ParquetMetaDataOptions to pass to the metadata decoders.

I'm going to look into a macro to cut down on the visual bloat (if not actual code bloat).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to look into a macro to cut down on the visual bloat (if not actual code bloat).

I personally think it is ok to have two sets of APIs for setting things -- while it is visual bloat as you say I think the methods are so simple it is fairly easy (if repetitive) to understand

While a macro might be clever I worry it would make the code that much harder to understand 🤷 -- probably a matter of opinion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Way to share SchemaDescriptorPtr across ParquetMetadata objects

3 participants