Add options to control various aspects of Parquet metadata decoding #8763
base: main
Conversation
Here's an excerpt from a run of the new benchmark that shows the schema is actually skipped. This should get even faster with the metadata index (#8714)
    .with_column_index_policy(self.column_index)
    .with_metadata_options(self.metadata_options.clone());
At some point I could see moving the page index policy into the MetadataOptions and then deprecating a bunch of setters.
parquet/src/file/metadata/parser.rs
    // the credentials and keys needed to decrypt metadata
    file_decryption_properties: Option<Arc<FileDecryptionProperties>>,
    // metadata parsing options
    metadata_options: Option<MetadataOptions>,
Wondering if this should be Option<Arc<MetadataOptions>> everywhere.
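For illustration, a minimal generic sketch of what the Arc-wrapped form buys: the options value is built once and shared by pointer, so handing it to several components never clones the contents (the generic `T` is a placeholder, not the PR's actual field type):

```rust
use std::sync::Arc;

// Build the options once, then share the same allocation everywhere;
// cloning an `Arc` only bumps a reference count.
fn share<T>(options: T) -> (Option<Arc<T>>, Option<Arc<T>>) {
    let shared = Arc::new(options);
    (Some(Arc::clone(&shared)), Some(Arc::clone(&shared)))
}
```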
This may help with #5999
/// [`ParquetMetaDataPushDecoder`]: crate::file::metadata::ParquetMetaDataPushDecoder
#[derive(Default, Debug, Clone)]
pub struct MetadataOptions {
    schema_descr: Option<SchemaDescPtr>,
Does this mean (1) a user-provided schema, or (2) that only the columns in schema_descr have their (min, max, etc.) decoded?
It's (1). Say you have a large number of files that share the same schema; there's no need to decode the schema in every footer. Just grab the schema from the first file and use it for all the others.
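A rough sketch of that workflow, assuming the `MetadataOptions`, `with_schema`, and `with_metadata_options` names from this PR's diff (the module path for `MetadataOptions` and the exact argument types are assumptions):

```rust
use parquet::errors::Result;
use parquet::file::metadata::{MetadataOptions, ParquetMetaDataReader};
use parquet::schema::types::SchemaDescPtr;
use std::fs::File;

fn read_all_footers(paths: &[&str]) -> Result<()> {
    // Decode the first footer in full to obtain the shared schema
    let first = File::open(paths[0])?;
    let metadata = ParquetMetaDataReader::new().parse_and_finish(&first)?;
    let schema: SchemaDescPtr = metadata.file_metadata().schema_descr_ptr();

    // Reuse that schema for the remaining files so their footers skip schema decoding
    let options = MetadataOptions::default().with_schema(schema);
    for path in &paths[1..] {
        let file = File::open(path)?;
        let _metadata = ParquetMetaDataReader::new()
            // method from this PR's diff; the argument type is an assumption
            .with_metadata_options(options.clone())
            .parse_and_finish(&file)?;
        // ... prune/plan using `_metadata` ...
    }
    Ok(())
}
```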
Here is a ticket that explains the use case a bit more;
this API looks good to me (and actually closes an existing ticket)
(I didn't approve it b/c it is still marked as a draft)
Thanks @alamb. I'm still messing around with the API. I think I like hiding the new options object in the file reader APIs, and just exposing it for the metadata readers. The last wrinkle is figuring out a good way to share across the …
Ok, I think this is ready now. Right now I'm mildly against pulling in the page index policies. They are used at a higher level and I don't think it's worth the thrash to move them. Instead I want to focus on options that impact the …
Thank you @etseidl - this looks great to me
    supplied_schema: Option<SchemaRef>,
    /// Policy for reading offset and column indexes.
    pub(crate) page_index_policy: PageIndexPolicy,
    /// Options to control reading of Parquet metadata
I reviewed the ArrowReaderOptions and ArrowReaderMetadata structures and their use, and I agree this is the appropriate structure to add the metadata parsing options to.
Do you think it eventually makes sense to move the other fields from ArrowReaderOptions to ParquetMetaDataOptions? (e.g. supplied_schema)
I was thinking perhaps the page_index_policy, but the other things in ArrowReaderOptions are more Arrow-specific than Parquet-specific. That might get confusing.
/// [Parquet Spec]: https://github.com/apache/parquet-format#metadata
pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
-    decode_metadata(buf)
+    decode_metadata(buf, None)
I wonder if we should start directing people to the push metadata decoder (the metadata reader is getting pretty complicated...)
Yes, that would be nice. Maintaining two public APIs that do pretty much the same thing is a bit too much.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Thanks for the review @alamb. There's one last thing I have a question on that maybe you could opine on. 🙏
    /// Provide a schema to use when decoding the metadata.
    pub fn set_schema(&mut self, val: SchemaDescPtr) {
        self.schema_descr = Some(val);
    }

    /// Provide a schema to use when decoding the metadata. Returns `Self` for chaining.
    pub fn with_schema(mut self, val: SchemaDescPtr) -> Self {
        self.schema_descr = Some(val);
        self
    }
}
I'm not sure how much I like having a setter and chaining mutator for the same object. It's easier to use the setter from higher level objects like ArrowReaderOptions, but nicer to have the more builder-like form when directly constructing a ParquetMetaDataOptions to pass to the metadata decoders.
I'm going to look into a macro to cut down on the visual bloat (if not actual code bloat).
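For what it's worth, a hypothetical sketch of the kind of declarative macro that could generate both forms from one declaration (not the PR's code; all names here are illustrative):

```rust
// Generates a `set_*` setter and a `with_*` chaining builder for one optional field.
macro_rules! option_accessors {
    ($(#[$doc:meta])* $set_name:ident, $with_name:ident, $field:ident: $ty:ty) => {
        $(#[$doc])*
        pub fn $set_name(&mut self, val: $ty) {
            self.$field = Some(val);
        }

        $(#[$doc])*
        pub fn $with_name(mut self, val: $ty) -> Self {
            self.$field = Some(val);
            self
        }
    };
}

// Hypothetical usage inside `impl MetadataOptions { ... }`:
// option_accessors!(
//     /// Provide a schema to use when decoding the metadata.
//     set_schema, with_schema, schema_descr: SchemaDescPtr
// );
```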
> I'm going to look into a macro to cut down on the visual bloat (if not actual code bloat).
I personally think it is ok to have two sets of APIs for setting things -- while it is visual bloat as you say I think the methods are so simple it is fairly easy (if repetitive) to understand
While a macro might be clever I worry it would make the code that much harder to understand 🤷 -- probably a matter of opinion
Which issue does this PR close?
`SchemaDescriptorPtr` across `ParquetMetadata` objects #5999

Rationale for this change
This is a first attempt at an object to help control the parsing of the Parquet metadata.
What changes are included in this PR?
Adds a new `MetadataOptions` struct, and plumbs it down into the Thrift decoder code. The only option for now is to pass in a schema, which then causes the decoder to skip decoding the schema contained in the footer.

Also adds to the metadata bench to demonstrate the time savings from reusing the schema.
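As a rough illustration, a minimal sketch of constructing the new option (`with_schema` is from the diff above; the `parquet::file::metadata` module path is an assumption):

```rust
use parquet::file::metadata::MetadataOptions;
use parquet::schema::types::SchemaDescPtr;

// Build options that carry an already-decoded schema; when passed to the
// metadata decoders, the footer's schema elements are not re-decoded and the
// provided `SchemaDescPtr` is used instead.
fn skip_schema_decode(shared_schema: SchemaDescPtr) -> MetadataOptions {
    MetadataOptions::default().with_schema(shared_schema)
}
```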
Are these changes tested?
Yes, adds a new test.
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.