Move ParquetMetadata decoder state machine into ParquetMetadataPushDecoder #8340

alamb · 2025-09-12T19:46:51Z

Which issue does this PR close?

Rationale for this change

The current ParquetMetadataDecoder intermixes three things:

The state machine for decoding parquet metadata (footer, then metadata, then (optional) indexes)
Orchestrating IO (i.e. calling read, etc.)
Decoding thrift-encoded bytes into objects

This makes it almost impossible to add features like "only decode a subset of the columns in the ColumnIndex" and other potentially advanced use cases.

Now that we have a "push" style API for metadata decoding that avoids IO, the next step is to extract out the actual work into this API so that the existing ParquetMetadataDecoder just calls into the PushDecoder
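To make the "push" idea concrete, here is a minimal sketch of a decoder that never performs IO itself: it only reports which byte ranges it still needs, and the caller fetches them however it likes. Everything in it is an assumption for illustration: `ToyPushDecoder`, `Step`, and `decode_from_slice` are hypothetical names, and the toy file layout (`[payload][4-byte little-endian payload length]`) is far simpler than Parquet. It is not the parquet crate's API.

```rust
use std::ops::Range;

/// What the decoder reports after each attempt (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Step {
    /// The decoder cannot proceed until the caller supplies these byte ranges.
    NeedsData(Vec<Range<u64>>),
    /// Decoding finished; this is the decoded payload.
    Done(Vec<u8>),
}

/// Toy decoder for a file laid out as `[payload][4-byte LE payload length]`.
/// It holds no file handle and never reads anything on its own.
pub struct ToyPushDecoder {
    file_len: u64,
    /// Buffers pushed by the caller, tagged with the range they cover.
    buffers: Vec<(Range<u64>, Vec<u8>)>,
}

impl ToyPushDecoder {
    pub fn new(file_len: u64) -> Self {
        Self { file_len, buffers: Vec::new() }
    }

    /// The caller pushes bytes it fetched (from a file, object store, cache, ...).
    pub fn push_range(&mut self, range: Range<u64>, data: Vec<u8>) {
        self.buffers.push((range, data));
    }

    /// Look up a previously pushed buffer that exactly matches `range`.
    fn get(&self, range: &Range<u64>) -> Option<&[u8]> {
        self.buffers
            .iter()
            .find(|(r, _)| r == range)
            .map(|(_, d)| d.as_slice())
    }

    /// Either finish decoding or say which bytes are still missing.
    pub fn try_decode(&self) -> Result<Step, String> {
        if self.file_len < 4 {
            return Err("file too small to contain a footer".to_string());
        }
        let footer = self.file_len - 4..self.file_len;
        let footer_bytes = match self.get(&footer) {
            Some(b) if b.len() == 4 => b,
            Some(_) => return Err("footer buffer has the wrong length".to_string()),
            None => return Ok(Step::NeedsData(vec![footer])),
        };
        let len = u32::from_le_bytes([
            footer_bytes[0],
            footer_bytes[1],
            footer_bytes[2],
            footer_bytes[3],
        ]) as u64;
        if len > footer.start {
            return Err("payload length exceeds file size".to_string());
        }
        let payload = footer.start - len..footer.start;
        match self.get(&payload) {
            Some(bytes) => Ok(Step::Done(bytes.to_vec())),
            None => Ok(Step::NeedsData(vec![payload])),
        }
    }
}

/// The IO side, kept entirely outside the decoder. Here "IO" is just slicing
/// an in-memory buffer, but it could equally be a blocking or async read.
pub fn decode_from_slice(file: &[u8]) -> Result<Vec<u8>, String> {
    let mut decoder = ToyPushDecoder::new(file.len() as u64);
    loop {
        match decoder.try_decode()? {
            Step::Done(payload) => return Ok(payload),
            Step::NeedsData(ranges) => {
                for r in ranges {
                    let data = file[r.start as usize..r.end as usize].to_vec();
                    decoder.push_range(r, data);
                }
            }
        }
    }
}
```

Because the decoder only ever sees byte buffers, the same state machine can be driven by a synchronous reader, an async reader, or a test harness without change.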

What changes are included in this PR?

Extract decoding state machine into ParquetMetadataPushDecoder
Extract thrift parsing into its own parser module
Update ParquetMetadataDecoder to use the ParquetMetadataPushDecoder
Extract the bytes --> object code into its own module

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

Are these changes tested?

by existing tests

Are there any user-facing changes?

Not really -- this is an internal change that will make it easier to add features like "only decode a subset of the columns in the ColumnIndex", for example.

etseidl · 2025-09-12T20:11:21Z

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅

alamb · 2025-09-12T20:14:10Z

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅

Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.

etseidl · 2025-09-12T20:17:56Z

Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.

Agreed. Looking forward to this one. I'm hoping for a much more flexible metadata parsing regime after the dust settles.

etseidl · 2025-09-15T16:04:55Z

I just did a test merge of this branch with the head of my remodel branch and it went pretty smoothly. The few conflicts were easily resolved. 🚀

parquet/src/file/metadata/parser.rs

alamb · 2025-09-20T10:55:24Z

parquet/src/file/metadata/push_decoder.rs

-            return Ok(DecodeResult::NeedsData(vec![file_len - 8..file_len]));
+        let footer_len = FOOTER_SIZE as u64;
+        loop {
+            match std::mem::replace(&mut self.state, DecodeState::Intermediate) {


Here is the core state machine that makes it very clear, in my mind, what is happening.

I am quite pleased with how this decoder state machine is looking
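The `std::mem::replace(&mut self.state, DecodeState::Intermediate)` call in the diff above is a common Rust idiom: move the current state out of `&mut self`, leave a placeholder behind, and then store whichever state comes next. A minimal sketch of that idiom follows. The `State` variants and the `Machine` type are hypothetical and only loosely mirror the footer, then metadata, then done sequence; the real decoder's states and payloads may differ.

```rust
/// Hypothetical decoder states for illustration only.
#[derive(Debug, PartialEq)]
pub enum State {
    ReadingFooter,
    ReadingMetadata { metadata_len: usize },
    Finished { summary: String },
    /// Placeholder left in `self.state` while a transition is in progress.
    Intermediate,
}

pub struct Machine {
    pub state: State,
}

impl Machine {
    pub fn new() -> Self {
        Self { state: State::ReadingFooter }
    }

    /// Perform one transition. `input` stands in for whatever was decoded
    /// from the bytes that the current state was waiting for.
    pub fn step(&mut self, input: usize) -> Result<(), String> {
        // Take ownership of the current state so owned payloads can be
        // consumed by value instead of being cloned out of `&mut self`.
        let next = match std::mem::replace(&mut self.state, State::Intermediate) {
            State::ReadingFooter => State::ReadingMetadata { metadata_len: input },
            State::ReadingMetadata { metadata_len } => State::Finished {
                summary: format!("decoded {} bytes", metadata_len),
            },
            // Terminal state: put it back unchanged.
            done @ State::Finished { .. } => done,
            // Only reachable if an earlier transition returned early
            // without restoring a real state.
            State::Intermediate => {
                return Err("bug: decoder left in the intermediate state".to_string())
            }
        };
        self.state = next;
        Ok(())
    }
}
```

One design consequence worth noting: any early return between the `replace` and the final assignment leaves the placeholder in place, so later calls can detect and report that situation instead of silently continuing.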

alamb · 2025-09-24T18:25:22Z

Ok, I am now pretty happy with this PR and how it looks. I broke it up into a few PRs to make reviews easier

You can see the results in this PR as the last commit

If/when those PRs are merged I'll rebase this one and mark it as ready for review

# Which issue does this PR close?

- Part of #8000
- Prep PR for #8340, to make it easier to review

Note while this is a large (in line count) code change, it should be relatively easy to review as it is just moving code around

# Rationale for this change

In #8340 I am trying to split the "IO" from the "where is the metadata in the file" from the "decode thrift into Rust structures" logic. The first part of this is simply to move the code that handles the "decode thrift into Rust structures" into its own module.

# What changes are included in this PR?

1. Move most of the "parse thrift bytes into rust structure" code from `parquet/src/file/metadata/mod.rs` to `parquet/src/file/metadata/parser.rs`

# Are these changes tested?

yes, by CI

# Are there any user-facing changes?

No, this is entirely internal reorganization

---------

Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>

# Which issue does this PR close?

- Part of #8000
- Prep PR for #8340, to make it easier to review

# Rationale for this change

In #8340 I am trying to split the "IO" from the "where is the metadata in the file" from the "decode thrift into Rust structures" logic. I want to make it as easy as possible to review so I split it into pieces, but you can see #8340 for how it all fits together

# What changes are included in this PR?

This PR moves the code that handles parsing the 8 byte parquet file footer, `FooterTail`, into its own module and constructor

# Are these changes tested?

yes, by CI

# Are there any user-facing changes?

No, this is entirely internal reorganization and I left a `pub use`

---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
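For readers unfamiliar with the 8 byte footer mentioned above: a Parquet file ends with a 4-byte little-endian metadata length followed by a 4-byte magic, `PAR1` for a plaintext footer or `PARE` for an encrypted one. The sketch below parses that tail. `ToyFooterTail` and `parse_footer_tail` are hypothetical names chosen for illustration and are not the crate's `FooterTail` type or its constructor.

```rust
/// Size in bytes of the fixed-length tail at the end of a Parquet file.
pub const FOOTER_SIZE: usize = 8;

/// Hypothetical stand-in for the information carried by the 8-byte tail.
#[derive(Debug, PartialEq)]
pub struct ToyFooterTail {
    /// Length in bytes of the thrift-encoded metadata that precedes the tail.
    pub metadata_length: usize,
    /// True when the magic is `PARE`, i.e. the footer itself is encrypted.
    pub encrypted_footer: bool,
}

/// Parse the last 8 bytes of a file: `[u32 LE metadata length][4-byte magic]`.
pub fn parse_footer_tail(tail: &[u8; FOOTER_SIZE]) -> Result<ToyFooterTail, String> {
    let magic = &tail[4..];
    let encrypted_footer = if magic == b"PAR1" {
        false
    } else if magic == b"PARE" {
        true
    } else {
        return Err("invalid Parquet file: corrupt footer magic".to_string());
    };
    let metadata_length =
        u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as usize;
    Ok(ToyFooterTail { metadata_length, encrypted_footer })
}
```

With the tail parsed, the metadata itself occupies the `metadata_length` bytes immediately before the tail, which is the next range a decoder would ask for.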

alamb · 2025-09-26T13:32:43Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/refactor_push_decoder (4fb5ce5) to 6ecbd62 diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_refactor_push_decoder
Results will be posted here when complete

alamb · 2025-09-26T13:36:20Z

🤖: Benchmark completed

Details

group                                 alamb_refactor_push_decoder            main
-----                                 ---------------------------            ----
decode parquet metadata               1.01     25.7±0.30µs        ? ?/sec    1.00     25.5±0.22µs        ? ?/sec
decode parquet metadata (wide)        1.03    154.5±3.73ms        ? ?/sec    1.00    149.7±2.00ms        ? ?/sec
decode thrift file metadata           1.01     16.9±0.05µs        ? ?/sec    1.00     16.8±0.10µs        ? ?/sec
decode thrift file metadata (wide)    1.01    107.1±0.52ms        ? ?/sec    1.00    106.4±0.30ms        ? ?/sec
open(default)                         1.04     24.3±0.09µs        ? ?/sec    1.00     23.4±0.28µs        ? ?/sec
open(page index)                      1.01   1297.6±2.40µs        ? ?/sec    1.00  1279.8±10.47µs        ? ?/sec
page headers                          1.00      7.4±0.05µs        ? ?/sec    1.00      7.3±0.02µs        ? ?/sec

parquet/src/file/metadata/parser.rs

alamb · 2025-09-24T17:05:26Z

parquet/src/file/metadata/reader.rs

    metadata_size: Option<usize>,
    #[cfg(feature = "encryption")]
-    file_decryption_properties: Option<FileDecryptionProperties>,
+    file_decryption_properties: Option<std::sync::Arc<FileDecryptionProperties>>,


The FileDecryptionProperties is currently copied, which is unfortunate.

As a follow-on PR, I plan to update the options elsewhere to use an Arc<FileDecryptionProperties> to avoid copies
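As a rough illustration of why wrapping the properties in an `Arc` avoids copies, the sketch below hands the same properties to several readers by cloning only the reference-counted pointer. `ToyDecryptionProperties`, `ToyReader`, and `make_readers` are hypothetical stand-ins invented for this example, not the crate's types.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a properties struct that owns key material
/// and would be comparatively expensive to deep-copy.
#[derive(Debug, Clone)]
pub struct ToyDecryptionProperties {
    pub footer_key: Vec<u8>,
}

/// Hypothetical reader that optionally holds shared decryption properties.
pub struct ToyReader {
    pub props: Option<Arc<ToyDecryptionProperties>>,
}

impl ToyReader {
    pub fn new(props: Option<Arc<ToyDecryptionProperties>>) -> Self {
        Self { props }
    }
}

/// Build `n` readers that all share one allocation of the properties.
/// `Arc::clone` only bumps a reference count; the key bytes are not copied.
pub fn make_readers(props: &Arc<ToyDecryptionProperties>, n: usize) -> Vec<ToyReader> {
    (0..n)
        .map(|_| ToyReader::new(Some(Arc::clone(props))))
        .collect()
}
```

Storing `Option<ToyDecryptionProperties>` instead would force a full `clone()` of the key material for every reader.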

alamb · 2025-09-26T13:19:55Z

parquet/src/file/metadata/parser.rs

+
+    /// API for decoding metadata that may be encrypted
+    #[derive(Debug, Default)]
+    pub(crate) struct MetadataParser {


I am thinking we can eventually use this structure as the place to hang more detailed decoding instructions (like "only decode statistics for column A" and so on)

alamb · 2025-09-26T13:36:17Z

parquet/src/file/metadata/push_decoder.rs

+
+    /// Create a decoder with the given `ParquetMetaData` already known.
+    ///
+    /// This can be used to parse and populate the page index structures


I think this is now a nice API to load/decode PageIndexes -- provide an existing ParquetMetadata and then this decoder figures out what bytes are needed and parses them. If we ever want to extend ParquetMetadata to include, for example, BloomFilters, we could use the same basic idea
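The "figures out what bytes are needed" step can be sketched as computing one byte range that covers every column index and offset index recorded in the already-decoded metadata. The version below is an assumption-laden toy: `ToyColumnChunk` is a hypothetical record of where each index lives, and this `range_for_page_index` is a free function written for illustration, not the method of the same name visible in the reader diff.

```rust
use std::ops::Range;

/// Hypothetical per-column-chunk record of where its page index
/// structures live in the file (either may be absent).
#[derive(Debug, Clone, Default)]
pub struct ToyColumnChunk {
    pub column_index: Option<Range<u64>>,
    pub offset_index: Option<Range<u64>>,
}

/// Return the smallest single range covering every index structure,
/// or `None` when no chunk has any page index at all.
pub fn range_for_page_index(chunks: &[ToyColumnChunk]) -> Option<Range<u64>> {
    let mut covering: Option<Range<u64>> = None;
    for chunk in chunks {
        // Visit whichever of the two index ranges are present.
        for r in chunk.column_index.iter().chain(chunk.offset_index.iter()) {
            covering = Some(match covering {
                None => r.clone(),
                Some(c) => c.start.min(r.start)..c.end.max(r.end),
            });
        }
    }
    covering
}
```

A push-style decoder could return this range as its "needs data" request and parse the indexes once the caller supplies the bytes; fetching one covering range trades some extra bytes for fewer requests.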

alamb · 2025-09-26T13:38:42Z

parquet/src/file/metadata/reader.rs

-        // Get bounds needed for page indexes (if any are present in the file).
-        let Some(range) = self.range_for_page_index() else {
-            return Ok(());
+        let Some(metadata) = self.metadata.take() else {


I had hoped we would be able to remove more of the logic from ParquetMetadataReader but I couldn't figure out how to do so given the somewhat complex way it supports reading metadata even when the file length isn't known

alamb · 2025-09-26T14:42:32Z

This PR is now ready for review

alamb · 2025-09-26T15:01:45Z

Looks like the benchmark differences are noise. I have an idea to reduce some allocations though, which I will push up here

etseidl · 2025-09-26T15:35:35Z

I'm going to try merging this into my remodel branch and see what comes up.

etseidl

Love where this is heading! 🚀

etseidl · 2025-09-26T15:57:38Z

parquet/src/file/metadata/parser.rs

+                    "Parquet file has an encrypted footer but the encryption feature is disabled"
+                ))
+            } else {
+                decode_metadata(buf)


This is the only problematic line for the merge. For my initial pass I replaced this call with the thrift decode, but on second thought I should just change the implementation of decode_metadata below to use the new structs.

I don't fully understand your description of the problem. Do you mean you inlined the contents of decode_metadata or something?

Is there anything I can do to make the pattern more amenable to the thrift-remodel branch?

Sorry, this was mostly a note to myself. When I did the merge I changed the decode_metadata call to

let mut prot = ThriftSliceInputProtocol::new(buf);
ParquetMetaData::read_thrift(&mut prot)

Instead I should do the same in parser::decode_metadata.

No changes on your end are necessary 😄

etseidl · 2025-09-26T15:58:11Z

parquet/src/file/metadata/parser.rs

 /// Parses column orders from Thrift definition.
 /// If no column orders are defined, returns `None`.
-pub(crate) fn parse_column_orders(
+fn parse_column_orders(


This will go away, btw.

parquet/src/file/metadata/push_decoder.rs

alamb · 2025-09-29T11:04:34Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/refactor_push_decoder (12bca80) to 4d18401 diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_refactor_push_decoder
Results will be posted here when complete

alamb · 2025-09-29T11:10:20Z

🤖: Benchmark completed

Details

group                                 alamb_refactor_push_decoder            main
-----                                 ---------------------------            ----
decode parquet metadata               1.53     40.3±5.80µs        ? ?/sec    1.00     26.2±0.08µs        ? ?/sec
decode parquet metadata (wide)        1.47    263.9±9.26ms        ? ?/sec    1.00   179.2±39.86ms        ? ?/sec
decode thrift file metadata           1.63     27.8±1.42µs        ? ?/sec    1.00     17.1±0.10µs        ? ?/sec
decode thrift file metadata (wide)    1.14   131.8±30.54ms        ? ?/sec    1.00   115.5±20.20ms        ? ?/sec
open(default)                         1.71     40.7±2.89µs        ? ?/sec    1.00     23.8±0.13µs        ? ?/sec
open(page index)                      1.21  1969.7±93.64µs        ? ?/sec    1.00  1623.4±353.49µs        ? ?/sec
page headers                          1.00      7.4±0.24µs        ? ?/sec    1.15      8.6±1.56µs        ? ?/sec

alamb · 2025-09-29T19:41:45Z

🤖: Benchmark completed

🤔 those benchmark results look really bad. I will investigate

alamb · 2025-09-30T08:36:24Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/refactor_push_decoder (512195b) to 422da15 diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_refactor_push_decoder
Results will be posted here when complete

alamb · 2025-09-30T08:40:38Z

🤖: Benchmark completed

Details

group                             alamb_refactor_push_decoder            main
-----                             ---------------------------            ----
decode parquet metadata           1.00     24.7±0.11µs        ? ?/sec    1.02     25.3±0.09µs        ? ?/sec
decode parquet metadata (wide)    1.00    146.7±4.81ms        ? ?/sec    1.01    147.5±4.27ms        ? ?/sec
open(default)                     1.01     23.9±0.10µs        ? ?/sec    1.00     23.6±0.30µs        ? ?/sec
open(page index)                  1.01   1291.8±4.23µs        ? ?/sec    1.00   1275.6±3.97µs        ? ?/sec

alamb · 2025-09-30T10:07:40Z

🤖: Benchmark completed

😅 that looks much better. Let's do this!

alamb · 2025-09-30T10:07:47Z

Thanks again @etseidl

github-actions bot added the parquet Changes to the parquet crate label Sep 12, 2025

alamb changed the title ~~Alamb/refactor push decoder~~ Move ParquetMetadata decoder state machine into ParquetMetadataPushDecoder Sep 12, 2025

This was referenced Sep 12, 2025

[thrift-remodel] Begin replacing file metadata reader and convert footer decryption code #8313

Merged

[Parquet] Add ParquetMetadataPushDecoder #8080

Merged

etseidl reviewed Sep 15, 2025

alamb force-pushed the alamb/refactor_push_decoder branch from 86cdf90 to c9ba4e0 Compare September 23, 2025 19:27

This was referenced Sep 24, 2025

Refactor: Move parquet metadata parsing code into its own module #8436

Merged

Refactor: extract FooterTail from ParquetMetadataReader #8437

Merged

alamb force-pushed the alamb/refactor_push_decoder branch 3 times, most recently from e8ff5cb to fc2fd81 Compare September 24, 2025 18:23

alamb force-pushed the alamb/refactor_push_decoder branch from fc2fd81 to 12bccec Compare September 26, 2025 13:18

Move state machine into ParquetMetadataDecoder

533f465

alamb force-pushed the alamb/refactor_push_decoder branch from 4fb5ce5 to 533f465 Compare September 26, 2025 13:47

alamb mentioned this pull request Sep 26, 2025

[Parquet] Reduce size of ParquetMetadata when encryption feature is enabled #8469

Open

alamb marked this pull request as ready for review September 26, 2025 14:42

alamb requested a review from etseidl September 26, 2025 14:42

Merge branch 'main' into alamb/refactor_push_decoder

fdc9f3f

Reduce temporary Vecs

0478cf0

etseidl approved these changes Sep 26, 2025

alamb added 2 commits September 26, 2025 14:03

Use crate::error::Result

95fffc5

Improve comments about when page index policy is checked

12bca80

Merge branch 'main' into alamb/refactor_push_decoder

512195b

alamb merged commit 8eca76d into apache:main Sep 30, 2025
16 checks passed

alamb deleted the alamb/refactor_push_decoder branch September 30, 2025 10:07
