Conversation

alamb
Contributor

@alamb alamb commented Sep 12, 2025

Which issue does this PR close?

Rationale for this change

The current ParquetMetadataDecoder intermixes three things:

  1. The state machine for decoding parquet metadata (footer, then metadata, then (optional) indexes)
  2. orchestrating IO (aka calling read, etc)
  3. Decoding Thrift-encoded bytes into objects

This makes it almost impossible to add features like "only decode a subset of the columns in the ColumnIndex" and other potentially advanced use cases.

Now that we have a "push" style API for metadata decoding that avoids IO, the next step is to extract out the actual work into this API so that the existing ParquetMetadataDecoder just calls into the PushDecoder
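To make the "push" idea concrete, here is a minimal sketch of a decoder that never performs IO itself: it either returns the decoded value or reports which byte ranges it still needs, and the caller fetches them. All names here (`ToyPushDecoder`, `push_range`, `try_decode`, `read_metadata_bytes`) are hypothetical and are not the parquet crate's API; the "metadata" is returned as raw bytes rather than decoded Thrift, and the sketch assumes a well-formed file of at least 8 bytes.

```rust
use std::ops::Range;

/// Outcome of one decode attempt (hypothetical, loosely modeled on the
/// `DecodeResult::NeedsData` that appears in this PR's diff).
#[derive(Debug, PartialEq)]
enum DecodeResult<T> {
    /// The decoder cannot progress until the caller supplies these ranges.
    NeedsData(Vec<Range<u64>>),
    /// Decoding finished.
    Data(T),
}

/// Toy push decoder: it only looks at bytes the caller has pushed in.
struct ToyPushDecoder {
    file_len: u64,
    /// (start offset, bytes) pairs pushed by the caller.
    buffers: Vec<(u64, Vec<u8>)>,
}

impl ToyPushDecoder {
    fn new(file_len: u64) -> Self {
        Self { file_len, buffers: Vec::new() }
    }

    /// Caller hands over the bytes for `range` after doing its own IO.
    fn push_range(&mut self, range: Range<u64>, data: Vec<u8>) {
        assert_eq!(data.len() as u64, range.end - range.start);
        self.buffers.push((range.start, data));
    }

    /// Returns the bytes for `range` if a single pushed buffer covers it.
    fn get(&self, range: &Range<u64>) -> Option<&[u8]> {
        self.buffers.iter().find_map(|(start, data)| {
            let end = *start + data.len() as u64;
            if *start <= range.start && range.end <= end {
                let lo = (range.start - *start) as usize;
                let hi = (range.end - *start) as usize;
                Some(&data[lo..hi])
            } else {
                None
            }
        })
    }

    /// Footer first, then metadata. No validation: a real decoder would
    /// check the magic bytes and guard against bogus lengths.
    fn try_decode(&self) -> DecodeResult<Vec<u8>> {
        let footer_range = self.file_len - 8..self.file_len;
        let Some(footer) = self.get(&footer_range) else {
            return DecodeResult::NeedsData(vec![footer_range]);
        };
        // The first 4 footer bytes hold the little-endian metadata length.
        let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]) as u64;
        let metadata_range = footer_range.start - len..footer_range.start;
        match self.get(&metadata_range) {
            Some(bytes) => DecodeResult::Data(bytes.to_vec()),
            None => DecodeResult::NeedsData(vec![metadata_range]),
        }
    }
}

/// Caller-side IO loop. Here the "file" is an in-memory slice, but the same
/// loop could issue file reads or object store requests instead.
fn read_metadata_bytes(file: &[u8]) -> Vec<u8> {
    let mut decoder = ToyPushDecoder::new(file.len() as u64);
    loop {
        match decoder.try_decode() {
            DecodeResult::Data(bytes) => return bytes,
            DecodeResult::NeedsData(ranges) => {
                for r in ranges {
                    let data = file[r.start as usize..r.end as usize].to_vec();
                    decoder.push_range(r, data);
                }
            }
        }
    }
}
```

Because the decoder only describes what it needs, the same decoding logic can sit behind a synchronous reader, an async reader, or a cache without change.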

What changes are included in this PR?

  1. Extract decoding state machine into PushMetadataDecoder
  2. Extract thrift parsing into its own parser module
  3. Update ParquetMetadataDecoder to use the PushMetadataDecoder
  4. Extract the bytes --> object code into its own module

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

Are these changes tested?

by existing tests

Are there any user-facing changes?

Not really -- this is an internal change that will make it easier to add features like "only decode a subset of the columns in the ColumnIndex", for example.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 12, 2025
@alamb alamb changed the title from "Alamb/refactor push decoder" to "Move ParquetMetadata decoder state machine into ParquetMetadataPushDecoder" Sep 12, 2025
@etseidl
Contributor

etseidl commented Sep 12, 2025

> This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅

@alamb
Contributor Author

alamb commented Sep 12, 2025

> > This almost certainly will conflict with @etseidl 's plans in thrift-remodel.
>
> I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅

Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.

@etseidl
Contributor

etseidl commented Sep 12, 2025

> Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.

Agreed. Looking forward to this one. I'm hoping for a much more flexible metadata parsing regime after the dust settles.

@etseidl
Contributor

etseidl commented Sep 15, 2025

I just did a test merge of this branch with the head of my remodel branch and it went pretty smoothly. The few conflicts were easily resolved. 🚀

return Ok(DecodeResult::NeedsData(vec![file_len - 8..file_len]));
let footer_len = FOOTER_SIZE as u64;
loop {
match std::mem::replace(&mut self.state, DecodeState::Intermediate) {
Contributor Author

@alamb alamb Sep 20, 2025

Here is the core state machine that makes it very clear, in my mind, what is happening.

I am quite pleased with how this decoder state machine is looking
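The `std::mem::replace(&mut self.state, DecodeState::Intermediate)` line in the diff above is a common Rust idiom for state machines whose states own data. A minimal sketch of the idiom follows; the `State` and `Machine` types and their variants are hypothetical and much simpler than the decoder in this PR.

```rust
/// States own their data, so a transition has to move values out of the
/// current state. `Intermediate` is a placeholder that sits in `self.state`
/// only while the next state is being computed.
#[derive(Debug, PartialEq)]
enum State {
    ReadingFooter,
    ReadingMetadata { remaining: usize, buf: Vec<u8> },
    Finished { metadata: Vec<u8> },
    Intermediate,
}

struct Machine {
    state: State,
}

impl Machine {
    fn new() -> Self {
        Self { state: State::ReadingFooter }
    }

    /// Feed some input and advance the machine by one step.
    fn step(&mut self, input: &[u8]) {
        // Swap the placeholder in and take the current state by value, so
        // owned fields such as `buf` can be moved instead of cloned.
        let next = match std::mem::replace(&mut self.state, State::Intermediate) {
            State::ReadingFooter => {
                if input.len() < 4 {
                    State::ReadingFooter // not enough input yet
                } else {
                    let len = u32::from_le_bytes([input[0], input[1], input[2], input[3]]);
                    State::ReadingMetadata { remaining: len as usize, buf: Vec::new() }
                }
            }
            State::ReadingMetadata { remaining, mut buf } => {
                let n = remaining.min(input.len());
                buf.extend_from_slice(&input[..n]);
                if remaining == n {
                    // `buf` moves into the final state without a copy.
                    State::Finished { metadata: buf }
                } else {
                    State::ReadingMetadata { remaining: remaining - n, buf }
                }
            }
            finished @ State::Finished { .. } => finished,
            State::Intermediate => unreachable!("placeholder state is never observed"),
        };
        self.state = next;
    }
}
```

Every arm must produce a next state, so the placeholder is always overwritten before `step` returns; an early return or `?` inside the match would leave the machine stuck in `Intermediate`, which is the main hazard of this pattern.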

@alamb
Contributor Author

alamb commented Sep 24, 2025

Ok, I am now pretty happy with this PR and how it looks. I broke it up into a few PRs to make reviews easier

You can see the results in this PR as the last commit

If/when those PRs are merged I'll rebase this one and mark it as ready for review

alamb added a commit that referenced this pull request Sep 25, 2025
# Which issue does this PR close?

- Part of #8000
- Prep PR for #8340, to make it
easier to review

Note while this is a large (in line count) code change, it should be
relatively easy to review as it is just moving code around

# Rationale for this change

In #8340 I am trying to split the
"IO" from the "where is the metadata in the file" from the "decode
thrift into Rust structures" logic. The first part of this is simply to
move the code that handles the "decode thrift into Rust structures" into
its own module.


# What changes are included in this PR?

1. Move most of the "parse thrift bytes into rust structure" code from
`parquet/src/file/metadata/mod.rs` to
`parquet/src/file/metadata/parser.rs`

# Are these changes tested?

yes, by CI


# Are there any user-facing changes?

No, this is entirely internal reorganization

---------

Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
alamb added a commit that referenced this pull request Sep 25, 2025
# Which issue does this PR close?

- Part of #8000
- Prep PR for #8340, to make it
easier to review

# Rationale for this change

In #8340 I am trying to split the
"IO" from the "where is the metadata in the file" from the "decode
thrift into Rust structures" logic.

I want to make it as easy as possible to review so I split it into
pieces, but you can see #8340 for
how it all fits together

# What changes are included in this PR?

This PR cleans up the code that handles parsing the 8 byte parquet file
footer, `FooterTail`, into its own module and constructor
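
For readers unfamiliar with the footer layout: the last 8 bytes of a Parquet file are a 4-byte little-endian metadata length followed by a 4-byte magic (`PAR1`, or `PARE` when the footer is encrypted). The sketch below parses that layout; `ToyFooterTail` and `parse_footer_tail` are hypothetical names for illustration and are not the `FooterTail` API from this commit.

```rust
/// Hypothetical stand-in for the parsed footer tail.
#[derive(Debug, PartialEq)]
struct ToyFooterTail {
    /// Length in bytes of the Thrift-encoded metadata preceding the footer.
    metadata_length: usize,
    /// True when the magic indicates an encrypted footer.
    encrypted_footer: bool,
}

/// Parse the final 8 bytes of a Parquet file.
fn parse_footer_tail(bytes: &[u8; 8]) -> Result<ToyFooterTail, String> {
    // Bytes 4..8 are the magic number.
    let encrypted_footer = match &bytes[4..] {
        b"PAR1" => false,
        b"PARE" => true,
        other => return Err(format!("invalid Parquet magic: {other:?}")),
    };
    // Bytes 0..4 are the little-endian metadata length.
    let metadata_length =
        u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
    Ok(ToyFooterTail { metadata_length, encrypted_footer })
}
```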

# Are these changes tested?

yes, by CI


# Are there any user-facing changes?

No, this is entirely internal reorganization and I left a `pub use`

---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
@alamb alamb force-pushed the alamb/refactor_push_decoder branch from fc2fd81 to 12bccec Compare September 26, 2025 13:18
@alamb
Contributor Author

alamb commented Sep 26, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/refactor_push_decoder (4fb5ce5) to 6ecbd62 diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_refactor_push_decoder
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Sep 26, 2025

🤖: Benchmark completed

Details

group                                 alamb_refactor_push_decoder            main
-----                                 ---------------------------            ----
decode parquet metadata               1.01     25.7±0.30µs        ? ?/sec    1.00     25.5±0.22µs        ? ?/sec
decode parquet metadata (wide)        1.03    154.5±3.73ms        ? ?/sec    1.00    149.7±2.00ms        ? ?/sec
decode thrift file metadata           1.01     16.9±0.05µs        ? ?/sec    1.00     16.8±0.10µs        ? ?/sec
decode thrift file metadata (wide)    1.01    107.1±0.52ms        ? ?/sec    1.00    106.4±0.30ms        ? ?/sec
open(default)                         1.04     24.3±0.09µs        ? ?/sec    1.00     23.4±0.28µs        ? ?/sec
open(page index)                      1.01   1297.6±2.40µs        ? ?/sec    1.00  1279.8±10.47µs        ? ?/sec
page headers                          1.00      7.4±0.05µs        ? ?/sec    1.00      7.3±0.02µs        ? ?/sec

@alamb alamb force-pushed the alamb/refactor_push_decoder branch from 4fb5ce5 to 533f465 Compare September 26, 2025 13:47
@alamb alamb marked this pull request as ready for review September 26, 2025 14:42
@alamb alamb requested a review from etseidl September 26, 2025 14:42
metadata_size: Option<usize>,
#[cfg(feature = "encryption")]
file_decryption_properties: Option<FileDecryptionProperties>,
file_decryption_properties: Option<std::sync::Arc<FileDecryptionProperties>>,
Contributor Author

The FileDecryptionProperties is currently copied, which is unfortunate.

As a follow-on PR, I plan to update the options elsewhere to use an Arc<FileDecryptionProperties> to avoid copies.
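
A minimal sketch of why holding the properties behind an `Arc` avoids the copy: cloning an `Arc` only bumps a reference count, while cloning the struct duplicates its key material. `ToyDecryptionProperties` and `ToyReaderOptions` are hypothetical stand-ins, not the crate's types.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for decryption properties that own key material.
#[derive(Debug, Clone)]
struct ToyDecryptionProperties {
    footer_key: Vec<u8>,
}

/// Hypothetical reader options that share the properties via `Arc`.
struct ToyReaderOptions {
    decryption: Option<Arc<ToyDecryptionProperties>>,
}

/// Hand the properties to a decoder without copying the key bytes:
/// `Arc::clone` produces another pointer to the same allocation.
fn properties_for_decoder(options: &ToyReaderOptions) -> Option<Arc<ToyDecryptionProperties>> {
    options.decryption.as_ref().map(Arc::clone)
}
```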


/// API for decoding metadata that may be encrypted
#[derive(Debug, Default)]
pub(crate) struct MetadataParser {
Contributor Author

I am thinking we can eventually use this structure as the place to hang more detailed decoding instructions (like "only decode statistics for column A" and so on)


/// Create a decoder with the given `ParquetMetaData` already known.
///
/// This can be used to parse and populate the page index structures
Contributor Author

I think this is now a nice API to load/decode PageIndexes -- provide an existing ParquetMetadata and then this decoder figures out what bytes are needed and parses them. If we ever want to extend ParquetMetadata to include, for example, BloomFilters, we could use the same basic idea
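
As a rough illustration of "figures out what bytes are needed": given already-decoded metadata that records where each column chunk's index structures live, the decoder can compute one covering byte range to request. The sketch below assumes a hypothetical `ToyColumnChunk` type with optional index ranges; it is not the crate's `range_for_page_index` implementation.

```rust
use std::ops::Range;

/// Hypothetical per-column-chunk metadata: where its page index
/// structures are stored in the file, if present.
struct ToyColumnChunk {
    column_index: Option<Range<u64>>,
    offset_index: Option<Range<u64>>,
}

/// Smallest single range covering every index structure, or `None`
/// when the file has no page indexes.
fn range_for_page_index(chunks: &[ToyColumnChunk]) -> Option<Range<u64>> {
    let mut covering: Option<Range<u64>> = None;
    for chunk in chunks {
        // Visit whichever of the two index ranges are present.
        for r in chunk.column_index.iter().chain(chunk.offset_index.iter()) {
            covering = Some(match covering {
                None => r.clone(),
                Some(acc) => acc.start.min(r.start)..acc.end.max(r.end),
            });
        }
    }
    covering
}
```

A push-style decoder would report this range as the data it needs and parse the indexes once the caller supplies the bytes. Requesting one covering range trades some over-read for a single IO request.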

// Get bounds needed for page indexes (if any are present in the file).
let Some(range) = self.range_for_page_index() else {
return Ok(());
let Some(metadata) = self.metadata.take() else {
Contributor Author

I had hoped we would be able to remove more of the logic from ParquetMetadataReader but I couldn't figure out how to do so given the somewhat complex way it supports reading metadata even when the file length isn't known

@alamb
Contributor Author

alamb commented Sep 26, 2025

This PR is now ready for review

@alamb
Contributor Author

alamb commented Sep 26, 2025

Looks like the benchmark differences are noise. I have an idea to reduce some allocations though, which I will push up here

@etseidl
Contributor

etseidl commented Sep 26, 2025

I'm going to try merging this into my remodel branch and see what comes up.

Contributor

@etseidl etseidl left a comment

Love where this is heading! 🚀

"Parquet file has an encrypted footer but the encryption feature is disabled"
))
} else {
decode_metadata(buf)
Contributor

This is the only problematic line for the merge. For my initial pass I replaced this call with the thrift decode, but on second thought I should just change the implementation of decode_metadata below to use the new structs.

Contributor Author

I don't fully understand your description of the problem. Do you mean you inlined the contents of decode_metadata or something?

Is there anything I can do to make the pattern more amenable to the thrift-remodel branch?

Contributor

Sorry, this was mostly a note to myself. When I did the merge I changed the decode_metadata call to

let mut prot = ThriftSliceInputProtocol::new(buf);
ParquetMetaData::read_thrift(&mut prot)

Instead I should do the same in parser::decode_metadata.

No changes on your end are necessary 😄

/// Parses column orders from Thrift definition.
/// If no column orders are defined, returns `None`.
pub(crate) fn parse_column_orders(
fn parse_column_orders(
Contributor

This will go away, btw.

@alamb
Contributor Author

alamb commented Sep 29, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/refactor_push_decoder (12bca80) to 4d18401 diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_refactor_push_decoder
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Sep 29, 2025

🤖: Benchmark completed

Details

group                                 alamb_refactor_push_decoder            main
-----                                 ---------------------------            ----
decode parquet metadata               1.53     40.3±5.80µs        ? ?/sec    1.00     26.2±0.08µs        ? ?/sec
decode parquet metadata (wide)        1.47    263.9±9.26ms        ? ?/sec    1.00   179.2±39.86ms        ? ?/sec
decode thrift file metadata           1.63     27.8±1.42µs        ? ?/sec    1.00     17.1±0.10µs        ? ?/sec
decode thrift file metadata (wide)    1.14   131.8±30.54ms        ? ?/sec    1.00   115.5±20.20ms        ? ?/sec
open(default)                         1.71     40.7±2.89µs        ? ?/sec    1.00     23.8±0.13µs        ? ?/sec
open(page index)                      1.21  1969.7±93.64µs        ? ?/sec    1.00  1623.4±353.49µs        ? ?/sec
page headers                          1.00      7.4±0.24µs        ? ?/sec    1.15      8.6±1.56µs        ? ?/sec

@alamb
Contributor Author

alamb commented Sep 29, 2025

> 🤖: Benchmark completed

🤔 those benchmark results look really bad. I will investigate

@alamb
Contributor Author

alamb commented Sep 30, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/refactor_push_decoder (512195b) to 422da15 diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_refactor_push_decoder
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Sep 30, 2025

🤖: Benchmark completed

Details

group                             alamb_refactor_push_decoder            main
-----                             ---------------------------            ----
decode parquet metadata           1.00     24.7±0.11µs        ? ?/sec    1.02     25.3±0.09µs        ? ?/sec
decode parquet metadata (wide)    1.00    146.7±4.81ms        ? ?/sec    1.01    147.5±4.27ms        ? ?/sec
open(default)                     1.01     23.9±0.10µs        ? ?/sec    1.00     23.6±0.30µs        ? ?/sec
open(page index)                  1.01   1291.8±4.23µs        ? ?/sec    1.00   1275.6±3.97µs        ? ?/sec

@alamb
Contributor Author

alamb commented Sep 30, 2025

> 🤖: Benchmark completed

😅 that looks much better. Let's do this!

@alamb alamb merged commit 8eca76d into apache:main Sep 30, 2025
16 checks passed
@alamb
Contributor Author

alamb commented Sep 30, 2025

Thanks again @etseidl

@alamb alamb deleted the alamb/refactor_push_decoder branch September 30, 2025 10:07
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Parquet] Split ParquetMetadataReader into IO/decoder state machine and thrift parsing
2 participants