Collect `RowGroupMetaData` when writing Parquet dataset for writing `_metadata` sidecar #146

kylebarron · 2022-06-07T20:03:38Z

Some Parquet implementations use semi-standard _metadata and/or _common_metadata files to store global metadata in one place for a partitioned Parquet dataset. For example, here are the pyarrow docs on the subject.

Having these files makes it much easier to selectively read a partitioned Parquet dataset based on dataset statistics. To read any row group(s) based on column statistics, the reader needs only to:

1. Parse the one top-level _metadata file using arrow2::io::parquet::read::read_metadata.
2. Parse the statistics of each RowGroupMetaData in the FileMetaData.
3. Select the row group(s) that meet your criteria.
4. For each row group you want to read, join the file path of its column chunks (accessible from the file_path() method on each ColumnChunkMetaData) onto the Parquet root directory. Then read those row group(s) directly into a Chunk using arrow2::io::parquet::read::read_columns_many.

This process removes the need to read each individual Parquet file's footer, and thus removes most of the performance drawbacks of splitting data into multiple files.
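As a rough illustration of the selection and path-resolution steps above, here is a minimal pure-Python sketch. It does not use arrow2 or pyarrow: `RowGroupInfo`, `select_row_groups`, `resolve_paths`, and the example paths are all hypothetical stand-ins for the information a reader would pull out of each RowGroupMetaData in the _metadata file.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class RowGroupInfo:
    """Hypothetical stand-in for the fields a reader would extract from
    one RowGroupMetaData entry of the `_metadata` file."""
    file_path: str   # path relative to the dataset root
    index: int       # position of the row group within that file
    min_value: int   # column minimum from the row group statistics
    max_value: int   # column maximum from the row group statistics


def select_row_groups(row_groups, lo, hi):
    """Keep row groups whose [min, max] range overlaps [lo, hi]."""
    return [rg for rg in row_groups if rg.max_value >= lo and rg.min_value <= hi]


def resolve_paths(root, selected):
    """Join each selected row group's relative path onto the dataset root."""
    return [(str(PurePosixPath(root) / rg.file_path), rg.index) for rg in selected]


# Made-up statistics for three row groups spread over two files.
row_groups = [
    RowGroupInfo("year=2020/part-0.parquet", 0, 0, 99),
    RowGroupInfo("year=2020/part-0.parquet", 1, 100, 199),
    RowGroupInfo("year=2021/part-0.parquet", 0, 200, 299),
]
selected = select_row_groups(row_groups, lo=150, hi=250)
targets = resolve_paths("/data/my_dataset", selected)
```

Only the row groups in `targets` would then need to be opened, so no per-file footer has to be read to decide what to skip.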

But now I'm struggling to figure out how to write this _metadata file using arrow2. Specifically, this approach requires being able to:

1. Access the FileMetaData/RowGroupMetaData structs that are created when writing Parquet.
2. Combine multiple RowGroupMetaData into a single FileMetaData (seems possible by creating the FileMetaData struct manually).
3. Write a metadata-only file from a FileMetaData struct (seems possible by copying end_file).

It's the first part that I'm unsure how to implement without creating a new reader to read the metadata back from the file on disk. I think I'm essentially looking for a public API on arrow2::io::parquet::write::FileWriter to access the written FileMetaData, for example if FileWriter.end() returned Result<(u64, FileMetaData)>.

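For context, pyarrow's suggested process looks like this (where `table` is a pyarrow Table and `root_path` is a pathlib.Path):

import pyarrow.parquet as pq

# Write a dataset and collect metadata information of all written files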
metadata_collector = []
pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)

# Write the ``_common_metadata`` parquet file without row groups statistics
pq.write_metadata(table.schema, root_path / '_common_metadata')

# Write the ``_metadata`` parquet file with row groups statistics of all files
pq.write_metadata(
    table.schema, root_path / '_metadata',
    metadata_collector=metadata_collector
)

That is, pyarrow.parquet.write_to_dataset and pyarrow.parquet.ParquetWriter accept a metadata_collector argument, which starts out as an empty list. The writer appends the FileMetaData of each file it writes to that list.
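To make the collector pattern concrete, here is a minimal pure-Python sketch of the idea. It is not pyarrow's or arrow2's implementation: `FileMeta`, `write_file`, and `combine` are hypothetical names, and no Parquet bytes are actually written. It only models the bookkeeping of a writer appending metadata to a caller-owned list.

```python
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    """Hypothetical stand-in for a per-file FileMetaData object."""
    file_path: str
    row_group_sizes: list = field(default_factory=list)


def write_file(path, batches, metadata_collector=None):
    """Pretend to write `batches` to `path`, one row group per batch.

    No Parquet bytes are produced here; only the bookkeeping is modeled.
    """
    meta = FileMeta(file_path=path, row_group_sizes=[len(b) for b in batches])
    if metadata_collector is not None:
        # The caller owns the list; the writer only appends to it.
        metadata_collector.append(meta)
    return meta


def combine(collected):
    """Flatten per-file metadata into (file_path, num_rows) entries,
    one per row group, as a sidecar file would need."""
    return [(m.file_path, n) for m in collected for n in m.row_group_sizes]


collector = []
write_file("part-0.parquet", [[1, 2, 3], [4, 5]], metadata_collector=collector)
write_file("part-1.parquet", [[6]], metadata_collector=collector)
sidecar_entries = combine(collector)
```

The design point is that the writer hands back metadata it already has in memory, so nothing needs to be re-read from disk before the combined metadata is assembled.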


jorgecarleitao · 2022-06-07T21:45:44Z

Thanks a lot for the suggestion. I hope it is OK that I moved this to parquet2, since as far as I understand this is more generic than arrow2.

I tried to address this in #147. Would you be willing to review it? I am not super familiar with this functionality, so I hope to rely on your expertise to judge the implementation 🙇

kylebarron · 2022-06-07T22:15:44Z

I wasn't sure whether to make an issue in arrow2 or parquet2, but it does make sense that this is entirely Parquet specific.

I think my only question is whether arrow2 would need a change after #147 in order to access the parquet2 writer? arrow2::io::parquet::write::FileWriter.writer is a private member of the struct, and FileWriter.into_inner seems to return the file, not the parquet2 FileWriter, right?

> I hope to rely on your expertise to judge the implementation

I wouldn't call myself an expert, but everything seems right 😄

jorgecarleitao · 2022-06-07T22:18:36Z

Exactly, we need to expose this in arrow2. It is backward compatible, so we can release a patch both here and over there this week and have it available throughout :)

jorgecarleitao transferred this issue from jorgecarleitao/arrow2 Jun 7, 2022

jorgecarleitao mentioned this issue Jun 7, 2022

Added support to write sidecar #147

Merged

jorgecarleitao added the feature label Jun 7, 2022

jorgecarleitao closed this as completed in #147 Jun 8, 2022

jorgecarleitao added the no-changelog label Jun 10, 2022

jorgecarleitao mentioned this issue Jun 10, 2022

Added support to write parquet _metadata sidecar jorgecarleitao/arrow2#1063

Merged

kylebarron mentioned this issue Jan 28, 2023

feat(rust,python): Enable object store in scan_parquet python pola-rs/polars#6426

Closed
