Collect RowGroupMetaData when writing Parquet dataset for writing _metadata sidecar #146

Closed
kylebarron opened this issue Jun 7, 2022 · 3 comments · Fixed by #147

@kylebarron
Contributor

kylebarron commented Jun 7, 2022

Some Parquet implementations use semi-standard _metadata and/or _common_metadata files to store global metadata in one place for a partitioned Parquet dataset. For example, here are the pyarrow docs on the subject.

Having these files makes it much easier to selectively read a partitioned Parquet dataset based on dataset statistics. To read any row group(s) based on column statistics, the reader needs only to:

  1. Parse the single top-level _metadata file using arrow2::io::parquet::read::read_metadata
  2. Parse the statistics of each RowGroupMetaData in the FileMetaData
  3. Select the row group(s) that meet the criteria
  4. For each selected row group, join the Parquet root directory with the relative file path stored in its column chunks (accessible from the file_path() method on each ColumnChunkMetaData). Then read those row group(s) directly into a Chunk using arrow2::io::parquet::read::read_columns_many.
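
The four steps above can be sketched in plain Python. This is only an illustration of the selection logic, not arrow2 or pyarrow code: RowGroupStats, select_row_groups, resolve, and the example paths are hypothetical stand-ins for what a parsed _metadata file would provide, and a single integer column is assumed.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one row group's entry in _metadata."""
    file_path: str   # path relative to the dataset root
    index: int       # position of the row group within its file
    min_value: int   # column minimum from the row group statistics
    max_value: int   # column maximum from the row group statistics


def select_row_groups(row_groups, lo, hi):
    """Keep row groups whose [min, max] range overlaps the query range [lo, hi]."""
    return [rg for rg in row_groups if rg.max_value >= lo and rg.min_value <= hi]


def resolve(root, rg):
    """Join the dataset root with the relative path stored in the metadata."""
    return str(PurePosixPath(root) / rg.file_path)


# Pretend these entries were parsed from the top-level _metadata file.
row_groups = [
    RowGroupStats("year=2020/part-0.parquet", 0, 1, 100),
    RowGroupStats("year=2020/part-0.parquet", 1, 101, 200),
    RowGroupStats("year=2021/part-0.parquet", 0, 201, 300),
]

# Only row groups that may contain values in [150, 250] need to be read.
selected = select_row_groups(row_groups, 150, 250)
to_read = [(resolve("/data/dataset", rg), rg.index) for rg in selected]
```

Each (path, row group index) pair in to_read is what a real reader would then open directly, without touching the footers of the other files.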

This process removes the need for reading any/every individual Parquet file's footer, and thus removes most performance drawbacks of splitting data into multiple files.

But now I'm struggling to figure out how to write this _metadata file using arrow2. Specifically, this approach requires being able to:

  1. Access FileMetaData/RowGroupMetaData structs that are created when writing Parquet.
  2. Combine multiple RowGroupMetaData into a single FileMetaData (seems possible by creating the FileMetaData struct manually)
  3. Write a metadata-only file from a FileMetaData struct. (Seems possible by copying end_file)
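
As a rough illustration of step 2, combining per-file metadata mostly means concatenating the row groups while recording which file each one lives in. The sketch below uses plain dataclasses; RowGroupMeta, FileMeta, and combine are hypothetical names, not parquet2 or arrow2 types, and it assumes every file shares one schema.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class RowGroupMeta:
    """Hypothetical stand-in for RowGroupMetaData."""
    num_rows: int
    file_path: str = ""  # left empty inside a data file's own footer


@dataclass
class FileMeta:
    """Hypothetical stand-in for FileMetaData."""
    schema: tuple
    row_groups: list = field(default_factory=list)

    @property
    def num_rows(self):
        return sum(rg.num_rows for rg in self.row_groups)


def combine(collected):
    """Merge (relative_path, FileMeta) pairs into one dataset-level FileMeta."""
    if not collected:
        raise ValueError("no metadata collected")
    schema = collected[0][1].schema
    merged = FileMeta(schema=schema)
    for path, meta in collected:
        if meta.schema != schema:
            raise ValueError(f"schema mismatch in {path}")
        # Record which file each row group lives in, relative to the root.
        merged.row_groups.extend(replace(rg, file_path=path) for rg in meta.row_groups)
    return merged


schema = ("id", "value")
collected = [
    ("part-0.parquet", FileMeta(schema, [RowGroupMeta(10), RowGroupMeta(5)])),
    ("part-1.parquet", FileMeta(schema, [RowGroupMeta(7)])),
]
dataset_meta = combine(collected)
```

Setting the file path at merge time matters because a row group written into its own data file normally has no path recorded; the path only becomes necessary once the metadata is stored somewhere else.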

The first part is the one I can't see how to implement without creating a new reader to read the metadata back from the file on disk. I think I'm essentially looking for a public API on arrow2::io::parquet::write::FileWriter to access the written FileMetaData. For example, FileWriter.end() could return Result<(u64, FileMetaData)>.
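
To make the shape of that proposal concrete, here is a toy writer in Python whose end() returns both the number of bytes written and the metadata it accumulated. ToyWriter and ToyFileMeta are hypothetical and do not write real Parquet; the point is only that the caller gets the metadata back without re-reading the file.

```python
import io
from dataclasses import dataclass, field


@dataclass
class ToyFileMeta:
    """Hypothetical stand-in for FileMetaData: one row count per row group."""
    row_group_sizes: list = field(default_factory=list)


class ToyWriter:
    """Toy writer that tracks metadata while writing, then returns it from end()."""

    def __init__(self, sink):
        self._sink = sink
        self._meta = ToyFileMeta()
        self._written = 0

    def write_row_group(self, rows):
        payload = "\n".join(rows).encode() + b"\n"
        self._written += self._sink.write(payload)
        self._meta.row_group_sizes.append(len(rows))

    def end(self):
        # Write a footer, then hand back the total size and the metadata.
        footer = repr(self._meta.row_group_sizes).encode()
        self._written += self._sink.write(footer)
        return self._written, self._meta


writer = ToyWriter(io.BytesIO())
writer.write_row_group(["a", "b"])
writer.write_row_group(["c"])
size, meta = writer.end()
```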

For context, pyarrow's suggested process looks like this:

import pyarrow.parquet as pq

# Write a dataset and collect metadata information of all written files
metadata_collector = []
pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)

# Write the ``_common_metadata`` parquet file without row groups statistics
pq.write_metadata(table.schema, root_path / '_common_metadata')

# Write the ``_metadata`` parquet file with row groups statistics of all files
pq.write_metadata(
    table.schema, root_path / '_metadata',
    metadata_collector=metadata_collector
)

That is, pyarrow.parquet.write_to_dataset and pyarrow.parquet.ParquetWriter accept a metadata_collector argument, which starts out as an empty list. The FileMetaData of each file is appended to that list as the file is written to disk.

@jorgecarleitao jorgecarleitao transferred this issue from jorgecarleitao/arrow2 Jun 7, 2022
@jorgecarleitao
Owner

Thanks a lot for the suggestion. I hope it is ok that I moved this to parquet2; as far as I understand, this is more generic than arrow2.

I tried to address this in #147. Would you be willing to review it? I am not super familiar with this functionality - I hope to rely on your expertise to judge the implementation 🙇

@jorgecarleitao jorgecarleitao added the feature label Jun 7, 2022
@kylebarron
Contributor Author

kylebarron commented Jun 7, 2022

I wasn't sure whether to make an issue in arrow2 or parquet2, but it does make sense that this is entirely Parquet specific.

I think my only question is whether arrow2 would need a change after #147 in order to access the parquet2 writer? arrow2::io::parquet::write::FileWriter.writer is a private member of the struct, and FileWriter.into_inner seems to return the file, not the parquet2 FileWriter, right?

> I hope to rely on your expertise to judge the implementation

I wouldn't call myself an expert, but everything seems right 😄

@jorgecarleitao
Owner

Exactly, we need to expose this in arrow2. The change is backward compatible, so we can release a patch both here and in arrow2 this week and have it available in both crates :)
