Collect RowGroupMetaData when writing Parquet dataset for writing _metadata sidecar #146
Comments
Thanks a lot for the suggestion. I hope it is ok that I moved this here. I tried to address this in #147. Would you be willing to review it? I am not super familiar with this functionality - I hope to rely on your expertise to judge the implementation 🙇

I wasn't sure whether to make an issue in arrow2 or parquet2, but it does make sense that this is entirely Parquet specific. I think my only question is whether this also needs to be exposed in arrow2.

I wouldn't call myself an expert, but everything seems right 😄

Exactly, we need to expose this in arrow2. It is backward compatible, so we can release a patch on this here and over there this week and have it available throughout :)
Original issue

Some Parquet implementations use semi-standard `_metadata` and/or `_common_metadata` files to store global metadata in one place for a partitioned Parquet dataset. For example, here are the pyarrow docs on the subject.

Having these files makes it much easier to selectively read a partitioned Parquet dataset based on dataset statistics. To read any row group(s) based on column statistics, the reader needs only to:

1. Read the `_metadata` file using `arrow2::io::parquet::read::read_metadata`.
2. Inspect the `statistics` of each `RowGroupMetaData` in the `FileMetaData`.
3. Find the data file of each matching row group (via the `file_path()` method on each `ColumnChunkMetaData`), then read those row group(s) directly by using `arrow2::io::parquet::read::read_columns_many` into a `Chunk`.

This process removes the need for reading any/every individual Parquet file's footer, and thus removes most performance drawbacks of splitting data into multiple files.
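For concreteness, the read path might look roughly like this. This is only a sketch: the module paths follow `arrow2::io::parquet::read`, but exact signatures vary across arrow2 versions, and the predicate check is left as comments:

```rust
use std::fs::File;

use arrow2::error::Result;
use arrow2::io::parquet::read;

fn find_matching_row_groups() -> Result<()> {
    // The `_metadata` sidecar holds the FileMetaData (schema plus every
    // RowGroupMetaData) for the entire dataset in its footer.
    let mut sidecar = File::open("dataset/_metadata")?;
    let metadata = read::read_metadata(&mut sidecar)?;

    for row_group in &metadata.row_groups {
        for column_chunk in row_group.columns() {
            // Each ColumnChunkMetaData carries the column statistics and the
            // `file_path()` of the data file that holds its pages.
            let _path = column_chunk.file_path();
            // ... compare `column_chunk.statistics()` against the query
            // predicate here to decide whether to keep this row group ...
        }
        // For kept row groups: open the file named by `file_path()` and read
        // just that row group, e.g. with `read::read_columns_many`, into a
        // `Chunk`.
    }
    Ok(())
}
```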
But now I'm struggling to figure out how to write this `_metadata` file using `arrow2`. Specifically, this approach requires being able to:

1. Access the `FileMetaData`/`RowGroupMetaData` structs that are created when writing Parquet.
2. Combine each `RowGroupMetaData` into a single `FileMetaData` (seems possible by creating the `FileMetaData` struct manually).
3. Write out that `FileMetaData` struct (seems possible by copying `end_file`).

It's the first part that is unclear to me how to implement without creating a new reader to read back from the file on disk. I think I'm essentially looking for a public API on `arrow2::io::parquet::write::FileWriter` to access the written `FileMetaData`. Something like if `FileWriter.end()` returned `Result<(u64, FileMetaData)>`.
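To make that concrete, here is a sketch of how steps 2 and 3 could look if step 1 existed. `merge_metadata` is an imagined helper, not an existing API, and the trailing comment shows the proposed (not current) `end()` return type:

```rust
// Sketch only: assumes the *proposed* `FileWriter::end` that hands back the
// written FileMetaData; `merge_metadata` is an imagined helper.
use parquet2::metadata::FileMetaData;

/// Fold the FileMetaData collected from each written file into a single
/// FileMetaData describing the whole dataset (step 2 above).
fn merge_metadata(collected: Vec<FileMetaData>) -> FileMetaData {
    let mut parts = collected.into_iter();
    let mut merged = parts.next().expect("wrote at least one file");
    for meta in parts {
        merged.num_rows += meta.num_rows;
        // Assumes each ColumnChunkMetaData already had its `file_path` set
        // when its data file was written, so readers can locate the pages.
        merged.row_groups.extend(meta.row_groups);
    }
    merged
}

// Step 1 (the missing piece) would then look like:
//     let (_size, file_metadata) = writer.end(None)?; // proposed return type
//     metadata_collector.push(file_metadata);
// and step 3 would serialize `merge_metadata(metadata_collector)` into the
// `_metadata` sidecar, e.g. by reusing the footer logic from `end_file`.
```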
For context, `pyarrow`'s suggested process looks like the sketch below. That is, `pyarrow.parquet.write_to_dataset` and `pyarrow.parquet.ParquetWriter` accept a `metadata_collector` argument, which is just an empty list, and that list gets appended to when each row group is written to disk.
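Along the lines of the example in the pyarrow docs (the table contents and paths here are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})  # illustrative data
root_path = "dataset_root"

# pyarrow appends one FileMetaData object to this plain list for each
# file it writes.
metadata_collector = []
pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)

# Write the `_common_metadata` file: the schema only, without row group
# statistics.
pq.write_metadata(table.schema, f"{root_path}/_common_metadata")

# Write the `_metadata` file, including the row group statistics of all
# written files.
pq.write_metadata(
    table.schema,
    f"{root_path}/_metadata",
    metadata_collector=metadata_collector,
)
```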