Description
Currently pyarrow's parquet writer only writes `_common_metadata` and not `_metadata`. From what I understand, these are intended to contain the dataset schema but not any row group information.
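For reference, a minimal sketch of how the two files differ when written by hand with later pyarrow APIs (`pq.write_metadata` and its `metadata_collector` argument are assumed to be available; the file and directory names are illustrative):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

os.makedirs("dataset_root", exist_ok=True)
pq.write_table(table, "dataset_root/part-0.parquet")

# Footer metadata of the written piece, with its path relative to the root.
piece_md = pq.read_metadata("dataset_root/part-0.parquet")
piece_md.set_file_path("part-0.parquet")

# _common_metadata: the dataset schema only, no row group information.
pq.write_metadata(table.schema, "dataset_root/_common_metadata")

# _metadata: the schema plus the row group metadata of every piece.
pq.write_metadata(table.schema, "dataset_root/_metadata",
                  metadata_collector=[piece_md])
```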
A few (possibly naive) questions:
- In the `__init__` for `ParquetDataset`, the following lines exist:
```python
if self.metadata_path is not None:
    with self.fs.open(self.metadata_path) as f:
        self.common_metadata = ParquetFile(f).metadata
else:
    self.common_metadata = None
```

I believe this should use `common_metadata_path` instead of `metadata_path`, as the latter is never written by pyarrow and points to the `_metadata` file rather than `_common_metadata` (as seemingly intended?).
- In `validate_schemas`, I believe an option should exist for using the schema from `_common_metadata` instead of `_metadata`, as pyarrow currently only writes the former, and as far as I can tell `_common_metadata` does include all the schema information needed.

Perhaps the logic in `validate_schemas` could be changed to:
```python
if self.schema is not None:
    pass  # schema explicitly provided
elif self.metadata is not None:
    self.schema = self.metadata.schema
elif self.common_metadata is not None:
    self.schema = self.common_metadata.schema
else:
    self.schema = self.pieces[0].get_metadata(open_file).schema
```

If these changes are valid, I'd be happy to submit a PR. It's not 100% clear to me what the difference between `_common_metadata` and `_metadata` is, but I believe the schema in both should be the same. Figured I'd open this for discussion.
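As a sanity check on the claim that the two footers carry the same schema, one could compare them directly once both files exist; a small sketch (the `dataset_root` paths follow the illustrative example above):

```python
import pyarrow.parquet as pq

common_md = pq.read_metadata("dataset_root/_common_metadata")
md = pq.read_metadata("dataset_root/_metadata")

# Both footers should describe the same Arrow schema; only _metadata
# additionally carries per-row-group information for the dataset pieces.
assert common_md.schema.to_arrow_schema().equals(md.schema.to_arrow_schema())
print(common_md.num_row_groups, md.num_row_groups)  # typically 0 vs. N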
Related issues:
- [Python][Dataset] Detect and use _metadata file in a list of file paths (relates to)
- [Python] Partition columns are not correctly loaded in schema of ParquetDataset (is related to)
- [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file (is related to)
Note: This issue was originally created as ARROW-2079. Please see the migration documentation for further details.