
[Python][C++] Possibly use _common_metadata for schema if _metadata isn't available #18055

@asfimport

Currently pyarrow's Parquet writer writes only _common_metadata, not _metadata. From what I understand, _common_metadata files are intended to contain the dataset schema but no row group information.

A few (possibly naive) questions:


  1. In the __init__ for ParquetDataset, the following lines exist:
if self.metadata_path is not None:
    with self.fs.open(self.metadata_path) as f:
        self.common_metadata = ParquetFile(f).metadata
else:
    self.common_metadata = None

I believe this should read from common_metadata_path rather than metadata_path. metadata_path points to the _metadata file, which pyarrow never writes, whereas the attribute being set here is common_metadata, which was seemingly intended to come from _common_metadata.

  2. In validate_schemas, I believe there should be an option to use the schema from _common_metadata instead of _metadata, since pyarrow currently writes only the former and, as far as I can tell, _common_metadata includes all the schema information needed.


Perhaps the logic in validate_schemas could be changed to something like:

if self.schema is not None:
    pass  # schema explicitly provided
elif self.metadata is not None:
    self.schema = self.metadata.schema
elif self.common_metadata is not None:
    self.schema = self.common_metadata.schema
else:
    self.schema = self.pieces[0].get_metadata(open_file).schema
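
To make the proposed precedence easier to discuss, here is a minimal standalone sketch of the same fallback order. It is not pyarrow code: `resolve_schema` and its parameters are hypothetical names, and the metadata objects are simple stand-ins that only expose a `.schema` attribute.

```python
from types import SimpleNamespace


def resolve_schema(explicit_schema, metadata, common_metadata, read_first_piece_schema):
    """Return the first available schema, in order of precedence.

    Hypothetical helper mirroring the if/elif chain above:
    explicit schema, then _metadata, then _common_metadata,
    then the footer of the first data file.
    """
    if explicit_schema is not None:
        return explicit_schema  # schema explicitly provided
    if metadata is not None:
        return metadata.schema  # from the _metadata file
    if common_metadata is not None:
        return common_metadata.schema  # from the _common_metadata file
    # Last resort: read the first piece. Passed as a callable so the
    # file is only opened when nothing else is available.
    return read_first_piece_schema()


# Stand-in for a dataset where only _common_metadata was written.
common = SimpleNamespace(schema="schema-from-_common_metadata")
chosen = resolve_schema(None, None, common, lambda: "schema-from-first-piece")
print(chosen)  # prints: schema-from-_common_metadata
```

Passing the first-piece lookup as a callable is a design choice of this sketch only; it keeps the fallback lazy, so no data file is opened when a sidecar file already supplies the schema.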

If these changes are valid, I'd be happy to submit a PR. The difference between _common_metadata and _metadata isn't 100% clear to me, but I believe the schema in both should be the same, so I figured I'd open this for discussion.

Reporter: Jim Crist / @jcrist

Note: This issue was originally created as ARROW-2079. Please see the migration documentation for further details.
