Description
Currently pyarrow's parquet writer only writes `_common_metadata` and not `_metadata`. From what I understand, these are intended to contain the dataset schema but not any row group information.
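For reference, a minimal sketch of how the two files differ when written by hand with later pyarrow APIs (`pq.write_metadata` and its `metadata_collector` argument are assumed to be available; the file and directory names are illustrative):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

os.makedirs("dataset_root", exist_ok=True)
pq.write_table(table, "dataset_root/part-0.parquet")

# Footer metadata of the written piece, with its path relative to the root.
piece_md = pq.read_metadata("dataset_root/part-0.parquet")
piece_md.set_file_path("part-0.parquet")

# _common_metadata: the dataset schema only, no row group information.
pq.write_metadata(table.schema, "dataset_root/_common_metadata")

# _metadata: the schema plus the row group metadata of every piece.
pq.write_metadata(table.schema, "dataset_root/_metadata",
                  metadata_collector=[piece_md])
```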
A few (possibly naive) questions:
- In the `__init__` for `ParquetDataset`, the following lines exist:
```python
if self.metadata_path is not None:
    with self.fs.open(self.metadata_path) as f:
        self.common_metadata = ParquetFile(f).metadata
else:
    self.common_metadata = None
```

I believe this should use `common_metadata_path` instead of `metadata_path`, as the latter is never written by pyarrow and points to the `_metadata` file rather than `_common_metadata` (as seemingly intended?).
- In `validate_schemas`, I believe an option should exist for using the schema from `_common_metadata` instead of `_metadata`, as pyarrow currently only writes the former, and as far as I can tell `_common_metadata` does include all the schema information needed.

Perhaps the logic in `validate_schemas` could be changed to:
```python
if self.schema is not None:
    pass  # schema explicitly provided
elif self.metadata is not None:
    self.schema = self.metadata.schema
elif self.common_metadata is not None:
    self.schema = self.common_metadata.schema
else:
    self.schema = self.pieces[0].get_metadata(open_file).schema
```

If these changes are valid, I'd be happy to submit a PR. It's not 100% clear to me what the difference between `_common_metadata` and `_metadata` is, but I believe the schema in both should be the same. Figured I'd open this for discussion.
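As a sanity check on the claim that the two footers carry the same schema, one could compare them directly once both files exist; a small sketch (the `dataset_root` paths follow the illustrative example above):

```python
import pyarrow.parquet as pq

common_md = pq.read_metadata("dataset_root/_common_metadata")
md = pq.read_metadata("dataset_root/_metadata")

# Both footers should describe the same Arrow schema; only _metadata
# additionally carries per-row-group information for the dataset pieces.
assert common_md.schema.to_arrow_schema().equals(md.schema.to_arrow_schema())
print(common_md.num_row_groups, md.num_row_groups)  # typically 0 vs. N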
Related issues:
- [Python][Dataset] Detect and use _metadata file in a list of file paths (relates to)
- [Python] Partition columns are not correctly loaded in schema of ParquetDataset (is related to)
- [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file (is related to)
Note: This issue was originally created as ARROW-2079. Please see the migration documentation for further details.