[C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file

Partitioned parquet datasets sometimes come with `_metadata` / `_common_metadata` files. Those files include information about the schema of the full dataset and potentially all RowGroup metadata as well (for `_metadata`).

Using those files when creating a parquet `Dataset` can make the factory more efficient: the stored schema can be used instead of inferring one by unifying the schemas of all files, and the recorded paths to the individual parquet files can be used instead of crawling the directory.

Basically, based on those files, the schema, the list of paths and the partition expressions (the information that is needed to create a Dataset) could be constructed.  
Such logic could be put in a separate factory class, e.g. `ParquetManifestFactory` (as suggested by @fsaintjacques).
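
To illustrate the idea, here is a minimal, stdlib-only Python sketch of the "paths and partition expressions" part. It is not the proposed C++ implementation. It assumes that the relative file path recorded for each row group in `_metadata` has already been extracted into a plain list, and that the directory layout is hive-style (`key=value`). The helper names `parse_hive_partition` and `fragments_from_row_group_paths` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def parse_hive_partition(path: str) -> Dict[str, str]:
    """Return the key=value directory segments of a relative file path.

    Assumption: hive-style partitioning; values are kept as strings.
    """
    keys: Dict[str, str] = {}
    # The last segment is the file name, so only directories are inspected.
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            keys[key] = value
    return keys


def fragments_from_row_group_paths(
    row_group_paths: Iterable[str],
) -> List[Tuple[str, Dict[str, str]]]:
    """Map each distinct file path to its partition keys, in first-seen order."""
    fragments: Dict[str, Dict[str, str]] = {}
    for path in row_group_paths:
        # A file with several row groups appears several times; keep it once.
        if path not in fragments:
            fragments[path] = parse_hive_partition(path)
    return list(fragments.items())


# Hypothetical paths, one entry per row group, as they might be listed in `_metadata`.
row_group_paths = [
    "year=2019/part-0.parquet",
    "year=2019/part-0.parquet",  # second row group of the same file
    "year=2020/part-0.parquet",
]
fragments = fragments_from_row_group_paths(row_group_paths)
for path, partition in fragments:
    print(path, partition)
```

No directory listing is needed in this sketch: the file list and the partition information both come from the paths alone, while the schema would come from the metadata file itself.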

**Reporter**: [Joris Van den Bossche](https://issues.apache.org/jira/browse/ARROW-8062) / @jorisvandenbossche
**Assignee**: [Francois Saint-Jacques](https://issues.apache.org/jira/browse/ARROW-8062) / @fsaintjacques
#### Related issues:
- [[Python][Dataset] Detect and use _metadata file in a list of file paths](https://github.com/apache/arrow/issues/24624) (relates to)
- [[C++][Dataset] Scanner::ToTable race when ScanTask exit early with an error](https://github.com/apache/arrow/issues/25010) (relates to)
- [[Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't available](https://github.com/apache/arrow/issues/18055) (relates to)
- [[Python] Multi-file parquet loading without scan](https://github.com/apache/arrow/issues/19586) (relates to)
- [[C++][Dataset][Python] ParquetFileFragment should provide access to parquet FileMetadata](https://github.com/apache/arrow/issues/24885) (relates to)
#### PRs and other links:
- [GitHub Pull Request #7180](https://github.com/apache/arrow/pull/7180)

<sub>**Note**: *This issue was originally created as [ARROW-8062](https://issues.apache.org/jira/browse/ARROW-8062). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
