Description
When specifying a directory to ParquetDataset, we will detect if a _metadata file is present in the directory and use that to populate the metadata attribute (and not include this file in the list of "pieces", since it does not include any data).
However, when passing a list of files to ParquetDataset with one of them being "_metadata", the metadata attribute is not populated; instead, the "_metadata" path is included as one of the ParquetDatasetPiece objects (which leads to an ArrowIOError when reading that piece).
We could detect it in a list of paths as well.
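As a rough illustration, the detection could be a simple filtering step over the supplied paths before pieces are created. This is a minimal sketch, not the actual pyarrow implementation; the helper name `split_metadata_paths` is hypothetical. The special file names "_metadata" and "_common_metadata" follow the Parquet dataset sidecar convention.

```python
import os

# Sidecar file names used by the Parquet dataset convention.
METADATA_NAMES = {"_metadata", "_common_metadata"}

def split_metadata_paths(paths):
    """Hypothetical helper: split a flat list of paths into
    (metadata sidecar paths, data file paths)."""
    metadata, data = [], []
    for path in paths:
        if os.path.basename(path) in METADATA_NAMES:
            metadata.append(path)
        else:
            data.append(path)
    return metadata, data
```

The sidecar paths would then be used to populate the dataset's metadata attribute, while only the data paths become pieces.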
Note: I mentioned ParquetDataset, but when working on this, we should probably implement it directly in the datasets-API-based version.
Also, I labeled this as Python and not C++ for now, as this might be something that can be handled on the Python side (once the C++ side knows how to process this kind of metadata -> ARROW-8062).
Reporter: Joris Van den Bossche / @jorisvandenbossche
Related issues:
- [Python][C++][Dataset] Implement split_row_groups for ParquetDataset (relates to)
- [Python][C++] Possibly use _common_metadata for schema if _metadata isn't available (is related to)
- [Python][C++] Document how to write _metadata, _common_metadata files with Parquet datasets (is related to)
- [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file (is related to)
Note: This issue was originally created as ARROW-8446. Please see the migration documentation for further details.