[C++][Dataset][Python] ParquetFileFragment should provide access to parquet FileMetadata #24885

@asfimport

Description

Related to ARROW-8062 (there we will also need a way to expose the global FileMetadata). Independently of that, it would be useful to have access to the FileMetadata of each ParquetFileFragment (e.g. to get at the statistics).

This would be relatively simple to implement on the Python/R side: we have access to the file path, so we could read the metadata from the file backing the fragment and return it as a FileMetadata object.

I am wondering whether we want to integrate this with ARROW-8062: when the fragments are created from a _metadata file, a ParquetFileFragment.metadata attribute would not need to read the metadata from the parquet file itself, but could take it from the global metadata (at least for e.g. the row group data).

Another question: what should happen for a ParquetFileFragment that maps to a single row group?

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Ben Kietzman / @bkietz

Note: This issue was originally created as ARROW-8733. Please see the migration documentation for further details.
