Vectorized Parquet Read In Spark DataSource #90

@mccheah

Description

The Parquet file format reader available in core Spark includes a number of optimizations, the main one being vectorized columnar reading. In considering a potential migration from the old Spark readers to Iceberg, one would be concerned about the performance gap that comes from lacking Spark's numerous optimizations in this space.

It is not clear what the best way is to incorporate these optimizations into Iceberg. One option would be to propose moving this code from Spark to parquet-mr. Another would be to invoke Spark's Parquet reader directly here, but that is an internal API. We could implement vectorized reading directly in Iceberg, but that would largely amount to reinventing the wheel.
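To make the trade-off concrete, here is a minimal sketch of the general vectorized-read pattern under discussion: decoding a column into a reusable fixed-size batch and iterating batch-at-a-time, rather than materializing one row at a time. This is illustrative only; the class and method names are hypothetical and this is not Spark's or Iceberg's actual reader code.

```java
// Illustrative sketch only -- NOT Spark's or Iceberg's API. Shows the shape of
// vectorized columnar reading: fill a reusable batch of decoded values, then
// consume the batch in a tight loop, amortizing per-record overhead.
public class VectorizedReadSketch {

    // A minimal column vector: a reusable, fixed-size batch of decoded values.
    static final class IntColumnVector {
        final int[] values;
        int numValues;
        IntColumnVector(int capacity) { values = new int[capacity]; }
    }

    // Stand-in for encoded column data (in Parquet, a column chunk's values).
    static final int[] ENCODED = new int[10_000];
    static { for (int i = 0; i < ENCODED.length; i++) ENCODED[i] = i; }

    // Vectorized read: decode a batch at a time, then scan the batch in a
    // tight loop instead of paying per-row virtual-call and decoding overhead.
    static long sumVectorized(int batchSize) {
        IntColumnVector batch = new IntColumnVector(batchSize);
        long sum = 0;
        int offset = 0;
        while (offset < ENCODED.length) {
            int n = Math.min(batchSize, ENCODED.length - offset);
            // "Decode" the next batch of column values (here a plain copy).
            System.arraycopy(ENCODED, offset, batch.values, 0, n);
            batch.numValues = n;
            for (int i = 0; i < batch.numValues; i++) {
                sum += batch.values[i];
            }
            offset += n;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumVectorized(1024)); // sum of 0..9999 = 49995000
    }
}
```

Spark's internal `VectorizedParquetRecordReader` follows this batch-oriented shape but adds the pieces that make it hard to replicate: dictionary-aware decoding, null handling via definition levels, and off-heap column vectors, which is why reimplementing it in Iceberg is a substantial effort.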
