Vectorized Parquet Read In Spark DataSource #90

@mccheah

Description

The Parquet file format reader available in core Spark includes a number of optimizations, the main one being vectorized columnar reading. In considering a potential migration from the old Spark readers to Iceberg, one would be concerned about the performance gap that comes from lacking Spark's numerous optimizations in this space.

It is not clear what the best way is to incorporate these optimizations into Iceberg. One option would be to propose moving this code from Spark to parquet-mr. Another would be to invoke Spark's Parquet reader directly here, but that is an internal API. We could implement vectorized reading directly in Iceberg, but that would largely amount to reinventing the wheel.
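To make the trade-off concrete, here is a minimal sketch of the general vectorized-read pattern under discussion: decoding a column into a reusable fixed-size batch and iterating batch-at-a-time, rather than materializing one row at a time. This is illustrative only; the class and method names are hypothetical and this is not Spark's or Iceberg's actual reader code.

```java
// Illustrative sketch only -- NOT Spark's or Iceberg's API. Shows the shape of
// vectorized columnar reading: fill a reusable batch of decoded values, then
// consume the batch in a tight loop, amortizing per-record overhead.
public class VectorizedReadSketch {

    // A minimal column vector: a reusable, fixed-size batch of decoded values.
    static final class IntColumnVector {
        final int[] values;
        int numValues;
        IntColumnVector(int capacity) { values = new int[capacity]; }
    }

    // Stand-in for encoded column data (in Parquet, a column chunk's values).
    static final int[] ENCODED = new int[10_000];
    static { for (int i = 0; i < ENCODED.length; i++) ENCODED[i] = i; }

    // Vectorized read: decode a batch at a time, then scan the batch in a
    // tight loop instead of paying per-row virtual-call and decoding overhead.
    static long sumVectorized(int batchSize) {
        IntColumnVector batch = new IntColumnVector(batchSize);
        long sum = 0;
        int offset = 0;
        while (offset < ENCODED.length) {
            int n = Math.min(batchSize, ENCODED.length - offset);
            // "Decode" the next batch of column values (here a plain copy).
            System.arraycopy(ENCODED, offset, batch.values, 0, n);
            batch.numValues = n;
            for (int i = 0; i < batch.numValues; i++) {
                sum += batch.values[i];
            }
            offset += n;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumVectorized(1024)); // sum of 0..9999 = 49995000
    }
}
```

Spark's internal `VectorizedParquetRecordReader` follows this batch-oriented shape but adds the pieces that make it hard to replicate: dictionary-aware decoding, null handling via definition levels, and off-heap column vectors, which is why reimplementing it in Iceberg is a substantial effort.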
