Parquet: Enable vectorized reads by default #4196

aokolnychyi · 2022-02-22T17:10:33Z

This PR enables vectorized Parquet reads by default. This feature has been available for quite some time and is being used by multiple companies in prod. I do anticipate that more bugs will be found once we enable this by default, but I think there is sufficient confidence it will perform reasonably well in most cases.
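For anyone who wants to see what the default flip means for an existing table, here is a minimal sketch rather than Iceberg's actual code. `read.parquet.vectorization.enabled` is the Iceberg table property for this setting. The class `VectorizationDefault`, the constant `ENABLED_BY_DEFAULT`, and the helper `vectorizationEnabled` are hypothetical names used only for illustration.

```java
import java.util.Map;

final class VectorizationDefault {
    // Iceberg table property that toggles vectorized Parquet reads.
    static final String PARQUET_VECTORIZATION_ENABLED = "read.parquet.vectorization.enabled";
    // Default proposed by this PR; hard-coded here for illustration only.
    static final boolean ENABLED_BY_DEFAULT = true;

    // Hypothetical helper: an explicit table property wins, otherwise the default applies.
    static boolean vectorizationEnabled(Map<String, String> tableProperties) {
        String value = tableProperties.get(PARQUET_VECTORIZATION_ENABLED);
        return value == null ? ENABLED_BY_DEFAULT : Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        // No explicit setting: the new default applies.
        System.out.println(vectorizationEnabled(Map.of())); // true
        // Explicit opt-out, for example on a table that still hits a vectorized-read bug.
        System.out.println(
            vectorizationEnabled(Map.of(PARQUET_VECTORIZATION_ENABLED, "false"))); // false
    }
}
```

Tables that already set the property explicitly keep their behavior; only tables without the property would pick up the new default.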

aokolnychyi · 2022-02-22T17:42:02Z

I think the tests fail because we don't support vectorized reads with INT96 timestamps (for legacy imported files):

java.lang.UnsupportedOperationException: Unsupported type: required int96 tmp_col = 2

aokolnychyi · 2022-02-22T17:43:56Z

cc @rdblue @RussellSpitzer @jackye1995 @szehon-ho @flyrain @karuppayya

What do you think? Do we have to support INT96 in the vectorized path before enabling it by default?

rdblue · 2022-02-22T17:45:27Z

It looks like we should either support INT96 vectorized reads, or turn off vectorization when we see there is an INT96 column.

aokolnychyi · 2022-02-22T17:49:31Z

I think we know there is an INT96 column only after we opened the file, which is already too late.
That means we should probably wait until INT96 columns are supported.
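To make the timing problem concrete, here is a rough sketch of the "turn off vectorization when there is an INT96 column" option. It is an illustration under stated assumptions, not how Iceberg chooses readers. The `PhysicalType` enum mirrors Parquet's primitive type names, while `Int96Fallback`, `canVectorize`, and the map of footer columns are hypothetical stand-ins.

```java
import java.util.Map;

final class Int96Fallback {
    // Parquet physical (primitive) types.
    enum PhysicalType { BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY, FIXED_LEN_BYTE_ARRAY }

    // Hypothetical check: vectorize only if no column in the file uses INT96.
    // The input is column name -> physical type, which is only known once the
    // Parquet footer has been read, i.e. after the file is already open.
    static boolean canVectorize(Map<String, PhysicalType> footerColumns) {
        return footerColumns.values().stream().noneMatch(t -> t == PhysicalType.INT96);
    }

    public static void main(String[] args) {
        Map<String, PhysicalType> imported =
            Map.of("id", PhysicalType.INT64, "tmp_col", PhysicalType.INT96);
        System.out.println(canVectorize(imported)); // false
    }
}
```

The check itself is trivial. The difficulty is that its input comes from the file footer, so the decision between the vectorized and non-vectorized reader would have to be deferred until each file is opened.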

rdblue · 2022-02-22T17:54:06Z

> I think we know there is an INT96 column only after we opened the file, which is already too late. That means we should probably wait until INT96 columns are supported.

Ah, you're right. I'd opt to ignore this, then. INT96 timestamps are not in the Iceberg spec, for exactly this reason. Iceberg progress shouldn't be held up by them.

aokolnychyi · 2022-02-22T18:01:45Z

Vectorization is a big deal and helps not only queries but also row-level operations. It would be a bit unfortunate to be blocked by this. I hope someone can work on supporting INT96. Let me see who did the original implementation. Maybe they can work on the vectorized path too.

That being said, I am also inclined to still enable vectorization. At least, this is what we did internally.

aokolnychyi · 2022-02-22T21:25:03Z

@gustavoatt, it looks like you implemented the initial support for INT96. Would you be interested in adding that support to the vectorized path? We are considering enabling vectorized reads by default, and it is going to cause failures for INT96 timestamps.

aokolnychyi · 2022-02-22T21:29:19Z

I've adapted the failing test for now.

rdblue · 2022-02-23T00:25:37Z

Thanks, @aokolnychyi!

aokolnychyi · 2022-02-23T00:44:04Z

Thanks, @rdblue! Created #4200 to discuss adding support for INT96 to the vectorized path.

Parquet: Enabled vectorized reads by default

a9326ae

github-actions bot added the core label Feb 22, 2022

Read INT96 timestamps without vectorization

f9d2c1e

github-actions bot added the spark label Feb 22, 2022

rdblue approved these changes Feb 23, 2022


rdblue merged commit ec2f1ad into apache:master Feb 23, 2022

arminnajafi pushed a commit to arminnajafi/iceberg that referenced this pull request Feb 23, 2022

Parquet: Enable vectorized reads by default (apache#4196)

fba472a

JonasJ-ap mentioned this pull request Jan 24, 2023

Delta: Support Snapshot Delta Lake Table to Iceberg Table #6449

Merged

yabola mentioned this pull request Mar 1, 2023

Support vectorized reading int96 timestamps in imported data #6962

Merged
