Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: Enable vectorized reads by default #4196

Merged
merged 2 commits into from
Feb 23, 2022

Conversation

aokolnychyi
Copy link
Contributor

This PR enables vectorized Parquet reads by default. This feature has been available for quite some time and being used by multiple companies in prod. I do anticipate more bugs to be found when we enable this by default but I think there is sufficient confidence it will perform reasonably well in most cases.

@github-actions github-actions bot added the core label Feb 22, 2022
@aokolnychyi
Copy link
Contributor Author

I think the test fail because we don't support vectorized reads with INT96 timestamps (for legacy imported files).

java.lang.UnsupportedOperationException: Unsupported type: required int96 tmp_col = 2

@aokolnychyi
Copy link
Contributor Author

cc @rdblue @RussellSpitzer @jackye1995 @szehon-ho @flyrain @karuppayya

What do you think? Do we have to support INT96 in the vectorized path before enabling it by default?

@rdblue
Copy link
Contributor

rdblue commented Feb 22, 2022

It looks like we should either support INT96 vectorized reads, or turn off vectorization when we see there is an INT96 column.

@aokolnychyi
Copy link
Contributor Author

aokolnychyi commented Feb 22, 2022

I think we know there is an INT96 column only after we opened the file, which is already too late.
That means we should probably wait until INT96 columns are supported.

@rdblue
Copy link
Contributor

rdblue commented Feb 22, 2022

I think we know there is an INT96 column only after we opened the file, which is already too late. That means we should probably wait until INT96 columns are supported.

Ah, you're right. I'd opt to ignore this, then. INT96 timestamps are not in the Iceberg spec, for exactly this reason. Iceberg progress shouldn't be held up by them.

@aokolnychyi
Copy link
Contributor Author

aokolnychyi commented Feb 22, 2022

Vectorization is a big deal and helps not only queries but also row-level operations. It would be a bit unfortunate to be blocked by this. I hope someone can work on supporting INT96. Let me see who did the original implementation. Maybe, they can work on the vectorized path too.

That being said, I am also inclined to still enable vectorization. At least, this is what we did internally.

@aokolnychyi
Copy link
Contributor Author

@gustavoatt, looks like you implemented the initial support for INT96. Would you be interested in adding that support to the vectorized path? We consider enabling vectorized reads by default and it is going to cause failures for INT96 timestamps.

@github-actions github-actions bot added the spark label Feb 22, 2022
@aokolnychyi
Copy link
Contributor Author

I've adapted the failing test for now.

@rdblue rdblue merged commit ec2f1ad into apache:master Feb 23, 2022
@rdblue
Copy link
Contributor

rdblue commented Feb 23, 2022

Thanks, @aokolnychyi!

@aokolnychyi
Copy link
Contributor Author

Thanks, @rdblue! Created #4200 to discuss adding support for INT96 to the vectorized path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants