-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet: Enable vectorized reads by default #4196
Conversation
I think the test fail because we don't support vectorized reads with INT96 timestamps (for legacy imported files).
|
cc @rdblue @RussellSpitzer @jackye1995 @szehon-ho @flyrain @karuppayya What do you think? Do we have to support INT96 in the vectorized path before enabling it by default? |
It looks like we should either support INT96 vectorized reads, or turn off vectorization when we see there is an INT96 column. |
I think we know there is an INT96 column only after we opened the file, which is already too late. |
Ah, you're right. I'd opt to ignore this, then. INT96 timestamps are not in the Iceberg spec, for exactly this reason. Iceberg progress shouldn't be held up by them. |
Vectorization is a big deal and helps not only queries but also row-level operations. It would be a bit unfortunate to be blocked by this. I hope someone can work on supporting INT96. Let me see who did the original implementation. Maybe, they can work on the vectorized path too. That being said, I am also inclined to still enable vectorization. At least, this is what we did internally. |
@gustavoatt, looks like you implemented the initial support for INT96. Would you be interested in adding that support to the vectorized path? We consider enabling vectorized reads by default and it is going to cause failures for INT96 timestamps. |
I've adapted the failing test for now. |
Thanks, @aokolnychyi! |
This PR enables vectorized Parquet reads by default. This feature has been available for quite some time and being used by multiple companies in prod. I do anticipate more bugs to be found when we enable this by default but I think there is sufficient confidence it will perform reasonably well in most cases.