-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Add coerce int96 option for Parquet to support different TimeUnits, test int96_from_spark.parquet from parquet-testing #15537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
19 commits
Select commit
Hold shift + click to select a range
4e8b309
Stash.
mbutrovich 958050c
Stash.
mbutrovich 2dbcfbf
Checkpoint.
mbutrovich 7dd593d
update arrow
mbutrovich ad32c9f
Fix after merging main.
mbutrovich 6fad37b
Merge branch 'main' into int96_again
mbutrovich 72b2ac7
Merge branch 'main' into int96_again
mbutrovich 4bfdc92
Add test for int96_from_spark.
mbutrovich f080f83
Remove commented out code.
mbutrovich 69ed7d4
Update parquet-testing to include int96_from_spark.parquet.
mbutrovich b1a31ba
Add more test scenarios. Need to understand why ms and s look wrong f…
mbutrovich 8f372a1
Adjust test scenarios.
mbutrovich f5f41ff
Merge branch 'main' into int96_again
mbutrovich 796c5f1
Fix sqlllogic tests.
mbutrovich f9227c9
Update docs.
mbutrovich e8e0324
clippy
mbutrovich 4a48ff9
Fix docs.
mbutrovich b75a960
Adjust config wording based on PR feedback.
mbutrovich 83a1397
Update comment.
mbutrovich File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is any usecase for int96 other than timestamps.
Specifically, maybe we can simply always change the behavior and coerce int96 --> microseconds
At the very least default the option to be enabled perhaps
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not as far as I know, but I don't think the (deprecated) int96 spec said that it had to represent a timestamp. It's just where Spark, Hive, Impala, etc. ended up.
It's not clear to me if we should assume that an int96 originated from a system that treated the originating timestamp it as microseconds. While it's very likely that it originated from one of those systems, I don't know how to treat the default in this case. Snowflake, for example, seems to use microseconds for its timestamps when dealing with Iceberg:
https://docs.snowflake.com/en/user-guide/tables-iceberg-data-types#supported-data-types-for-iceberg-tables
I'm hesitant to mess with defaults, but am open to hearing more from the community. @parthchandra
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given how dominant spark is and how rarely used int96 is outside the spark ecosystem, I was thinking that basically if anyone had such a file it is likely we should treat the values as microseconds.
I don't have a strong preference, I was just trying to come up with a way to keep the code less compilcated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, the use of int96 originated from Impala/Parquet-cpp where it was used to store nanoseconds (The C++ implementation came from the Impala team). I think the Java implementation ended up with int96 in order to be compatible. Spark came along with its own variant and well, here we are.
(https://issues.apache.org/jira/browse/PARQUET-323)
The Parquet community assumed that this was the only usage of int96 before it was deprecated so I feel it is a safe for us to assume the same.
It can be done as a follow up, though, I feel.