PARQUET-686: Clarifications about min-max stats. #55
Closed
Should we add a custom sort order for Impala timestamp values until Int96 is removed from the Parquet format?
I don't remember in which context the discussion was but I think INT96 timestamps should be sortable correctly with signed comparison.
If you are referring to Impala timestamps, then I believe signed comparison is not sufficient.
Quoting from https://github.com/cloudera/Impala/blob/b402e342d42b60ff3d01e87d83e9bfba635488cf/tests/util/get_parquet_metadata.py:75
The comparator should compare the date value first (last 4 bytes) and then the nanoseconds (first 8 bytes)?
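A minimal sketch of such a field-wise comparator, assuming the layout discussed in this thread (8 bytes of nanoseconds-of-day followed by 4 bytes of Julian day, each little-endian). The name `int96_field_key` is hypothetical, and treating both fields as signed integers is an assumption, not something the format documentation in this PR states:

```python
import struct

def int96_field_key(raw):
    """Sort key for a 12-byte INT96 timestamp: date first, then time of day."""
    if len(raw) != 12:
        raise ValueError("expected exactly 12 bytes")
    # '<' selects little-endian with no padding: q = 8-byte int, i = 4-byte int.
    # Signedness of both fields is an assumption made for this sketch.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)

# Two hand-built values: late in the evening on one day, midnight on the next.
late_on_day_1 = struct.pack("<qi", 86_399 * 10**9, 2451545)
start_of_day_2 = struct.pack("<qi", 0, 2451546)

ordered = sorted([start_of_day_2, late_on_day_1], key=int96_field_key)
```

Sorting with this key puts `late_on_day_1` before `start_of_day_2`, because the date field is compared before the nanoseconds.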
This is something that someone else should probably also review (cc @rdblue ;) ), but I guess that because the values are stored as little endian, this should be correct. Please don't take this for granted: I haven't had to deal with endianness explicitly in the last few years, so I might have got this wrong.
I doubt if little endian layout that works at the byte level can help with this multi-byte value comparison.
Here is the timestamp '2000-01-01 12:34:56' stored as an int96:
Since 117253024523396126668760320 = 0x60FD4B3229000059682500, the 12 bytes are 00 60 FD 4B 32 29 00 00 | 59 68 25 00, where | shows the boundary between the time and the date parts.
00 60 FD 4B 32 29 00 00 is the time part, if we reverse the bytes we get 0x000029324BFD6000 = 45296 * 10^9 nanoseconds = 45296 seconds = 12 hours + 34 minutes + 56 seconds.
59 68 25 00 is the date part, if we reverse the bytes we get 0x00256859 = 2451545 as the Julian day number, which corresponds to 2000-01-01.
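The decoding above can be reproduced with a short Python sketch. The function name `decode_int96_timestamp` and the constant name are hypothetical; the offset 1721425 is the difference between a Julian day number and Python's proleptic Gregorian ordinal (where 0001-01-01 is day 1). Nanoseconds are truncated to microseconds because `datetime` has no finer resolution:

```python
import struct
from datetime import date, datetime, time, timedelta

# Julian day number minus Python's date ordinal (0001-01-01 has ordinal 1).
JULIAN_DAY_MINUS_ORDINAL = 1721425

def decode_int96_timestamp(raw):
    """Decode 12 bytes (8-byte nanos-of-day + 4-byte Julian day, little-endian)."""
    nanos_of_day, julian_day = struct.unpack("<qI", raw)
    day = date.fromordinal(julian_day - JULIAN_DAY_MINUS_ORDINAL)
    # datetime only stores microseconds, so sub-microsecond digits are dropped.
    return datetime.combine(day, time()) + timedelta(microseconds=nanos_of_day // 1000)

raw = bytes.fromhex("00 60 FD 4B 32 29 00 00 59 68 25 00")
nanos_of_day, julian_day = struct.unpack("<qI", raw)
decoded = decode_int96_timestamp(raw)
```

For the 12 bytes in this comment, `nanos_of_day` is 45296 * 10^9, `julian_day` is 2451545, and `decoded` is 2000-01-01 12:34:56.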
To correctly sort these values without interpreting them as timestamps, the bigger unit (date) should precede the smaller unit (time). In this case they are in the opposite order, but they are also stored with little-endian byte-order (individually), which means that they will be in the correct order if we interpret the whole value in a little-endian manner. So, for correct ordering based purely on numerical value, in comparisons the example above should not be interpreted as 0x0060FD4B3229000059682500 = 117253024523396126668760320 like parquet-tools did, but as 0x00256859000029324BFD6000 = 45223023200227578716446720 instead. Or, to put it more simply, a byte-by-byte comparison starting from the end of the values results in the correct ordering.
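As a rough illustration of the point above, the following sketch compares the example value with midnight of the following day under both interpretations. The helper names are hypothetical, and the integers are treated as unsigned, which is only safe under the assumption that neither field is negative:

```python
def little_endian_key(raw):
    """Interpret all 12 bytes as one little-endian unsigned integer."""
    return int.from_bytes(raw, "little")

def big_endian_key(raw):
    """Interpret the 12 bytes in storage order, as described for parquet-tools above."""
    return int.from_bytes(raw, "big")

# 2000-01-01 12:34:56, the example from this comment.
example = bytes.fromhex("00 60 FD 4B 32 29 00 00 59 68 25 00")
# Midnight on the next Julian day (2451546): zero nanoseconds, date + 1.
next_midnight = (0).to_bytes(8, "little") + (2451546).to_bytes(4, "little")

le_orders_correctly = little_endian_key(example) < little_endian_key(next_midnight)
be_orders_correctly = big_endian_key(example) < big_endian_key(next_midnight)
# Comparing the reversed byte strings is the byte-by-byte comparison from the end.
reversed_orders_correctly = example[::-1] < next_midnight[::-1]
```

Here the little-endian key and the reversed-bytes comparison both place the example before the next midnight, while the storage-order interpretation places it after.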
Thanks for the clarification! An Int96 intrinsic hardware type would have handled the value as is. Otherwise a byte-by-byte comparison in the reverse order is needed.