ORC: fix date metrics to adjust milliseconds to local timezone #1127
Conversation
524e5cd to 30a5884 (force-push)
@edgarRd, could you please fix the CI build? It looks like it doesn't pass.
The CI failures are style checks:
30a5884 to dfa545b (force-push)
Sorry about that, checkstyle errors after the last change.
@edgarRd, I ran your branch on my local machine and it passed the build, thanks for fixing!
@edgarRd, can you summarize what the problem was and what this does to fix it?
dfa545b to b1fa8e6 (force-push)
@rdblue I've simplified this PR and avoided disabling date column filter pushdown, as that's not really the problem. There are actually 2 related problems. This PR solves one of them; I've added a comment to mention problem 2 in b1fa8e6 and fixed some tests that changed the default TZ and did not reset it to the original default (82d46d3).
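(For illustration, a minimal sketch of the test-cleanup pattern described above, assuming JUnit 4; the class, test, and zone names are hypothetical, not the actual tests touched in 82d46d3.)

```java
import java.util.TimeZone;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class TestUnderDifferentTimeZone {
  private TimeZone originalDefault;

  @Before
  public void saveDefaultTimeZone() {
    originalDefault = TimeZone.getDefault();
  }

  @After
  public void restoreDefaultTimeZone() {
    // Without this reset, the zone set below leaks into every later test in the JVM.
    TimeZone.setDefault(originalDefault);
  }

  @Test
  public void testDateHandlingInOtherZone() {
    TimeZone.setDefault(TimeZone.getTimeZone("Asia/Kolkata"));
    // ... exercise the date-handling code under this zone ...
  }
}
```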
@rdblue @shardulm94 PTAL whenever you have a chance.
Thanks for the summary, @edgarRd. I find this behavior concerning, so I think it will help to cover some background and set the context for specific comments.

As a storage layer, Iceberg should always read and return the exact data value that was written. For example, with floating point values there is no "close enough": Iceberg should read the same bits that were written. For date/time types, Iceberg should likewise never modify a value; concerns like translating from some time zone into the UTC value to store are handled before values reach the storage layer.

It looks like ORC uses representations that are timezone-dependent in some cases, like returning, and possibly calculating, stats. I think it is good that we don't use these in the Iceberg readers or writers, and my first approach to fixing this would be to avoid having data passed using those objects.
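(As an aside, a tiny sketch, not from this PR, of what reading "the same bits" means for floats: two values can compare as numerically equal while their stored bit patterns differ.)

```java
public class BitExactExample {
  public static void main(String[] args) {
    float written = -0.0f;
    float read = 0.0f;
    System.out.println(written == read);                 // true: numerically equal
    System.out.println(Float.floatToRawIntBits(written)
        == Float.floatToRawIntBits(read));               // false: different bit patterns
  }
}
```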
```java
min = Optional.ofNullable(((DateColumnStatistics) columnStats).getMinimum())
    .map(minStats -> DateTimeUtil.daysFromDate(
        DateTimeUtil.EPOCH.plus(minStats.getTime(), ChronoUnit.MILLIS).toLocalDate()))
    .map(minStats -> DateTimeUtil.daysFromDate(((Date) minStats).toLocalDate()))
```
Javadoc for Date.toLocalDate says that the LocalDate returned is "in local time zone". It is implemented like this:

```java
return LocalDate.of(getYear() + 1900, getMonth() + 1, getDate());
```

And those getter methods use the current zone, so this cannot be correct: the value depends on the local zone. I think your intent was to account for the adjustment made by ORC that you referenced by saying "Getting the min/max stats for Date types in ORC are returned with milliseconds adjusted to the local timezone."

I think the main problem is that this uses a path where the value is modified by ORC based on the JVM environment, which breaks a fundamental rule in Iceberg. I don't think that we should use a path where the value is modified; it's just too hard to guarantee that, although the value was modified, we were able to correctly undo it.
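(For illustration only, not code from this PR: a minimal demonstration that java.sql.Date.toLocalDate() depends on the JVM's default time zone, so the same millisecond value can map to two different local dates.)

```java
import java.sql.Date;
import java.time.LocalDate;
import java.util.TimeZone;

public class DateToLocalDateDemo {
  public static void main(String[] args) {
    long epochMillis = 0L;  // 1970-01-01T00:00:00Z

    TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
    LocalDate utc = new Date(epochMillis).toLocalDate();                  // 1970-01-01

    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"));
    LocalDate la = new Date(epochMillis).toLocalDate();                   // 1969-12-31

    System.out.println(utc + " vs " + la);
  }
}
```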
> I think your intent was to account for the adjustment made by ORC that you referenced by saying "Getting the min/max stats for Date types in ORC are returned with milliseconds adjusted to the local timezone."

Yes, ORC modifies the internally stored stats and returns values that depend on the timezone. These methods adjust for that.

> I think the main problem is that this uses a path where the value is modified by ORC based on the JVM environment, which breaks a fundamental rule in Iceberg. I don't think that we should use a path where the value is modified. It's just too hard to guarantee that although the value was modified, we were able to correctly undo it.

I agree, ORC violates the rule in this respect and modifies the stored value. I'm okay with not using this path and therefore removing Date metrics collection, if that's okay. Does that sound good?
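(A rough sketch, not the code ultimately merged, of the adjustment being discussed: ORC's DateColumnStatistics returns min/max as java.util.Date values whose milliseconds reflect the JVM's default zone, so the day value is recovered by interpreting those milliseconds in that same zone rather than in UTC. The helper name daysFromEpoch is hypothetical.)

```java
import java.time.Instant;
import java.time.ZoneId;
import java.util.Date;

public class OrcDateStatsAdjustment {
  // Interpret the millisecond value in the default zone, undoing ORC's local-zone shift,
  // then convert the resulting local date to days since the Unix epoch.
  static int daysFromEpoch(Date orcStatValue) {
    return (int) Instant.ofEpochMilli(orcStatValue.getTime())
        .atZone(ZoneId.systemDefault())
        .toLocalDate()
        .toEpochDay();
  }
}
```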
I'm okay with either of two options:

- Don't collect date metrics until this is fixed in ORC
- Use reflection to pull the `minimum` and `maximum` private int fields out of the stats

Option 2 seems reasonable enough to me, since the stats actually store the int min and max values that we're looking for.
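(A hedged sketch of what option 2 could look like; the field names "minimum" and "maximum" come from the comment above and are assumptions about ORC's statistics implementation class.)

```java
import java.lang.reflect.Field;

import org.apache.orc.DateColumnStatistics;

public class DateStatsReflection {
  // Reflectively read a private int field from the concrete statistics implementation,
  // e.g. readPrivateIntField(dateStats, "minimum").
  static int readPrivateIntField(DateColumnStatistics stats, String fieldName) {
    try {
      Field field = stats.getClass().getDeclaredField(fieldName);
      field.setAccessible(true);
      return field.getInt(stats);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot read ORC date statistic: " + fieldName, e);
    }
  }
}
```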
I just had a chat with @omalley about this issue and he's going to add a getter for the underlying values.
@rdblue great! I've merged the change. Thanks!
Merged to master. Thanks @edgarRd!
This PR attempts to solve issues #1116 and #1113 when running in different time zones.

I've added a test that checks a local scan under different time zones, which surfaced an issue with pushdown predicates on DATE columns: they seem to use an incorrect TZ when reading the metrics within the ORC reader. Because of this, ORC lower/upper bound metrics for Date are disabled in Iceberg as well.

I've tested by changing the system TZ to other time zones and running the tests, and they pass.
PTAL @shardulm94 @chenjunjiedada if you have a chance, please pull this branch and run the test suite locally just to double check the build. Thanks!
@rdblue