Skip to content

Conversation

@edgarRd
Copy link
Contributor

@edgarRd edgarRd commented Jun 19, 2020

This PR attempts to solve issues #1116 and #1113 when running in different Time Zones.

I've added a test to check for local scan on different timezones which surfaced an issue with the pushdown predicates with DATE columns, which seem to use incorrect incorrect TZ to read the metrics within the ORC reader. Due to this, disabling ORC lower/upper bound metrics for Date in Iceberg as well.

I've tested by changing system TZ to other timezones and running the tests and they pass.

PTAL @shardulm94 @chenjunjiedada if you have a chance, please pull this branch and run the test suite locally just to double check the build. Thanks!

@rdblue

@edgarRd edgarRd force-pushed the orc-metrics-fix-date-conversion branch from 524e5cd to 30a5884 Compare June 19, 2020 01:02
@chenjunjiedada
Copy link
Collaborator

@edgarRd , could you please fix CI build? Looks like it doesn't pass the build.

@rdblue
Copy link
Contributor

rdblue commented Jun 19, 2020

The CI failures are style checks:

[ant:checkstyle] [ERROR] /home/travis/build/apache/iceberg/orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java:24:8: Unused import - java.sql.Date. [UnusedImports]
[ant:checkstyle] [ERROR] /home/travis/build/apache/iceberg/orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java:46:8: Unused import - org.apache.iceberg.util.DateTimeUtil. [UnusedImports]
[ant:checkstyle] [ERROR] /home/travis/build/apache/iceberg/orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java:49:8: Unused import - org.apache.orc.DateColumnStatistics. [UnusedImports]

@edgarRd edgarRd force-pushed the orc-metrics-fix-date-conversion branch from 30a5884 to dfa545b Compare June 19, 2020 18:22
@edgarRd
Copy link
Contributor Author

edgarRd commented Jun 19, 2020

Sorry about that, checkstyle errors after last change.

@chenjunjiedada
Copy link
Collaborator

@edgarRd, I ran your branch on my local machine and it passed the build, thanks for fixing!

@rdblue
Copy link
Contributor

rdblue commented Jun 22, 2020

@edgarRd, can you summarize what the problem was and what this does to fix it?

@edgarRd edgarRd force-pushed the orc-metrics-fix-date-conversion branch from dfa545b to b1fa8e6 Compare June 23, 2020 01:05
@edgarRd
Copy link
Contributor Author

edgarRd commented Jun 23, 2020

@rdblue I've simplified this PR and avoided disabling date column filter pushdown as that's not really the problem.

There're actually 2 problems, which are related:

  1. Getting the min/max stats for Date types in ORC are returned with milliseconds adjusted to the local timezone. This means that timezones ahead of epoch would have less milliseconds to the date, therefore when converting to days from epoch they would be 1 day less if EPOCH.plus is used for the conversion as this one is UTC based.
  2. In ORC the timezone used for adjusting the min/max metrics milliseconds seems to be stored at the Thread level and looks like it's not reset, hence tests switching multiple timezones fail since it appears that the TZ is not updated in ORC - Master build failed on local #1116 (comment).

This PR is solving:
a6fc688 Problem 1 - By using the returned Date object from ORC and calling toLocalDate which performs normalization to the local timezone.

I've added a comment to mention problem 2 in b1fa8e6 and fixed some tests that changed the default TZ and did not reset it to the current default one 82d46d3.

@edgarRd edgarRd changed the title ORC metrics: disable date column filter pushdown due to ORC incorrect filtering ORC: fix ORC date metrics to adjust milliseconds to local timezone Jun 23, 2020
@edgarRd edgarRd changed the title ORC: fix ORC date metrics to adjust milliseconds to local timezone ORC: fix date metrics to adjust milliseconds to local timezone Jun 23, 2020
@edgarRd
Copy link
Contributor Author

edgarRd commented Jun 26, 2020

@rdblue @shardulm94 PTAL whenever you have a chance.

@rdblue
Copy link
Contributor

rdblue commented Jun 26, 2020

Thanks for the summary, @edgarRd. I find this behavior concerning. I think it will help to cover some background, just to cover the context for specific comments.

As a storage layer, Iceberg should always read and return the exact data value that was written. For example, with floating point values, there is no "close enough" and Iceberg should read the same bits that were written. That's why we use assertEquals instead of assertions that a value is within some floating point error.

For date/time types, Iceberg should never modify a value. Concerns like translating from some time zone into the UTC value to store for timestamptz must be handled by processing engines, which handle calendars time zone offsets, and other concerns. Iceberg stores the UTC value without modification, and the data type tells the engine how to produce a value to write and consume a value that's read. This is why the internal representation is microseconds from the Unix epoch in UTC. That's unambiguous.

It looks like ORC uses representations that are timezone-dependent in some cases, like returning, and possibly calculating, stats. I think that it is good that we don't use these in the Iceberg readers or writers, and my first approach to fixing this would be to avoid having data passed using those objects.

min = Optional.ofNullable(((DateColumnStatistics) columnStats).getMinimum())
.map(minStats -> DateTimeUtil.daysFromDate(
DateTimeUtil.EPOCH.plus(minStats.getTime(), ChronoUnit.MILLIS).toLocalDate()))
.map(minStats -> DateTimeUtil.daysFromDate(((Date) minStats).toLocalDate()))
Copy link
Contributor

@rdblue rdblue Jun 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc for Date.toLocalDate says that the LocalDate returned is "in local time zone". It is implemented like this:

return LocalDate.of(getYear() + 1900, getMonth() + 1, getDate());

And, those getter methods use the current zone. So this cannot be correct because the value is dependent on the local zone. I think your intent was to account for the adjustment made by ORC that you referenced by saying "Getting the min/max stats for Date types in ORC are returned with milliseconds adjusted to the local timezone."

I think the main problem is that this uses a path where the value is modified by ORC based on the JVM environment, which breaks a fundamental rule in Iceberg. I don't think that we should use a path where the value is modified. It's just too hard to guarantee that although the value was modified, we were able to correctly undo it.

Copy link
Contributor Author

@edgarRd edgarRd Jun 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your intent was to account for the adjustment made by ORC that you referenced by saying "Getting the min/max stats for Date types in ORC are returned with milliseconds adjusted to the local timezone."

Yes, ORC modifies the internally stored stats and returns the values dependent on the timezone. These methods adjust for that.

I think the main problem is that this uses a path where the value is modified by ORC based on the JVM environment, which breaks a fundamental rule in Iceberg. I don't think that we should use a path where the value is modified. It's just too hard to guarantee that although the value was modified, we were able to correctly undo it.

I agree, ORC in this respect violates the rule and modifies the stored value. I'm okay with not using this path, and therefore removing Date metrics collection if that's okay. Does that sound good?

Copy link
Contributor

@rdblue rdblue Jun 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with either of two options:

  1. Don't collect date metrics until this is fixed in ORC
  2. Use reflection to pull the minimum and maximum private int fields out of the stats

Option 2 seems reasonable enough to me, since the stats actually store the int min and max values that we're looking for.

@rdblue
Copy link
Contributor

rdblue commented Jun 26, 2020

I just had a chat with @omalley about this issue and he's going to add a getter for the underlying values, getMinimumDays and getMaximumDays. That way, we won't need to go through a type with behavior dependent on the time zone and we will avoid the adjustment problems.

@rdblue
Copy link
Contributor

rdblue commented Jun 27, 2020

@edgarRd, I created edgarRd#1 against your branch with the reflection fix. It passes all of the tests in this PR and the stats tests.

@edgarRd
Copy link
Contributor Author

edgarRd commented Jun 27, 2020

@rdblue great! I've merged the change. Thanks!

@rdblue rdblue merged commit 4e282c3 into apache:master Jun 27, 2020
@rdblue
Copy link
Contributor

rdblue commented Jun 27, 2020

Merged to master. Thanks @edgarRd!

cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants