-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29275: Stats autogather calculates the min statistic incorrectly #6177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Implement an optimized compare method for Decimal
d00faae to
584fed5
Compare
|
...ore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/DecimalComparator.java
Show resolved
Hide resolved
...ore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/DecimalComparator.java
Show resolved
Hide resolved
...server/src/test/java/org/apache/hadoop/hive/metastore/columnstats/DecimalComparatorTest.java
Show resolved
Hide resolved
| int compareInner(Decimal d1, Decimal d2, boolean useFallback, ApproachInfoCallback aic) { | ||
| byte[] unscaled1 = d1.getUnscaled(); | ||
| byte[] unscaled2 = d2.getUnscaled(); | ||
| boolean notNegative1 = isNotNegative(unscaled1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
about naming on overall: isn't it better to understand isPositive instead of isNotNegative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not the same:
- isPositive: 0 < x
- isNotNegative 0 <= x
So the question is how 0 is handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately isPositive is not the same as isNotNegative, as isPositive(0) is false, while isNotNegative(0) is true. At least that's the standard use of the terms.
|
|
||
| @Override | ||
| public int compare(Decimal o1, Decimal o2) { | ||
| return compareInner(o1, o2, true, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is the entry point of the comparing itself. Is it possible to add some traditional checks like null checking and reference equality check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll add the checks.
...server/src/test/java/org/apache/hadoop/hive/metastore/columnstats/DecimalComparatorTest.java
Show resolved
Hide resolved
|
|
||
| HiveDecimal result = HiveDecimal.create(new BigInteger(val.getUnscaled()), val.getScale()); | ||
| return (result != null) ? result.toString() : ""; | ||
| return MetaStoreServerUtils.decimalToString(val); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason of moving this logic to MetaStoreServerUtils?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ShowUtils is not accessible from DecimalColumnStatsMergerTest. Moving the logic of the method to MetaStoreServerUtils allows it to use it in the test as well, avoiding duplication of the same logic in different places.
...ore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/DecimalComparator.java
Show resolved
Hide resolved
...ore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/DecimalComparator.java
Show resolved
Hide resolved
|
Thank you for the reviews! As @kasakrisz observed, the comparator will not be called frequently, so maybe the custom comparator is not justifiable. I'll prepare another PR using a simple HiveDecimal.compareTo, once #6189 has been merged (as both deal with statistics). |
...ore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/DecimalComparator.java
Show resolved
Hide resolved
| * | ||
| * <p>In that case it is enough to compare their unscaled value (or significand). | ||
| */ | ||
| private static int compareSameScale(byte pad, byte[] b1, byte[] b2, ApproachInfoCallback aic) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add some explanation about the algorithm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what there is to say besides that the algorithm "compare[s] their unscaled value".



What changes were proposed in this pull request?
As described in HIVE-29275, the compareTo method of Decimal leads to wrong result. This PR proposes a custom comparator handling many cases with a fallback to BigDecimal's compareTo.
Why are the changes needed?
Fixes the stats processed by DecimalColumnStatsMerger.
Does this PR introduce any user-facing change?
Yes, might change the statistics.
How was this patch tested?
Unit tests, qfile tests.