[SPARK-17495] [SQL] Add more tests for hive hash #17049
ok to test
Test build #73395 has finished for PR 17049 at commit
def checkHiveHash(value: Any, dataType: DataType, expected: Long): Unit = {
  // Note: all expected hashes need to be computed using Hive 1.2.1
  val actual = HiveHashFunction.hash(value, dataType, seed = 0)
  assert(actual == expected)
We should add a clue; otherwise we will never be able to tell what's going on if the tests fail on those randomized values.
withClue(s"value is $value") {
assert(..
}
Added clue
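To illustrate what the clue buys when an input is randomized, here is a minimal, dependency-free sketch. It is not the PR's test code: `toyHash` is a hypothetical stand-in for the function under test, and the local `withClue` only imitates the behaviour of ScalaTest's helper of the same name by prefixing the failure message with the clue.

```scala
object ClueSketch {
  // Hypothetical stand-in for the function under test (NOT HiveHashFunction).
  def toyHash(value: Int): Int = 31 * value + 7

  // Minimal imitation of a clue helper: if the body fails an assertion,
  // rethrow with the clue prepended so the failing input is visible.
  def withClue[T](clue: String)(body: => T): T =
    try body
    catch {
      case e: AssertionError =>
        throw new AssertionError(s"$clue: ${e.getMessage}", e)
    }

  // Same shape as the reviewed helper: compare actual against expected,
  // but report which value was being hashed when the comparison fails.
  def check(value: Int, expected: Int): Unit =
    withClue(s"value is $value") {
      assert(toyHash(value) == expected)
    }
}
```

With this in place, a failing `ClueSketch.check(2, 0)` reports a message that starts with `value is 2`, rather than only a bare assertion failure.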
Looks good except that comment.
  val length = struct.numFields
  while (i < length) {
-   result = (31 * result) + hash(struct.get(i, types(i)), types(i), seed + 1).toInt
+   result = (31 * result) + hash(struct.get(i, types(i)), types(i), 0).toInt
Could you explain the reason?
The seed is used by the murmur3 hash; hive hash does not need it. See the original implementation in the Hive codebase: https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L638
Since the hashing methods in Spark already take a seed, I had to add it to hive-hash as well. Whenever I compute the hash, the seed must be 0, which is what this change does: the recursive call for each struct field now passes 0 instead of seed + 1.
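The seedless combining rule from the diff can be sketched in a few lines. This is a simplified illustration, not Spark's `HiveHashFunction`: the object name `HiveHashSketch`, the use of `Seq` to stand in for a struct, and the per-type rules for primitives and strings are assumptions chosen to resemble Java-style hash codes. Only the `31 * result + fieldHash` step is taken from the diff above.

```scala
import java.nio.charset.StandardCharsets

object HiveHashSketch {
  // No seed parameter anywhere: every nested value is hashed from scratch.
  def hash(value: Any): Int = value match {
    case null       => 0
    case b: Boolean => if (b) 1 else 0
    case i: Int     => i
    // Assumption: fold the upper 32 bits into the lower 32 bits.
    case l: Long    => (l ^ (l >>> 32)).toInt
    // Assumption: polynomial hash over the UTF-8 bytes (bytes are signed).
    case s: String  =>
      s.getBytes(StandardCharsets.UTF_8).foldLeft(0)((acc, b) => 31 * acc + b)
    // Struct stand-in: combine field hashes with the rule shown in the diff.
    case fields: Seq[_] =>
      fields.foldLeft(0)((acc, field) => 31 * acc + hash(field))
    case other =>
      throw new IllegalArgumentException(s"unsupported value: $other")
  }
}
```

For example, `HiveHashSketch.hash(Seq(1, 2))` evaluates to `31 * (31 * 0 + 1) + 2`, which is 33. Because there is no seed to thread through, a nested struct hashes to the same value wherever it appears, which is the property a comparison against Hive-generated expected values relies on.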
Jenkins retest this please
Test build #73398 has started for PR 17049 at commit
Jenkins retest this please
The failure in the last run was from SparkR tests. All SQL tests had passed.
Merging in master.
## What changes were proposed in this pull request?
This PR adds tests for hive-hash by comparing its outputs against those generated by Hive 1.2.1. The following datatypes are covered by this PR:
- null
- boolean
- byte
- short
- int
- long
- float
- double
- string
- array
- map
- struct

Datatypes that I have _NOT_ covered but will work on separately are:
- Decimal (handled separately in apache#17056)
- TimestampType
- DateType
- CalendarIntervalType

## How was this patch tested?
NA

Author: Tejas Patil <tejasp@fb.com>
Closes apache#17049 from tejasapatil/SPARK-17495_remaining_types.