
Conversation

@tejasapatil
Contributor

@tejasapatil tejasapatil commented Feb 24, 2017

What changes were proposed in this pull request?

This PR adds tests for hive-hash by comparing its output against Hive 1.2.1. The following datatypes are covered by this PR:

  • null
  • boolean
  • byte
  • short
  • int
  • long
  • float
  • double
  • string
  • array
  • map
  • struct
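For context, a rough sketch of the per-type rules these tests pin down, mirroring the Java logic in Hive 1.2.1's ObjectInspectorUtils.hashCode (the class and helper names here are illustrative, not Spark or Hive API):

```java
// Hedged sketch of hive-hash rules for a few primitives, based on
// Hive 1.2.1's ObjectInspectorUtils.hashCode. Names are illustrative.
public class HiveHashSketch {
    // Hive hashes booleans to 1/0, not Java's Boolean.hashCode (1231/1237).
    static int hashBoolean(boolean v) { return v ? 1 : 0; }

    // Ints hash to themselves.
    static int hashInt(int v) { return v; }

    // Longs fold the high word into the low word, like java.lang.Long.
    static int hashLong(long v) { return (int) (v ^ (v >>> 32)); }

    public static void main(String[] args) {
        System.out.println(hashBoolean(true));   // 1
        System.out.println(hashInt(42));         // 42
        System.out.println(hashLong(1L << 32));  // 1
    }
}
```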

Datatypes that I have NOT covered but will work on separately are:

  • Decimal (handled separately in #17056)
  • TimestampType
  • DateType
  • CalendarIntervalType

How was this patch tested?

NA

@tejasapatil
Contributor Author

ok to test

@tejasapatil tejasapatil changed the title [SPARK-17495] Add more tests for hive hash [SPARK-17495] [SQL] Add more tests for hive hash Feb 24, 2017
@SparkQA

SparkQA commented Feb 24, 2017

Test build #73395 has finished for PR 17049 at commit c589350.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def checkHiveHash(value: Any, dataType: DataType, expected: Long): Unit = {
  // Note: all expected hashes need to be computed using Hive 1.2.1
  val actual = HiveHashFunction.hash(value, dataType, seed = 0)
  assert(actual == expected)
}
Contributor

We should add a clue; otherwise we will never be able to tell what's going on if the tests fail on those randomized values.

withClue(s"value is $value") {
  assert(actual == expected)
}

Contributor Author

Added clue

@rxin
Contributor

rxin commented Feb 24, 2017

Looks good except that comment.

  val length = struct.numFields
  while (i < length) {
-   result = (31 * result) + hash(struct.get(i, types(i)), types(i), seed + 1).toInt
+   result = (31 * result) + hash(struct.get(i, types(i)), types(i), 0).toInt
Member

Could you explain the reason?

Contributor Author

The seed is something used in murmur3 hash; hive-hash does not need it. See the original impl in the Hive codebase: https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L638

Since Spark's hashing-related methods already take a seed, I had to add one to hive-hash as well. When computing the hash, I always set the seed to 0, which is what is done here.
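To illustrate why no seed is threaded through, a minimal sketch of the struct-combining step as Hive does it (plain `31 * result + fieldHash` folding; the class and method names here are illustrative, not Hive API):

```java
// Hedged sketch: hive-hash combines struct field hashes with
// 31 * result + fieldHash, starting from 0 and carrying no seed
// (unlike Spark's murmur3 path). Names are illustrative.
public class StructHashSketch {
    static int hashStruct(int[] fieldHashes) {
        int result = 0;
        for (int h : fieldHashes) {
            result = 31 * result + h;
        }
        return result;
    }

    public static void main(String[] args) {
        // Fields hashing to 1 and 2 combine as 31 * (31 * 0 + 1) + 2.
        System.out.println(hashStruct(new int[] {1, 2}));  // 33
    }
}
```

Because `result` always starts at 0, passing any seed other than 0 into the recursive `hash` call would diverge from Hive's output, which is why the call sites fix it to 0.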

@tejasapatil
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Feb 24, 2017

Test build #73398 has started for PR 17049 at commit c31b2b0.

@tejasapatil
Contributor Author

Jenkins retest this please

The failure in the last run was from SparkR tests. All SQL tests had passed.

@rxin
Contributor

rxin commented Feb 24, 2017

Merging in master.

@asfgit asfgit closed this in 3e40f6c Feb 24, 2017
@tejasapatil tejasapatil deleted the SPARK-17495_remaining_types branch February 24, 2017 17:52
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017

Author: Tejas Patil <tejasp@fb.com>

Closes apache#17049 from tejasapatil/SPARK-17495_remaining_types.
