[SPARK-17495] [SQL] Add more tests for hive hash #17049

tejasapatil · 2017-02-24T05:16:50Z

What changes were proposed in this pull request?

This PR adds tests for hive-hash by comparing its output against hashes generated with Hive 1.2.1. The following datatypes are covered by this PR:

null
boolean
byte
short
int
long
float
double
string
array
map
struct

Datatypes that I have NOT covered here, but will work on separately, are:

Decimal (handled separately in [SPARK-17495] [SQL] Support Decimal type in Hive-hash #17056)
TimestampType, DateType, CalendarIntervalType are handled in [SPARK-17495] [SQL] Support date, timestamp and interval types in Hive hash #17062

How was this patch tested?

NA

tejasapatil · 2017-02-24T05:16:56Z

ok to test

SparkQA · 2017-02-24T05:19:19Z

Test build #73395 has finished for PR 17049 at commit c589350.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2017-02-24T05:20:02Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala

+  def checkHiveHash(value: Any, dataType: DataType, expected: Long): Unit = {
+    // Note : All expected hashes need to be computed using Hive 1.2.1
+    val actual = HiveHashFunction.hash(value, dataType, seed = 0)
+    assert(actual == expected)


We should add a clue; otherwise we will never be able to tell what's going on if the tests fail on those randomized values.

withClue(s"value is $value") { assert(.. }
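For illustration, the suggestion above can be sketched without any test-framework dependency. The `withClue` below is a hypothetical stand-in that only mimics the idea of ScalaTest's `withClue` (prefixing a clue to the failure message); `ClueSketch` and `checkHash` are made-up names, not code from this PR.

```scala
object ClueSketch {
  // Hypothetical stand-in for ScalaTest's withClue: runs the body and,
  // if an assertion fails, rethrows with the clue prefixed to the message.
  def withClue[T](clue: Any)(body: => T): T =
    try body
    catch {
      case e: AssertionError =>
        throw new AssertionError(s"$clue ${e.getMessage}", e)
    }

  // Assumed helper: compares an already computed hash with the expected one
  // and reports the input value when they differ.
  def checkHash(value: Any, actual: Int, expected: Int): Unit =
    withClue(s"value is $value") {
      assert(actual == expected, s"expected $expected but got $actual")
    }
}
```

With a wrapper like this, a failure on a randomized input reports the input itself, so the failing case can be reproduced.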

rxin · 2017-02-24T05:21:37Z

Looks good except that comment.

gatorsmile · 2017-02-24T05:32:00Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala

        val length = struct.numFields
        while (i < length) {
-          result = (31 * result) + hash(struct.get(i, types(i)), types(i), seed + 1).toInt
+          result = (31 * result) + hash(struct.get(i, types(i)), types(i), 0).toInt


Could you explain the reason?

The seed is used by the murmur3 hash; hive hash does not need it. See the original implementation in the Hive codebase: https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L638

Since the hashing methods in Spark already took a seed, I had to add it to hive-hash. When I compute the hash, I always need to set the seed to 0, which is what is done here.
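To illustrate why no seed is needed, here is a simplified sketch of Hive-style hashing in plain Scala. It is not the code in this PR or in Hive: `Struct`, `HiveHashSketch`, and `hiveHash` are hypothetical names, values are plain Scala objects rather than Catalyst internals, and only the types listed in the description are handled.

```scala
// Hypothetical struct wrapper; fields are hashed in order.
final case class Struct(fields: Seq[Any])

object HiveHashSketch {
  // Fold a 64-bit value into 32 bits, as java.lang.Long.hashCode does.
  private def foldLong(l: Long): Int = ((l >>> 32) ^ l).toInt

  // Note that no branch takes or uses a seed.
  def hiveHash(value: Any): Int = value match {
    case null       => 0
    case b: Boolean => if (b) 1 else 0
    case b: Byte    => b.toInt
    case s: Short   => s.toInt
    case i: Int     => i
    case l: Long    => foldLong(l)
    case f: Float   => java.lang.Float.floatToIntBits(f)
    case d: Double  => foldLong(java.lang.Double.doubleToLongBits(d))
    case s: String  =>
      // Hash over the UTF-8 bytes, treating each byte as signed.
      s.getBytes(java.nio.charset.StandardCharsets.UTF_8)
        .foldLeft(0)((h, b) => 31 * h + b)
    case Struct(fields) =>
      fields.foldLeft(0)((h, f) => 31 * h + hiveHash(f))
    case xs: Seq[_] =>
      xs.foldLeft(0)((h, x) => 31 * h + hiveHash(x))
    case m: Map[_, _] =>
      // Order-independent: sum of key hash XOR value hash.
      m.iterator.map { case (k, v) => hiveHash(k) ^ hiveHash(v) }.sum
    case other =>
      throw new IllegalArgumentException(s"unsupported value: $other")
  }
}
```

Because nested fields are hashed by the same seedless recursion, there is nothing to thread through for structs, which matches passing a constant 0 instead of `seed + 1` in the diff above.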

tejasapatil · 2017-02-24T05:45:53Z

Jenkins retest this please

SparkQA · 2017-02-24T05:47:35Z

Test build #73398 has started for PR 17049 at commit c31b2b0.

tejasapatil · 2017-02-24T08:11:59Z

Jenkins retest this please

The failure in the last run was from SparkR tests. All SQL tests passed.

rxin · 2017-02-24T17:46:19Z

Merging in master.

## What changes were proposed in this pull request?

This PR adds tests for hive-hash by comparing its output against hashes generated with Hive 1.2.1. The following datatypes are covered by this PR:

- null
- boolean
- byte
- short
- int
- long
- float
- double
- string
- array
- map
- struct

Datatypes that I have _NOT_ covered but I will work on separately are:

- Decimal (handled separately in apache#17056)
- TimestampType
- DateType
- CalendarIntervalType

## How was this patch tested?

NA

Author: Tejas Patil <tejasp@fb.com>

Closes apache#17049 from tejasapatil/SPARK-17495_remaining_types.

Add more tests for hive hash

c589350

tejasapatil changed the title ~~[SPARK-17495] Add more tests for hive hash~~ [SPARK-17495] [SQL] Add more tests for hive hash Feb 24, 2017

tejasapatil mentioned this pull request Feb 24, 2017

[SPARK-17495] [SQL] Add Hash capability semantically equivalent to Hive's #15047

Closed

rxin reviewed Feb 24, 2017


gatorsmile reviewed Feb 24, 2017


review apache#1

c31b2b0

asfgit closed this in 3e40f6c Feb 24, 2017

tejasapatil deleted the SPARK-17495_remaining_types branch February 24, 2017 17:52
