[SPARK-23643][CORE][SQL][ML] Shrinking the buffer in hashSeed up to size of the seed parameter #20793
Conversation
Good catch, LGTM.

Jenkins, ok to test
Does the result change? A quick check in the REPL:

```scala
scala> import java.nio.ByteBuffer
import java.nio.ByteBuffer

scala> import scala.util.hashing.MurmurHash3
import scala.util.hashing.MurmurHash3

scala> def hashSeed(seed: Long): Long = {
     |   val bytes = ByteBuffer.allocate(java.lang.Long.SIZE).putLong(seed).array()
     |   val lowBits = MurmurHash3.bytesHash(bytes)
     |   val highBits = MurmurHash3.bytesHash(bytes, lowBits)
     |   (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
     | }
hashSeed: (seed: Long)Long

scala> hashSeed(100)
res3: Long = 852394178374189935

scala> def hashSeed2(seed: Long): Long = {
     |   val bytes = ByteBuffer.allocate(java.lang.Long.BYTES).putLong(seed).array()
     |   val lowBits = MurmurHash3.bytesHash(bytes)
     |   val highBits = MurmurHash3.bytesHash(bytes, lowBits)
     |   (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
     | }
hashSeed2: (seed: Long)Long

scala> hashSeed2(100)
res7: Long = 1088402058313200430
```
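The two definitions differ only in the buffer size: `java.lang.Long.SIZE` is the width of a `Long` in bits (64), while `java.lang.Long.BYTES` is its width in bytes (8), so the first version hashes 56 trailing zero bytes along with the seed. A minimal sketch of the difference:

```scala
import java.nio.ByteBuffer

object BufferSizes extends App {
  // Long.SIZE is the bit width (64); Long.BYTES is the byte width (8).
  val oversized = ByteBuffer.allocate(java.lang.Long.SIZE).putLong(100L).array()
  val exact = ByteBuffer.allocate(java.lang.Long.BYTES).putLong(100L).array()

  println(oversized.length) // 64 -- the trailing 56 bytes are always zero
  println(exact.length)     // 8  -- only the bytes of the seed itself
}
```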
Test build #88156 has finished for PR 20793 at commit
Ah, the results are different because the number of operations is different. It may be an issue like #20630. I am curious why tests fail when the seed is changed. Of course, I understand the sequence of random values must be reproducible for a given seed within a package or implementation.
At least some tests expect particular values as the result of sample/random: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala#L550-L564
The question is whether the exact output of pseudo-random/sample operations is guaranteed by the public API or not. Probably not. Here was an earlier attempt to make tests tolerant to the seed: #8314
Test build #88160 has finished for PR 20793 at commit
I am closing the PR because it changes external behavior. Maybe I will create a new one for Spark 3.0.

…t not be expected" This reverts commit 177afcc.
Test build #102090 has finished for PR 20793 at commit
Test build #102091 has finished for PR 20793 at commit
jenkins, retest this, please
Test build #102101 has finished for PR 20793 at commit
@srowen Somehow many ML tests depend on the exact values produced for a given seed.
The tests need to be deterministic, but you are correct that some tests go a little too far in asserting the exact result of the random sampling. If the result is still correct, it is reasonable to just update the new expected value. But it could be fine to loosen some tests too if they are overly specific without much value.
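Where exact values are not essential, one option is a tolerant comparison; a minimal sketch with ScalaTest/Scalactic (the suite name and values are illustrative, not from the PR):

```scala
import org.scalactic.{Equality, TolerantNumerics}
import org.scalatest.funsuite.AnyFunSuite

class TolerantSamplingSuite extends AnyFunSuite {
  // Compare doubles up to an epsilon instead of asserting exact bit patterns.
  implicit val doubleEq: Equality[Double] =
    TolerantNumerics.tolerantDoubleEquality(1e-8)

  test("statistic is stable regardless of the exact random sequence") {
    val expected = 0.5          // illustrative expected statistic
    val computed = 0.5 + 1e-10  // stand-in for a value computed from a sample
    assert(computed === expected) // passes: difference is within 1e-8
  }
}
```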
For example, tests like this ...
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala (outdated review thread, resolved)
Test build #103768 has started for PR 20793 at commit
test this please
Test build #103787 has finished for PR 20793 at commit
Test build #103795 has finished for PR 20793 at commit
srowen left a comment:
Almost there, just a few more items to check here
mllib/src/test/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringSuite.scala (outdated review thread, resolved)
```diff
  mlpTestDF <- df
  mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction"))
- expect_equal(head(mlpPredictions$prediction, 6), c("0.0", "1.0", "1.0", "1.0", "1.0", "1.0"))
+ expect_equal(head(mlpPredictions$prediction, 6), c("2.0", "2.0", "2.0", "2.0", "2.0", "2.0"))
```
This change concerns me; those predictions are all wrong now according to the data. It probably means the test was insufficient to begin with. I think the tolerance parameter is way too high; unset it if possible or use a much smaller value like 0.00001
I tried to set `tol = 0.00001`:

```r
> head(summary$weights, 5)
[[1]]
[1] -24.28415
[[2]]
[1] 107.8701
[[3]]
[1] 16.86376
[[4]]
[1] 1.103736
[[5]]
[1] 9.244488
> mlpTestDF <- df
> mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction"))
> head(mlpPredictions$prediction, 6)
[1] "1.0" "1.0" "1.0" "1.0" "0.0" "1.0"
```
If it is ok, I will commit this.
Much better -- those are actually the correct answers!
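For reference, the Scala ML API exposes the same knob via `setTol`; a hypothetical sketch (the layer sizes and the `train` DataFrame are illustrative, not taken from this PR):

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Tighter convergence tolerance, analogous to tol = 0.00001 in the SparkR test.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3)) // illustrative: input, two hidden, output layers
  .setSeed(11L)
  .setTol(1e-5)
  .setMaxIter(100)

// val model = mlp.fit(train)   // `train`: an assumed DataFrame with features/label
```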
Test build #103827 has finished for PR 20793 at commit
Test build #103838 has finished for PR 20793 at commit
Merged to master. That was a big, difficult change; thank you. This doesn't just fix the seed issue but also cleans up some tests along the way.
Thank you to everybody for your reviews, especially @srowen, who encouraged me to finish the PR.
…ize of the seed parameter

The hashSeed method allocates 64 bytes instead of 8. The other bytes are always zeros (thanks to the default behavior of ByteBuffer) and can be excluded from the hash calculation because they don't differentiate inputs.

Tested by running the existing tests - XORShiftRandomSuite.

Closes apache#20793 from MaxGekk/hash-buff-size.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?

This PR fixes the examples of `rand` and `randn`.

### Why are the changes needed?

SPARK-23643 (#20793) fixes an issue related to the seed, which changes the results of `rand` and `randn`. Now the results of `SELECT rand(0)` and `SELECT randn(null)` are `0.7604953758285915` and `1.6034991609278433` respectively, and they should be deterministic because the number of partitions is always 1 (the leaf node is `OneRowRelation`).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built the doc and confirmed it. [screenshot: rand-example.png]

Closes #32844 from sarutak/rand-example.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
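Those documented values can be checked from a Spark shell; a quick sketch (the expected outputs are the ones quoted in the commit message above, assuming a build that includes SPARK-23643):

```scala
// `spark` is the SparkSession provided by spark-shell.
// With a single partition (OneRowRelation), the seeded results are deterministic.
spark.sql("SELECT rand(0)").show(false)     // expect 0.7604953758285915
spark.sql("SELECT randn(null)").show(false) // expect 1.6034991609278433
```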
What changes were proposed in this pull request?

The hashSeed method allocates 64 bytes instead of 8. The other bytes are always zeros (thanks to the default behavior of ByteBuffer) and can be excluded from the hash calculation because they don't differentiate inputs.
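Reconstructed from the REPL snippets earlier in the thread, the fix reduces to a one-line change in the buffer allocation (shown as a diff sketch):

```diff
- val bytes = ByteBuffer.allocate(java.lang.Long.SIZE).putLong(seed).array()
+ val bytes = ByteBuffer.allocate(java.lang.Long.BYTES).putLong(seed).array()
```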
How was this patch tested?

By running the existing tests, `XORShiftRandomSuite`.