Conversation

@MaxGekk (Member) commented Mar 10, 2018

What changes were proposed in this pull request?

The hashSeed method allocates 64 bytes instead of 8. The remaining bytes are always zero (due to the default behavior of ByteBuffer), so they can be excluded from the hash calculation because they don't differentiate inputs.
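
A minimal sketch of the mix-up (illustrative only, not the actual Spark source): java.lang.Long.SIZE is the size of a Long in bits (64), while java.lang.Long.BYTES is its size in bytes (8), so allocating with SIZE produces a buffer eight times larger than needed.

import java.nio.ByteBuffer

// Long.SIZE is 64 (bits), so this allocates a 64-byte buffer; only the first
// 8 bytes carry the seed, and the remaining 56 bytes are zero padding.
val oversized = ByteBuffer.allocate(java.lang.Long.SIZE).putLong(42L).array()

// Long.BYTES is 8, which is exactly what a Long needs.
val exact = ByteBuffer.allocate(java.lang.Long.BYTES).putLong(42L).array()

println(oversized.length) // 64
println(exact.length)     // 8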

How was this patch tested?

By running the existing tests: XORShiftRandomSuite.

@kiszk (Member) commented Mar 10, 2018

Good catch, LGTM

@felixcheung (Member) commented:

Jenkins, ok to test

@viirya (Member) commented Mar 11, 2018

Does the hashSeed method produce the same hash value after this change?

scala> import java.nio.ByteBuffer
import java.nio.ByteBuffer

scala> import scala.util.hashing.MurmurHash3
import scala.util.hashing.MurmurHash3

scala> def hashSeed(seed: Long): Long = {
     |   val bytes = ByteBuffer.allocate(java.lang.Long.SIZE).putLong(seed).array()
     |   val lowBits = MurmurHash3.bytesHash(bytes)
     |   val highBits = MurmurHash3.bytesHash(bytes, lowBits)
     |   (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
     | }
hashSeed: (seed: Long)Long

scala> hashSeed(100)
res3: Long = 852394178374189935

scala> def hashSeed2(seed: Long): Long = {
     |   val bytes = ByteBuffer.allocate(java.lang.Long.BYTES).putLong(seed).array()
     |   val lowBits = MurmurHash3.bytesHash(bytes)
     |   val highBits = MurmurHash3.bytesHash(bytes, lowBits)
     |   (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
     | }
hashSeed2: (seed: Long)Long

scala> hashSeed2(100)
res7: Long = 1088402058313200430
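
The values differ because MurmurHash3 hashes every byte it is given, so the 56 trailing zero bytes of the oversized buffer still perturb the result even though they carry no information. A small illustrative sketch (not taken from the PR):

import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

// 8-byte big-endian encoding of 100L (what Long.BYTES produces)
val eightBytes = ByteBuffer.allocate(java.lang.Long.BYTES).putLong(100L).array()
// the same value followed by 56 zero-padding bytes (what Long.SIZE produces)
val sixtyFourBytes = ByteBuffer.allocate(java.lang.Long.SIZE).putLong(100L).array()

// Different byte sequences go into the hash, so the results generally differ.
println(MurmurHash3.bytesHash(eightBytes))
println(MurmurHash3.bytesHash(sixtyFourBytes))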

@SparkQA commented Mar 11, 2018

Test build #88156 has finished for PR 20793 at commit bb40ef2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member) commented Mar 11, 2018

Ah, the results are different since the number of operations is different. It may be an issue like #20630.

I am curious why the tests fail when the seed is changed. Of course, I understand that the sequence of rand must be reproducible with a certain seed value within a package or implementation.

@MaxGekk (Member, Author) commented Mar 11, 2018

At least some tests expect particular values as the result of sample/random: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala#L550-L564
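
For illustration, a hedged sketch of what such a seed-pinned assertion looks like (hypothetical names and values, not the actual DatasetSuite code, assuming a running SparkSession named spark): the expected rows are simply whatever the old hashSeed produced for that seed, so any change to hashSeed fails the literal expectation even though the sampling itself is still correct.

// Hypothetical seed-pinned test: the literal expectation encodes one particular hashSeed implementation.
val sampled = spark.range(100)
  .sample(withReplacement = false, fraction = 0.1, seed = 42)
  .collect()
  .toSeq
// assert(sampled == Seq(5L, 17L, 42L))  // hypothetical pinned values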

@MaxGekk (Member, Author) commented Mar 11, 2018

The question is whether the existing output of pseudo-random/sample is guaranteed by the public API or not. Probably not. Here was an attempt to make tests tolerant to the seed: #8314

@MaxGekk changed the title from "[SPARK-23643] Shrinking the buffer in hashSeed up to size of the seed parameter" to "[WIP][SPARK-23643] Shrinking the buffer in hashSeed up to size of the seed parameter" on Mar 11, 2018
@SparkQA commented Mar 11, 2018

Test build #88160 has finished for PR 20793 at commit 177afcc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Jul 17, 2018

I am closing the PR because it changes external behavior. Maybe I will create a new one for Spark 3.0.

@MaxGekk closed this on Jul 17, 2018
@MaxGekk reopened this on Feb 8, 2019
@SparkQA commented Feb 8, 2019

Test build #102090 has finished for PR 20793 at commit 177afcc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 8, 2019

Test build #102091 has finished for PR 20793 at commit 0d18fcd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Feb 8, 2019

jenkins, retest this, please

@SparkQA commented Feb 8, 2019

Test build #102101 has finished for PR 20793 at commit 0d18fcd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Feb 8, 2019

@srowen Somehow many ML tests depend on hashSeed. I will change the expected numbers in tests like DataFrameStatSuite.sampleBy, but I am wondering why other (especially ML) tests depend on the seed.

@srowen (Member) commented Feb 8, 2019

The tests need to be deterministic, but you are correct that some tests go a little too far in asserting the exact result of the random sampling. If the result is still correct, it is reasonable to just update the expected value. But it could also be fine to loosen some tests if they are overly specific without much value.

@srowen (Member) commented Feb 8, 2019

For example, tests like this:
org.scalatest.exceptions.TestFailedException: Expected 2.7115417862490827 and 2.7355261 to be within 0.001 using relative tolerance.
should just be resolved by increasing the tolerance, probably in all cases. We can review to be sure. But it's evidence that the (arbitrary) tolerance was too tight, as a correct implementation might vary its result by more, depending on the random seed.
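
As a rough sketch of what that relative-tolerance check means (an illustrative helper, not Spark's actual test utility):

// Relative-tolerance check of the kind the failing assertion performs.
def withinRelTol(actual: Double, expected: Double, relTol: Double): Boolean =
  math.abs(actual - expected) <= relTol * math.abs(expected)

println(withinRelTol(2.7115417862490827, 2.7355261, 0.001)) // false: 0.001 is too tight here
println(withinRelTol(2.7115417862490827, 2.7355261, 0.01))  // true: a looser bound passes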

@SparkQA commented Mar 21, 2019

Test build #103768 has started for PR 20793 at commit c758f57.

@shaneknapp (Contributor) commented:

test this please

@SparkQA commented Mar 21, 2019

Test build #103787 has finished for PR 20793 at commit c758f57.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 22, 2019

Test build #103795 has finished for PR 20793 at commit 0622e96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a review comment:

Almost there, just a few more items to check here.

  mlpTestDF <- df
  mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction"))
- expect_equal(head(mlpPredictions$prediction, 6), c("0.0", "1.0", "1.0", "1.0", "1.0", "1.0"))
+ expect_equal(head(mlpPredictions$prediction, 6), c("2.0", "2.0", "2.0", "2.0", "2.0", "2.0"))
Member:
This change concerns me; those predictions are all wrong now according to the data. It probably means the test was insufficient to begin with. I think the tolerance parameter is way too high; unset it if possible or use a much smaller value like 0.00001
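
(In terms of the underlying Scala estimator, which the SparkR wrapper delegates to, the suggestion amounts to something like the following hedged sketch; the layer sizes are hypothetical.)

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// A tighter convergence tolerance lets the optimizer run closer to the optimum
// instead of stopping early, so predictions no longer hinge on the seed.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3))  // hypothetical layer sizes
  .setTol(1e-5)                  // much smaller than the previous, overly loose value
  .setSeed(11L)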

Member Author:
I tried to set tol = 0.00001:

> head(summary$weights, 5)
[[1]]
[1] -24.28415

[[2]]
[1] 107.8701

[[3]]
[1] 16.86376

[[4]]
[1] 1.103736

[[5]]
[1] 9.244488

> mlpTestDF <- df
>   mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction"))
> head(mlpPredictions$prediction, 6)
[1] "1.0" "1.0" "1.0" "1.0" "0.0" "1.0"

Member Author:
If it is ok, I will commit this.

Member:
Much better -- those are actually the correct answers!

@SparkQA commented Mar 22, 2019

Test build #103827 has finished for PR 20793 at commit 3754ede.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 23, 2019

Test build #103838 has finished for PR 20793 at commit 5774ad6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Mar 23, 2019

Merged to master. That was a big, difficult change, thank you. This doesn't just fix the seed issue but also cleans up some tests along the way.

@srowen closed this in 027ed2d on Mar 23, 2019
@MaxGekk (Member, Author) commented Mar 23, 2019

Thank you to everybody for your reviews, and especially to @srowen for encouraging me to finish the PR.

@MaxGekk deleted the hash-buff-size branch on September 18, 2019 15:57
rshkv pushed a commit to palantir/spark that referenced this pull request Jun 18, 2020
…ize of the seed parameter

The hashSeed method allocates 64 bytes instead of 8. Other bytes are always zeros (thanks to default behavior of ByteBuffer). And they could be excluded from hash calculation because they don't differentiate inputs.

By running the existing tests - XORShiftRandomSuite

Closes apache#20793 from MaxGekk/hash-buff-size.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
HyukjinKwon pushed a commit that referenced this pull request Jun 10, 2021
### What changes were proposed in this pull request?

This PR fixes the examples of `rand` and `randn`.

### Why are the changes needed?

SPARK-23643 (#20793) fixes an issue related to the seed, which changes the results of `rand` and `randn`.
Now the results of `SELECT rand(0)` and `SELECT randn(null)` are `0.7604953758285915` and `1.6034991609278433` respectively, and they should be deterministic because the number of partitions is always 1 (the leaf node is `OneRowRelation`).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built the doc and confirmed it.
![rand-doc](https://user-images.githubusercontent.com/4736016/121359059-145a9b80-c96e-11eb-84c2-2f2b313614f3.png)

Closes #32844 from sarutak/rand-example.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>