[SPARK-23643][CORE][SQL][ML] Shrinking the buffer in hashSeed up to size of the seed parameter #20793
Conversation
Good catch, LGTM.

Jenkins, ok to test
Does the result change? A quick check in the REPL:

```scala
scala> import java.nio.ByteBuffer
import java.nio.ByteBuffer

scala> import scala.util.hashing.MurmurHash3
import scala.util.hashing.MurmurHash3

scala> def hashSeed(seed: Long): Long = {
     |   val bytes = ByteBuffer.allocate(java.lang.Long.SIZE).putLong(seed).array()
     |   val lowBits = MurmurHash3.bytesHash(bytes)
     |   val highBits = MurmurHash3.bytesHash(bytes, lowBits)
     |   (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
     | }
hashSeed: (seed: Long)Long

scala> hashSeed(100)
res3: Long = 852394178374189935

scala> def hashSeed2(seed: Long): Long = {
     |   val bytes = ByteBuffer.allocate(java.lang.Long.BYTES).putLong(seed).array()
     |   val lowBits = MurmurHash3.bytesHash(bytes)
     |   val highBits = MurmurHash3.bytesHash(bytes, lowBits)
     |   (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
     | }
hashSeed2: (seed: Long)Long

scala> hashSeed2(100)
res7: Long = 1088402058313200430
```
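The two definitions differ only in the buffer size: `java.lang.Long.SIZE` is the width of a `Long` in bits (64), while `java.lang.Long.BYTES` is its width in bytes (8), so the first version hashes 56 trailing zero bytes along with the seed. A minimal sketch of the difference:

```scala
import java.nio.ByteBuffer

object BufferSizes extends App {
  // Long.SIZE is the bit width (64); Long.BYTES is the byte width (8).
  val oversized = ByteBuffer.allocate(java.lang.Long.SIZE).putLong(100L).array()
  val exact = ByteBuffer.allocate(java.lang.Long.BYTES).putLong(100L).array()

  println(oversized.length) // 64 -- the trailing 56 bytes are always zero
  println(exact.length)     // 8  -- only the bytes of the seed itself
}
```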
Test build #88156 has finished for PR 20793 at commit
Ah, the results are different because the number of operations is different. It may be an issue like #20630. I am curious why tests fail when the seed is changed. Of course, I understand the sequence of random values must be reproducible for a given seed within a package or implementation.
At least some tests expect particular values as the result of sample/random: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala#L550-L564
The question is whether the exact output of pseudo-random/sample operations is guaranteed by the public API or not. Probably not. Here was an earlier attempt to make tests tolerant to the seed: #8314
Test build #88160 has finished for PR 20793 at commit
I am closing the PR because it changes external behavior. Maybe I will create a new one for Spark 3.0.

…t not be expected" This reverts commit 177afcc.
Test build #102090 has finished for PR 20793 at commit
Test build #102091 has finished for PR 20793 at commit
jenkins, retest this, please
Test build #102101 has finished for PR 20793 at commit
@srowen Somehow many ML tests depend on the exact values produced for a given seed.
The tests need to be deterministic, but you are correct that some tests go a little too far in asserting the exact result of the random sampling. If the result is still correct, it is reasonable to just update the new expected value. But it could be fine to loosen some tests too if they are overly specific without much value.
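Where exact values are not essential, one option is a tolerant comparison; a minimal sketch with ScalaTest/Scalactic (the suite name and values are illustrative, not from the PR):

```scala
import org.scalactic.{Equality, TolerantNumerics}
import org.scalatest.funsuite.AnyFunSuite

class TolerantSamplingSuite extends AnyFunSuite {
  // Compare doubles up to an epsilon instead of asserting exact bit patterns.
  implicit val doubleEq: Equality[Double] =
    TolerantNumerics.tolerantDoubleEquality(1e-8)

  test("statistic is stable regardless of the exact random sequence") {
    val expected = 0.5          // illustrative expected statistic
    val computed = 0.5 + 1e-10  // stand-in for a value computed from a sample
    assert(computed === expected) // passes: difference is within 1e-8
  }
}
```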
For example, tests like this ...
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala (outdated review thread, resolved)
Test build #103768 has started for PR 20793 at commit
test this please
Test build #103787 has finished for PR 20793 at commit
Test build #103795 has finished for PR 20793 at commit
srowen left a comment:
Almost there, just a few more items to check here
mllib/src/test/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringSuite.scala (outdated review thread, resolved)
```diff
  mlpTestDF <- df
  mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction"))
- expect_equal(head(mlpPredictions$prediction, 6), c("0.0", "1.0", "1.0", "1.0", "1.0", "1.0"))
+ expect_equal(head(mlpPredictions$prediction, 6), c("2.0", "2.0", "2.0", "2.0", "2.0", "2.0"))
```
This change concerns me; those predictions are all wrong now according to the data. It probably means the test was insufficient to begin with. I think the tolerance parameter is way too high; unset it if possible or use a much smaller value like 0.00001
I tried to set `tol = 0.00001`:

```r
> head(summary$weights, 5)
[[1]]
[1] -24.28415
[[2]]
[1] 107.8701
[[3]]
[1] 16.86376
[[4]]
[1] 1.103736
[[5]]
[1] 9.244488
> mlpTestDF <- df
> mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction"))
> head(mlpPredictions$prediction, 6)
[1] "1.0" "1.0" "1.0" "1.0" "0.0" "1.0"
```
If it is ok, I will commit this.
Much better -- those are actually the correct answers!
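For reference, the Scala ML API exposes the same knob via `setTol`; a hypothetical sketch (the layer sizes and the `train` DataFrame are illustrative, not taken from this PR):

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Tighter convergence tolerance, analogous to tol = 0.00001 in the SparkR test.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3)) // illustrative: input, two hidden, output layers
  .setSeed(11L)
  .setTol(1e-5)
  .setMaxIter(100)

// val model = mlp.fit(train)   // `train`: an assumed DataFrame with features/label
```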
Test build #103827 has finished for PR 20793 at commit
Test build #103838 has finished for PR 20793 at commit
Merged to master. That was a big, difficult change; thank you. This doesn't just fix the seed issue but also cleans up some tests along the way.
Thank you to everybody for your reviews, especially @srowen, who encouraged me to finish the PR.
…ize of the seed parameter

The hashSeed method allocates 64 bytes instead of 8. The other bytes are always zeros (thanks to the default behavior of ByteBuffer) and can be excluded from the hash calculation because they don't differentiate inputs.

Tested by running the existing tests - XORShiftRandomSuite.

Closes apache#20793 from MaxGekk/hash-buff-size.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?

This PR fixes the examples of `rand` and `randn`.

### Why are the changes needed?

SPARK-23643 (#20793) fixes an issue related to the seed, which changes the results of `rand` and `randn`. Now the results of `SELECT rand(0)` and `SELECT randn(null)` are `0.7604953758285915` and `1.6034991609278433` respectively, and they should be deterministic because the number of partitions is always 1 (the leaf node is `OneRowRelation`).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built the doc and confirmed it. [screenshot: rand-example.png]

Closes #32844 from sarutak/rand-example.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
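Those documented values can be checked from a Spark shell; a quick sketch (the expected outputs are the ones quoted in the commit message above, assuming a build that includes SPARK-23643):

```scala
// `spark` is the SparkSession provided by spark-shell.
// With a single partition (OneRowRelation), the seeded results are deterministic.
spark.sql("SELECT rand(0)").show(false)     // expect 0.7604953758285915
spark.sql("SELECT randn(null)").show(false) // expect 1.6034991609278433
```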
What changes were proposed in this pull request?

The hashSeed method allocates 64 bytes instead of 8. The other bytes are always zeros (thanks to the default behavior of ByteBuffer) and can be excluded from the hash calculation because they don't differentiate inputs.
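Reconstructed from the REPL snippets earlier in the thread, the fix reduces to a one-line change in the buffer allocation (shown as a diff sketch):

```diff
- val bytes = ByteBuffer.allocate(java.lang.Long.SIZE).putLong(seed).array()
+ val bytes = ByteBuffer.allocate(java.lang.Long.BYTES).putLong(seed).array()
```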
How was this patch tested?

By running the existing tests, `XORShiftRandomSuite`.