
Conversation

@squito (Contributor) commented Aug 19, 2015

https://issues.apache.org/jira/browse/SPARK-10116

This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
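For illustration, a minimal sketch of the problem (an assumption based on the description above, not the exact Spark source): hashing the seed bytes to an `Int` and then widening to `Long` sign-extends, so bits 32-63 come out all zeros or all ones, never random.

    import java.nio.ByteBuffer
    import scala.util.hashing.MurmurHash3

    val seed = 42L
    val bytes = ByteBuffer.allocate(8).putLong(seed).array()
    // Hash to an Int, then widen to Long -- sketch of the pre-fix behavior.
    val hashed: Long = MurmurHash3.bytesHash(bytes).toLong
    // The high 32 bits are just the sign extension of bit 31.
    println(java.lang.Long.toBinaryString(hashed))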

@mengxr @mkolod

@squito changed the title from "XORShiftRandom.hashSeed is random in high bits" to "[SPARK-10116] [core] XORShiftRandom.hashSeed is random in high bits" on Aug 19, 2015
@srowen (Member) commented Aug 19, 2015

LGTM. There are better / faster RNGs in standard libs like commons math. @mengxr is it worth Spark having its own still?

@SparkQA commented Aug 19, 2015

Test build #41251 has finished for PR 8314 at commit 148d723.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito (Contributor, Author) commented Aug 19, 2015

hmm, some of the errors are just checks against the expected sequence from the RNG, which I can update (though some of these tests probably shouldn't require a "perfect" seed). But I'm a little perplexed by some of the failures, e.g. Word2Vec. I'm not an expert on that part of the code, but unless I've done something wrong here, it really shouldn't be so sensitive to the seed, right? Though it also seems to have a carefully chosen seed to make things work ...

@mengxr (Contributor) commented Aug 28, 2015

Some unit tests and Python doctests do depend on the seed, some more sensitively than others. I don't think requiring exact output is that bad, because it at least notifies us of changes in behavior. In Python, the doctests are used to generate documentation, where it is useful to show actual output rather than just checking bounds, e.g., https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L459.

There is a trade-off between having meaningful probabilistic bounds and keeping unit tests small. For example, in Word2Vec we could increase the training dataset size to reduce the variance of the model output and hence make it robust to the random seed, but that would increase the test time too.

That being said, I can help make those tests less sensitive. Do you mind filing JIRAs for each of them?

Regarding @srowen's question: if adding the commons-math3 dependency is not an issue and its RNG performs similarly to the one here, I think we shouldn't maintain our own. However, I'm still a little worried about compatibility issues between commons-math3 releases.

@srowen (Member) commented Aug 29, 2015

@mengxr commons-math 3.x is already a dependency in core. I don't have benchmarks handy, but my experience with its RNGs is that they're at least "more than fast enough" for any purpose I've had. I don't think the RNGs are changing, and they implement particular RNG algorithms like Well19937 that should not change over time. The downside of not using it is simply the higher probability of bugs, like this one, when implementing from scratch.
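For context, using one of those generators is a one-liner with commons-math3; a sketch (the class choice here is only an example, not a proposed change):

    import org.apache.commons.math3.random.Well19937c

    // Well19937c implements a fixed, published WELL-family algorithm,
    // so its output for a given seed should not drift across releases.
    val rng = new Well19937c(42L)
    val u = rng.nextDouble()     // uniform in [0, 1)
    val g = rng.nextGaussian()   // standard normal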

I think we need to get this fix in, in any event. @squito do you need a hand? I think simply loosening the tests is appropriate, even if it means bigger tests.

Review comment (Contributor):

style: `.map { seed =>`

@squito (Contributor, Author) commented Oct 20, 2015

@srowen @mengxr sorry for letting this lie around for so long; somehow I completely forgot to update it. I updated the tests with a few magic values, but didn't update the algorithms themselves. I tried a brute-force search over 1000 seeds for Word2Vec, but none of them made the test pass. And it looks like MultiLayerPerceptronClassifier just ignores its seed.

Fixing those remaining issues is probably a bit beyond me, but this can't be merged till we get those fixes. How would you like me to proceed here?

Review comment (Contributor, Author):

I thought this was a better way to leave a bigger margin for the low-count cases, rather than increasing the multiplier. As another random point, 4 * stdev + 4 also worked.

Review comment (Member):

Really, this expression relies on the assumption that the binomial and Poisson distributions are well approximated by a normal distribution. When the expected value is in the 10s or 20s, that probably isn't very accurate. This could be rewritten to compute the probability properly using PoissonDistribution and BinomialDistribution. However, I think it would be faster to just make sure that the RDD size is not less than 1000 or so in the tests above. (Also, the parts where it computes the expected count with math.ceil are unnecessary: there is no reason to require these to be integers, and they're another source of small errors. Let expected be a Double.)
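A sketch of what that exact computation could look like with commons-math3 (the helper names and the alpha value are illustrative, not from the patch):

    import org.apache.commons.math3.distribution.{BinomialDistribution, PoissonDistribution}

    // Exact acceptance interval covering the central (1 - alpha) mass,
    // instead of the "expected +/- k * stdev" normal approximation.
    def binomialRange(trials: Int, p: Double, alpha: Double = 1e-4): (Int, Int) = {
      val d = new BinomialDistribution(trials, p)
      (d.inverseCumulativeProbability(alpha / 2), d.inverseCumulativeProbability(1 - alpha / 2))
    }

    def poissonRange(mean: Double, alpha: Double = 1e-4): (Int, Int) = {
      val d = new PoissonDistribution(mean)
      (d.inverseCumulativeProbability(alpha / 2), d.inverseCumulativeProbability(1 - alpha / 2))
    }

    // e.g. sampling 1000 elements with fraction 0.1:
    val (lo, hi) = binomialRange(1000, 0.1)
    // assert(lo <= observedCount && observedCount <= hi)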

@SparkQA commented Oct 20, 2015

Test build #43988 has finished for PR 8314 at commit 9451cba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment (Member):

Although it might not normally be 100% valid to make a 64-bit hash out of two 32-bit hashes this way (I'm not sure that reusing the seed doesn't connect the bits in some subtle way), it's certainly a big improvement and probably entirely fine for this purpose. I'd still like to remove XORShiftRandom, but that can wait for another day.
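For reference, a sketch of the two-hash construction being discussed (written from the description above; the committed diff is authoritative):

    import java.nio.ByteBuffer
    import scala.util.hashing.MurmurHash3

    def hashSeed(seed: Long): Long = {
      val bytes = ByteBuffer.allocate(8).putLong(seed).array()
      val lowBits = MurmurHash3.bytesHash(bytes)
      // Reuse the first hash as the seed of the second, yielding 32 more bits.
      val highBits = MurmurHash3.bytesHash(bytes, lowBits)
      (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
    }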

@SparkQA commented Oct 24, 2015

Test build #1950 has finished for PR 8314 at commit 9451cba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Oct 24, 2015

@squito I think the test failures are legitimate, although they are likely merely due to relying too much on the exact random sequence before.

@squito (Contributor, Author) commented Oct 26, 2015

Hi @srowen -- I know the test failures are legitimate, but I'm not sure what to do about them.

I tried a brute-force search over 1000 seeds for Word2Vec, but none of them made the test pass. And it looks like MultiLayerPerceptronClassifier just ignores its seed.

I need a bit of guidance on how to fix things -- I dunno if the right solution is looser checks, or just updating the magic values on the assumption that the current implementation is correct. (And the right solution might be a bit beyond my abilities at the moment ...)

@srowen (Member) commented Oct 26, 2015

For JavaDataFrameSuite.testSampleBy, I think you can accept any value between 1 and 6 for key 0, and between 4 and 9 for key 1. These are not-improbable values given the test -- basically, how many of 33 elements do you choose when choosing with probability 0.1 and 0.2, respectively? (A quick check against the exact binomial distribution is sketched below.)
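A sanity check of those ranges with commons-math3 (a sketch; 33 rows per key is assumed from the test's 100 rows split over 3 keys):

    import org.apache.commons.math3.distribution.BinomialDistribution

    val key0 = new BinomialDistribution(33, 0.1)
    val key1 = new BinomialDistribution(33, 0.2)
    // Coverage probability of the suggested acceptance ranges:
    println(key0.cumulativeProbability(6) - key0.cumulativeProbability(0)) // P(1 <= X <= 6)
    println(key1.cumulativeProbability(9) - key1.cumulativeProbability(3)) // P(4 <= X <= 9)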

The Word2Vec test does look far too tight, I think. The others I'm not as sure about. I think StreamingKMeansSuite just needs more points. Let me see if I can provide a concrete suggestion on these.

@squito (Contributor, Author) commented Oct 27, 2015

I updated some of the tests -- I think just Word2Vec and MultilayerPerceptron are remaining. For JavaDataFrameSuite.testSampleBy I actually had to widen the range for key 1 up to 11, which I think is about the 95% interval. DataFrameSuite uses a test setup specific to SQL ... I copped out and just updated the magic values. And for StreamingKMeansSuite, I just needed to update it to choose the centers based on which one was actually closer.

@SparkQA commented Oct 27, 2015

Test build #44458 has finished for PR 8314 at commit 134533c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 2, 2015

PS it's really on me to get this one working and check up on the tests. I'm back on it today. Thank you for pushing it this far. I want to get this one in.

Review comment (Member):

Expected values are 200, 300, 500. The ranges are very wide; really I'd do +/- 50. The final range should be a bit bigger; 430-570 is OK.

Review comment (Contributor, Author):

Yes, sorry about that. I think I was playing around with those ranges with much smaller samples until I decided to just use 1000 elements, and didn't think about fixing the checks when they passed. I'll update.

@srowen (Member) commented Nov 2, 2015

@yinxusen I'm trying to help debug the test failure in Word2VecSuite added in c9d530e#diff-a081b952fe8b6e09a492ef37e157b456. I find that changing the seed at all makes the result quite different, and the test fails. However, the expected value is computed in a pretty clear way; I'm trying to figure out why a particular seed is required here. Are the word vectors chosen to work for the default seed of 42?

While looking at it I think we can fix one small thing in org.apache.spark.mllib.feature.Word2Vec. The initial random vector (which is what the seed affects) is not quite chosen uniformly:

    Array.fill[Float](vocabSize * vectorSize)((initRandom.nextFloat() - 0.5f) / vectorSize)

should be more like

    Array.fill[Float](vocabSize * vectorSize)((initRandom.nextGaussian().toFloat) / vectorSize)

This isn't really the issue, but it's something we could adjust. Of course, this also makes the test fail.

If we can't figure this out... I think we could hard-code this test to work with the new seed behavior, on the theory that it's probably a test issue.

@srowen (Member) commented Nov 2, 2015

The last failure is in MultilayerPerceptronClassifierSuite. The test is backwards, in that expected and actual are flipped. It should be:

assert(lrMetrics.confusionMatrix ~== mlpMetrics.confusionMatrix absTol 100)

That is, the output ought to look something like:

[info]   152.0  74.0   121.0  
[info]   65.0   167.0  64.0   
[info]   114.0  76.0   167.0  

... which is a little strange, since this shows a fairly poor classifier's confusion matrix.

There are 2 seeds in this test, and setting them to a range of values succeeds in every case for me. This one might be a matter of picking a different seed? Although it is a little funny that in this case the MLP classifier never predicted class 0. But that's a different issue.

I also note that LogisticRegressionSuite uses a regular java.util.Random instead of XORShiftRandom. That might be worth adjusting, unless it causes more failures.

@yinxusen (Contributor) commented Nov 3, 2015

@srowen Yes, the word vectors are chosen to work for seed 42. But the default value for Word2Vec in the ML package has been deleted, so I set the seed to 42 in the test suite.

@squito (Contributor, Author) commented Nov 4, 2015

How about I just hard-code new results for now, and open two new JIRAs for fixing those cases properly?

@srowen (Member) commented Nov 4, 2015

@squito I tend to agree with that approach, on the grounds that the point here is fixing the seeding. Making the tests less hard-coded is a separate issue if it needs to be. Maybe the perceptron test just needs a slightly different seed; maybe @yinxusen has a good answer for updating Word2Vec, but I don't think we should block this particular fix indefinitely otherwise.

@squito (Contributor, Author) commented Nov 4, 2015

@srowen good call on the extra seed for the multilayer perceptron -- I found one pretty easily for the input data that made the test pass. It could perhaps still be more general, but at least it's no worse than before. And I created https://issues.apache.org/jira/browse/SPARK-11502 for finding a better solution for Word2Vec.

@SparkQA commented Nov 4, 2015

Test build #45025 has finished for PR 8314 at commit 96eb00d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 4, 2015

Test build #45046 has finished for PR 8314 at commit e05a035.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito (Contributor, Author) commented Nov 5, 2015

Jenkins, retest this please

@SparkQA commented Nov 5, 2015

Test build #45093 has finished for PR 8314 at commit e05a035.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 5, 2015

Test build #45101 has finished for PR 8314 at commit 04214d9.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 5, 2015

@squito very nice work here. I think the SparkR tests finally need a similar treatment, and then it's good to go. Thankfully, I think the errors are quite plain about what they expect, and they do look like the expected differences you'd see from a seed change.

@SparkQA commented Nov 5, 2015

Test build #45120 has finished for PR 8314 at commit 5e35321.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment (Contributor, Author):

this one really confused me ... I thought I could figure out what the right values should be by running the equivalent thing in Scala, but I got totally different answers.

case class Data(key: String)
val data = (0 to 99).map { x => Data((x % 3).toString) }
val dataDF = sqlContext.createDataFrame(data)

scala> dataDF.stat.sampleBy("key", Map("0" -> 0.1, "1" -> 0.2), 0).groupBy("key").count().show()
+---+-----+
|key|count|
+---+-----+
|  0|    5|
|  1|    5|
+---+-----+

but whatever, these vals work :/

Review comment (Contributor):

This is because the default number of partitions is different in the Scala, Python, and R tests.
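One way to check that explanation (a hypothetical REPL experiment, not something run in this thread) is to fix the partition count explicitly before creating the DataFrame and watch the counts change:

    // Assuming the Data case class and data from the snippet above.
    val df2 = sqlContext.createDataFrame(sc.parallelize(data, 2))
    df2.stat.sampleBy("key", Map("0" -> 0.1, "1" -> 0.2), 0).groupBy("key").count().show()
    // Repeating with sc.parallelize(data, 4) should give different counts:
    // each partition draws from its own random stream, so the result
    // depends on how the rows are split across partitions.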

Review comment (Contributor, Author):

ah, great, glad there is a good explanation for this :)

@SparkQA commented Nov 5, 2015

Test build #45143 has finished for PR 8314 at commit 7ce14fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 6, 2015

Great, let's merge this one to master and 1.6. I'm not sure about 1.5, given that this is a low-priority fix and it ends up changing test behavior in several places.

asfgit pushed a commit that referenced this pull request Nov 6, 2015
https://issues.apache.org/jira/browse/SPARK-10116

This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.

mengxr mkolod

Author: Imran Rashid <irashid@cloudera.com>

Closes #8314 from squito/SPARK-10116.

(cherry picked from commit 49f1a82)
Signed-off-by: Sean Owen <sowen@cloudera.com>
asfgit closed this in 49f1a82 on Nov 6, 2015
@avulanov (Contributor) commented:

Could you notify the developer community about such changes in advance and point out the probable problematic places? That would be extremely helpful.

@srowen (Member) commented Nov 12, 2015

@avulanov what do you mean in this case? Just that default random behavior may change? That is a given, though.

@avulanov (Contributor) commented:

@srowen Yes, exactly! In my case, tests based on the RNG were failing on AMPLab's Jenkins while running correctly in my environment, because my version of Spark was one day older than the one on Jenkins. Indeed, it is worth updating to the latest version all the time :) However, it would be great to have notifications about changes after which one must update.

@srowen (Member) commented Nov 12, 2015

Yeah, the problem was that the seed processing was actually wrong, so it had to change. I doubt it will change much. But stochastic behavior isn't part of the APIs, so you would want to write tests that don't depend on a particular seed anyway. I don't think this is something special to note.

@avulanov (Contributor) commented:

I was thinking about removing the stochastic part from the tests. However, the issue is that I need to test that stochastic initialization of the parameters for machine learning actually works, i.e. that the optimization converges with such parameters. Could you suggest a better way of doing this, as opposed to using the seed?
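One pattern that fits @srowen's earlier suggestion (a sketch only; the trainer/evaluator names below are placeholders, not a Spark API): run the stochastic test under several fixed seeds and require convergence for each, so the test exercises the random initialization without baking one magic seed into the expected output.

    // Hypothetical test body: every seed must converge, so no single
    // "lucky" seed determines whether the test passes.
    for (seed <- Seq(0L, 1L, 42L, 1234L)) {
      val model = trainer.setSeed(seed).fit(trainingData)
      assert(evaluator.evaluate(model.transform(testData)) >= minAcceptableScore)
    }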
