
Conversation

@squito (Contributor) commented Aug 19, 2015

https://issues.apache.org/jira/browse/SPARK-10116

This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
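For illustration, a minimal sketch of the problem (an assumption based on the description above, not the exact Spark source): hashing the seed bytes to an `Int` and then widening to `Long` sign-extends, so bits 32-63 come out all zeros or all ones, never random.

    import java.nio.ByteBuffer
    import scala.util.hashing.MurmurHash3

    val seed = 42L
    val bytes = ByteBuffer.allocate(8).putLong(seed).array()
    // Hash to an Int, then widen to Long -- sketch of the pre-fix behavior.
    val hashed: Long = MurmurHash3.bytesHash(bytes).toLong
    // The high 32 bits are just the sign extension of bit 31.
    println(java.lang.Long.toBinaryString(hashed))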

@mengxr @mkolod

@squito changed the title from "XORShiftRandom.hashSeed is random in high bits" to "[SPARK-10116] [core] XORShiftRandom.hashSeed is random in high bits" on Aug 19, 2015
@srowen (Member) commented Aug 19, 2015

LGTM. There are better / faster RNGs in standard libs like commons math. @mengxr is it worth Spark having its own still?

@SparkQA commented Aug 19, 2015

Test build #41251 has finished for PR 8314 at commit 148d723.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito (Contributor, Author) commented Aug 19, 2015

hmm, some of the errors are just checks against the expected sequence from the RNG, which I can update (though some of these tests probably shouldn't require a "perfect" seed). But I'm a little perplexed by some of the failures, e.g. Word2Vec. I'm not an expert on that part of the code, but unless I've done something wrong here, it really shouldn't be so sensitive to the seed, right? Though it also seems to have a carefully chosen seed to make things work ...

@mengxr (Contributor) commented Aug 28, 2015

Some unit tests and Python doctests do depend on the seed, some more sensitively than others. I don't think requiring exact output is that bad, because it at least notifies us of changes in behavior. In Python, the doctests are used to generate documentation, where it is useful to show actual output rather than just checking bounds, e.g., https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L459.

There is a trade-off between having meaningful probabilistic bounds and keeping unit tests small. For example, in Word2Vec we could increase the training dataset size to reduce the variance of the model output and hence make it robust to the random seed, but that would increase the test time too.

That being said, I can help make those tests less sensitive. Do you mind filing JIRAs for each of them?

Regarding @srowen's question: if adding the commons-math3 dependency is not an issue and its RNG performs similarly to the one here, I think we shouldn't maintain our own. However, I'm still a little worried about compatibility issues between commons-math3 releases.

@srowen (Member) commented Aug 29, 2015

@mengxr commons-math 3.x is already a dependency in core. I don't have benchmarks handy, but my experience with its RNGs is that they're at least "more than fast enough" for any purpose I've had. I don't think the RNGs are changing, and they implement particular RNG algorithms like Well19937 that should not change over time. The downside of not using it is simply the higher probability of bugs, like this one, when implementing from scratch.
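For context, using one of those generators is a one-liner with commons-math3; a sketch (the class choice here is only an example, not a proposed change):

    import org.apache.commons.math3.random.Well19937c

    // Well19937c implements a fixed, published WELL-family algorithm,
    // so its output for a given seed should not drift across releases.
    val rng = new Well19937c(42L)
    val u = rng.nextDouble()     // uniform in [0, 1)
    val g = rng.nextGaussian()   // standard normal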

I think we need to get this fix in, in any event. @squito do you need a hand? I think simply loosening the tests is appropriate, even if it means bigger tests.

Review comment (Contributor):

style: `.map { seed =>`

@squito (Contributor, Author) commented Oct 20, 2015

@srowen @mengxr sorry for letting this lie around for so long; somehow I completely forgot to update it. I updated the tests with a few magic values, but didn't update the algorithms themselves. I tried a brute-force search over 1000 seeds for Word2Vec, but none of them made the test pass. And it looks like MultiLayerPerceptronClassifier just ignores its seed.

Fixing those remaining issues is probably a bit beyond me, but this can't be merged till we get those fixes. How would you like me to proceed here?

Review comment (Contributor, Author):

I thought this was a better way to leave a bigger margin for the low-count cases, rather than increasing the multiplier. As another random point, 4 * stdev + 4 also worked.

Review comment (Member):

Really, this expression relies on the assumption that the binomial and Poisson distributions are well approximated by a normal distribution. When the expected value is in the 10s or 20s, that probably isn't very accurate. This could be rewritten to compute the probability properly using PoissonDistribution and BinomialDistribution. However, I think it would be faster to just make sure that the RDD size is not less than 1000 or so in the tests above. (Also, the parts where it computes the expected count with math.ceil are unnecessary: there is no reason to require these to be integers, and they're another source of small errors. Let expected be a Double.)
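A sketch of what that exact computation could look like with commons-math3 (the helper names and the alpha value are illustrative, not from the patch):

    import org.apache.commons.math3.distribution.{BinomialDistribution, PoissonDistribution}

    // Exact acceptance interval covering the central (1 - alpha) mass,
    // instead of the "expected +/- k * stdev" normal approximation.
    def binomialRange(trials: Int, p: Double, alpha: Double = 1e-4): (Int, Int) = {
      val d = new BinomialDistribution(trials, p)
      (d.inverseCumulativeProbability(alpha / 2), d.inverseCumulativeProbability(1 - alpha / 2))
    }

    def poissonRange(mean: Double, alpha: Double = 1e-4): (Int, Int) = {
      val d = new PoissonDistribution(mean)
      (d.inverseCumulativeProbability(alpha / 2), d.inverseCumulativeProbability(1 - alpha / 2))
    }

    // e.g. sampling 1000 elements with fraction 0.1:
    val (lo, hi) = binomialRange(1000, 0.1)
    // assert(lo <= observedCount && observedCount <= hi)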

@SparkQA commented Oct 20, 2015

Test build #43988 has finished for PR 8314 at commit 9451cba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment (Member):

Although it might not normally be 100% valid to make a 64-bit hash out of two 32-bit hashes this way (I'm not sure that reusing the seed doesn't connect the bits in some subtle way), it's certainly a big improvement and probably entirely fine for this purpose. I'd still like to remove XORShiftRandom, but that can wait for another day.
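For reference, a sketch of the two-hash construction being discussed (written from the description above; the committed diff is authoritative):

    import java.nio.ByteBuffer
    import scala.util.hashing.MurmurHash3

    def hashSeed(seed: Long): Long = {
      val bytes = ByteBuffer.allocate(8).putLong(seed).array()
      val lowBits = MurmurHash3.bytesHash(bytes)
      // Reuse the first hash as the seed of the second, yielding 32 more bits.
      val highBits = MurmurHash3.bytesHash(bytes, lowBits)
      (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
    }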

@SparkQA commented Oct 24, 2015

Test build #1950 has finished for PR 8314 at commit 9451cba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Oct 24, 2015

@squito I think the test failures are legitimate, although they are likely merely due to relying too much on the exact random sequence before.

@squito (Contributor, Author) commented Oct 26, 2015

Hi @srowen -- I know the test failures are legitimate, but I'm not sure what to do about them.

I tried a brute-force search over 1000 seeds for Word2Vec, but none of them made the test pass. And it looks like MultiLayerPerceptronClassifier just ignores its seed.

I need a bit of guidance on how to fix things -- I dunno if the right solution is looser checks, or just updating the magic values on the assumption that the current implementation is correct. (And the right solution might be a bit beyond my abilities at the moment ...)

@srowen (Member) commented Oct 26, 2015

For JavaDataFrameSuite.testSampleBy, I think you can accept any value between 1 and 6 for key 0, and between 4 and 9 for key 1. These are not-improbable values given the test -- basically, how many of 33 elements do you choose when choosing with probability 0.1 and 0.2, respectively? (A quick check against the exact binomial distribution is sketched below.)
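A sanity check of those ranges with commons-math3 (a sketch; 33 rows per key is assumed from the test's 100 rows split over 3 keys):

    import org.apache.commons.math3.distribution.BinomialDistribution

    val key0 = new BinomialDistribution(33, 0.1)
    val key1 = new BinomialDistribution(33, 0.2)
    // Coverage probability of the suggested acceptance ranges:
    println(key0.cumulativeProbability(6) - key0.cumulativeProbability(0)) // P(1 <= X <= 6)
    println(key1.cumulativeProbability(9) - key1.cumulativeProbability(3)) // P(4 <= X <= 9)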

The Word2Vec test does look far too tight, I think. The others I'm not as sure about. I think StreamingKMeansSuite just needs more points. Let me see if I can provide a concrete suggestion on these.

@squito (Contributor, Author) commented Oct 27, 2015

I updated some of the tests -- I think just Word2Vec and MultilayerPerceptron are remaining. For JavaDataFrameSuite.testSampleBy I actually had to widen the range for key 1 up to 11, which I think is about the 95% interval. DataFrameSuite uses a test setup specific to SQL ... I copped out and just updated the magic values. And for StreamingKMeansSuite, I just needed to update it to choose the centers based on which one was actually closer.

@SparkQA commented Oct 27, 2015

Test build #44458 has finished for PR 8314 at commit 134533c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 2, 2015

PS it's really on me to get this one working and check up on the tests. I'm back on it today. Thank you for pushing it this far. I want to get this one in.

Review comment (Member):

Expected values are 200, 300, 500. The ranges are very wide; really I'd do +/- 50. The final range should be a bit bigger; 430-570 is OK.

Review comment (Contributor, Author):

Yes, sorry about that. I think I was playing around with those ranges with much smaller samples until I decided to just use 1000 elements, and didn't think about fixing the checks when they passed. I'll update.

@srowen (Member) commented Nov 2, 2015

@yinxusen I'm trying to help debug the test failure in Word2VecSuite added in c9d530e#diff-a081b952fe8b6e09a492ef37e157b456. I find that changing the seed at all makes the result quite different, and the test fails. However, the expected value is computed in a pretty clear way; I'm trying to figure out why a particular seed is required here. Are the word vectors chosen to work for the default seed of 42?

While looking at it I think we can fix one small thing in org.apache.spark.mllib.feature.Word2Vec. The initial random vector (which is what the seed affects) is not quite chosen uniformly:

    Array.fill[Float](vocabSize * vectorSize)((initRandom.nextFloat() - 0.5f) / vectorSize)

should be more like

    Array.fill[Float](vocabSize * vectorSize)((initRandom.nextGaussian().toFloat) / vectorSize)

This isn't really the issue, but it's something we could adjust. Of course, this also makes the test fail.

If we can't figure this out... I think we could hard-code this test to work with the new seed behavior, on the theory that it's probably a test issue.

@srowen (Member) commented Nov 2, 2015

The last failure is in MultilayerPerceptronClassifierSuite. The test is backwards, in that expected and actual are flipped. It should be:

assert(lrMetrics.confusionMatrix ~== mlpMetrics.confusionMatrix absTol 100)

That is, the output ought to look something like:

[info]   152.0  74.0   121.0  
[info]   65.0   167.0  64.0   
[info]   114.0  76.0   167.0  

... which is a little strange, since this shows a fairly poor classifier's confusion matrix.

There are 2 seeds in this test, and setting them to a range of values succeeds in every case for me. This one might be a matter of picking a different seed? Although it is a little funny that in this case the MLP classifier never predicted class 0. But that's a different issue.

I also note that LogisticRegressionSuite uses a regular java.util.Random instead of XORShiftRandom. That might be worth adjusting, unless it causes more failures.

@yinxusen (Contributor) commented Nov 3, 2015

@srowen Yes, the word vectors are chosen to work for seed 42. But the default value for Word2Vec in the ML package has been deleted, so I set the seed to 42 in the test suite.

@squito (Contributor, Author) commented Nov 4, 2015

How about I just hard-code new results for now, and open two new JIRAs for fixing those cases properly?

@srowen (Member) commented Nov 4, 2015

@squito I tend to agree with that approach, on the grounds that the point here is fixing the seeding. Making the tests less hard-coded is a separate issue if it needs to be. Maybe the perceptron test just needs a slightly different seed; maybe @yinxusen has a good answer for updating Word2Vec, but I don't think we should block this particular fix indefinitely otherwise.

@squito (Contributor, Author) commented Nov 4, 2015

@srowen good call on the extra seed for the multilayer perceptron -- I found one pretty easily for the input data that made the test pass. It could perhaps still be more general, but at least it's no worse than before. And I created https://issues.apache.org/jira/browse/SPARK-11502 for finding a better solution for Word2Vec.

@SparkQA commented Nov 4, 2015

Test build #45025 has finished for PR 8314 at commit 96eb00d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 4, 2015

Test build #45046 has finished for PR 8314 at commit e05a035.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito (Contributor, Author) commented Nov 5, 2015

Jenkins, retest this please

@SparkQA commented Nov 5, 2015

Test build #45093 has finished for PR 8314 at commit e05a035.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 5, 2015

Test build #45101 has finished for PR 8314 at commit 04214d9.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 5, 2015

@squito very nice work here. I think the SparkR tests finally need a similar treatment, and then it's good to go. Thankfully, I think the errors are quite plain about what they expect, and they do look like the expected differences you'd see from a seed change.

@SparkQA commented Nov 5, 2015

Test build #45120 has finished for PR 8314 at commit 5e35321.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment (Contributor, Author):

this one really confused me ... I thought I could figure out what the right values should be by running the equivalent thing in Scala, but I got totally different answers.

case class Data(key: String)
val data = (0 to 99).map { x => Data((x % 3).toString) }
val dataDF = sqlContext.createDataFrame(data)

scala> dataDF.stat.sampleBy("key", Map("0" -> 0.1, "1" -> 0.2), 0).groupBy("key").count().show()
+---+-----+
|key|count|
+---+-----+
|  0|    5|
|  1|    5|
+---+-----+

but whatever, these vals work :/

Review comment (Contributor):

This is because the default number of partitions is different in the Scala, Python, and R tests.
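One way to check that explanation (a hypothetical REPL experiment, not something run in this thread) is to fix the partition count explicitly before creating the DataFrame and watch the counts change:

    // Assuming the Data case class and data from the snippet above.
    val df2 = sqlContext.createDataFrame(sc.parallelize(data, 2))
    df2.stat.sampleBy("key", Map("0" -> 0.1, "1" -> 0.2), 0).groupBy("key").count().show()
    // Repeating with sc.parallelize(data, 4) should give different counts:
    // each partition draws from its own random stream, so the result
    // depends on how the rows are split across partitions.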

Review comment (Contributor, Author):

ah, great, glad there is a good explanation for this :)

@SparkQA commented Nov 5, 2015

Test build #45143 has finished for PR 8314 at commit 7ce14fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 6, 2015

Great, let's merge this one to master and 1.6. I'm not sure about 1.5, given that this is a low-priority fix and it ends up changing test behavior in several places.

asfgit pushed a commit that referenced this pull request Nov 6, 2015
https://issues.apache.org/jira/browse/SPARK-10116

This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.

mengxr mkolod

Author: Imran Rashid <irashid@cloudera.com>

Closes #8314 from squito/SPARK-10116.

(cherry picked from commit 49f1a82)
Signed-off-by: Sean Owen <sowen@cloudera.com>
asfgit closed this in 49f1a82 on Nov 6, 2015
@avulanov (Contributor) commented:

Could you notify the developer community about such changes in advance and point out the probable problematic places? That would be extremely helpful.

@srowen (Member) commented Nov 12, 2015

@avulanov what do you mean in this case? Just that default random behavior may change? That is a given, though.

@avulanov (Contributor) commented:

@srowen Yes, exactly! In my case, tests based on the RNG were failing on AMPLab's Jenkins while running correctly in my environment, because my version of Spark was one day older than the one on Jenkins. Indeed, it is worth updating to the latest version all the time :) However, it would be great to have notifications about changes after which one must update.

@srowen (Member) commented Nov 12, 2015

Yeah, the problem was that the seed processing was actually wrong, so it had to change. I doubt it will change much. But stochastic behavior isn't part of the APIs, so you would want to write tests that don't depend on a particular seed anyway. I don't think this is something special to note.

@avulanov (Contributor) commented:

I was thinking about removing the stochastic part from the tests. However, the issue is that I need to test that stochastic initialization of the parameters for machine learning actually works, i.e. that the optimization converges with such parameters. Could you suggest a better way of doing this, as opposed to using the seed?
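One pattern that fits @srowen's earlier suggestion (a sketch only; the trainer/evaluator names below are placeholders, not a Spark API): run the stochastic test under several fixed seeds and require convergence for each, so the test exercises the random initialization without baking one magic seed into the expected output.

    // Hypothetical test body: every seed must converge, so no single
    // "lucky" seed determines whether the test passes.
    for (seed <- Seq(0L, 1L, 42L, 1234L)) {
      val model = trainer.setSeed(seed).fit(trainingData)
      assert(evaluator.evaluate(model.transform(testData)) >= minAcceptableScore)
    }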
