
Flaky test_random.test_shuffle #10277

Closed
marcoabreu opened this issue Mar 27, 2018 · 12 comments

@marcoabreu
Contributor

marcoabreu commented Mar 27, 2018

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/542/pipeline

======================================================================
FAIL: test_random.test_shuffle
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\envs\py2\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\unittest\common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\unittest\test_random.py", line 625, in test_shuffle
    testSmall(mx.nd.arange(0, 3), 100, 20000)
  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\unittest\test_random.py", line 599, in testSmall
    assert abs(1. * count[str(mx.nd.array(p))] / repeat2 - 1. / math.factorial(data.shape[0])) < 0.01
AssertionError:
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1485655931 to reproduce.
--------------------- >> end captured logging << ---------------------

https://issues.apache.org/jira/browse/MXNET-236

@marcoabreu
Contributor Author

#10048

@reminisce @asitstands

@reminisce
Contributor

reminisce commented Mar 28, 2018

@asitstands Could you verify whether the 0.01 threshold is too strict for checking the uniform distribution?

@asitstands
Contributor

@reminisce With the 0.01 criterion, the probability that the test fails is approximately 0.001 even if shuffle is implemented correctly, assuming the quality of the random numbers is good enough. So, on average, it could fail once per 1000 runs of the test, which is a somewhat high probability. To lower it, we can increase the number of samples used to test the uniformity of the shuffling, currently 20000. If we increase it to 40000, the failure probability drops below 10^-6, but the running time also increases; on my machine it goes from 40s to 55s. If that is not acceptable, another option is to relax the criterion to 0.015, which also brings the failure probability below 10^-6, but then the test becomes less strict. I think increasing the number of samples would be better. If the test still fails frequently with the increased number of samples, then there must be some problem in the implementation of shuffle or in the random number generators. What do you think?
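For reference, a back-of-the-envelope version of this estimate: assuming each permutation's count is Binomial(n, p) with p = 1/6 for the 3-element array (3! = 6 permutations), a normal approximation plus a union bound over the permutations reproduces the quoted failure probabilities. The helper name `failure_prob` is made up for illustration; this is a sketch, not the exact calculation above.

```python
import math

def failure_prob(n, threshold, k=math.factorial(3)):
    """Approximate probability that the uniformity check fails for a correct
    shuffle, i.e. that |count/n - p| >= threshold for any of k permutations."""
    p = 1.0 / k
    sigma = math.sqrt(p * (1.0 - p) / n)            # std. dev. of the empirical frequency
    z = threshold / sigma
    two_sided_tail = math.erfc(z / math.sqrt(2.0))  # P(|Z| >= z) for a standard normal
    return k * two_sided_tail                       # union bound over the k permutations

print(failure_prob(20000, 0.01))   # ~9e-4, roughly one failure per 1000 runs
print(failure_prob(40000, 0.01))   # ~5e-7, below 10^-6
print(failure_prob(20000, 0.015))  # ~8e-8, the relaxed-criterion alternative
```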

@reminisce
Contributor

@asitstands Can you try feeding a fixed seed to the with_seed decorator and see whether the resulting distribution is fixed? If so, we can make that unit test deterministic.
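For anyone following along, a minimal sketch of what pinning the seed would look like. The with_seed decorator lives in tests/python/unittest/common.py; the exact test body is abbreviated here, and the seed value is just the one from the captured log above:

```python
# Hypothetical sketch, assuming with_seed accepts an optional fixed seed
# as it is used elsewhere in the test suite.
from common import with_seed  # tests/python/unittest/common.py
import mxnet as mx

@with_seed(1485655931)  # fixed seed: np/mx/python RNGs are re-seeded identically each run
def test_shuffle():
    data = mx.nd.arange(0, 3)
    shuffled = mx.nd.random.shuffle(data)  # the sampled permutations are now the same every run
    # ... same counting and uniformity checks as before ...
```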

@asitstands
Contributor

@reminisce Seeding with with_seed produces a deterministic result. The test becomes significantly weaker compared to the collective result produced by repeated runs with different seeds. However, if time consumption is important, I think fixing the seed is reasonable. I'll make the change.

@marcoabreu
Contributor Author

Time consumption for tests is not that important, but I think we should ask another question: If the generation of random numbers takes such a significant amount of time, wouldn't this also harm runtime performance in real applications?

@asitstands
Contributor

asitstands commented Mar 29, 2018

@marcoabreu In the case of shuffle, the time consumption comes mainly from the memory copies, not from the generation of random numbers. Random shuffling of n elements needs O(n) swaps (exactly n − 1 for the Fisher-Yates algorithm). The memory copying is inherent to shuffling; we cannot avoid it if we want to shuffle anything. Likewise, testing the statistical validity of a random shuffle needs a large number of samples, which is inherent to the problem; we cannot avoid that without sacrificing something, such as running time or strictness. Strictly speaking, serious statistical testing of shuffling probably should not be part of the unit tests at all and deserves a separate test, but in practice we have to settle for a somewhat less strict check within the unit tests.
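To illustrate the swap-count point, here is a minimal Fisher-Yates sketch in plain Python (not MXNet's implementation): shuffling n elements costs exactly n − 1 swaps, so the memory traffic is inherent to the operation.

```python
import random

def fisher_yates(arr):
    # Walk from the back; each position is swapped into place exactly once,
    # for n - 1 swaps total on an n-element list.
    for i in range(len(arr) - 1, 0, -1):
        j = random.randint(0, i)         # uniform index in [0, i]
        arr[i], arr[j] = arr[j], arr[i]
    return arr

print(fisher_yates(list(range(10))))
```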

@marcoabreu
Contributor Author

marcoabreu commented Mar 29, 2018

I see, thanks for elaborating; this definitely makes sense. I assume we can't improve this with parallelisation, right?

@asitstands
Contributor

asitstands commented Mar 30, 2018

@marcoabreu The implementation of shuffle already uses parallelism. On GPU, it uses a parallel sorting algorithm, and the memory swaps are also parallelized. On CPU, it uses GNU's parallel implementation of std::random_shuffle if the array is 1D. It is much faster on GPU than on CPU, and you can observe the gain from CPU parallelization by controlling OMP_NUM_THREADS.
Only for higher-dimensional arrays on CPU is the implementation sequential. Optimizing that case is not trivial due to the diversity of data shapes and hardware. I have tested a naive parallelization of the memory copy for that case; you can see the result in this comment. You asked about that test, but I forgot to mention you in the comment :)
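A quick, hypothetical way to observe the CPU-side effect of OMP_NUM_THREADS on the 1D case (the script name and array size are made up; the variable must be set before the process starts):

```python
# bench_shuffle.py -- run as, e.g.:
#   OMP_NUM_THREADS=1 python bench_shuffle.py
#   OMP_NUM_THREADS=8 python bench_shuffle.py
import time
import mxnet as mx

data = mx.nd.arange(0, 10**7)                  # large 1D array on CPU
mx.nd.random.shuffle(data).wait_to_read()      # warm-up
start = time.time()
for _ in range(10):
    mx.nd.random.shuffle(data).wait_to_read()  # force completion of the async op
print("avg shuffle time: %.4f s" % ((time.time() - start) / 10))
```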

@marcoabreu
Contributor Author

Thanks for this great explanation and the detailed benchmarks, great job! In that case, I'd propose going with the increase to 40000 samples. We can go even higher if you like: the slaves have 72 vCPUs available (though I can't judge the GPU performance for this use case), and since this task is IO-bound, we shouldn't really see much of an increase in time consumption, as all threads can prepare data in parallel and do other work while waiting on the memory controller. What do you think?

While pinning the seed would make things more comfortable, random seeding has uncovered quite a lot of issues for us in the past.

@asitstands
Contributor

asitstands commented Mar 30, 2018

Since a longer running time is acceptable, I agree that using 40000 samples would be better than fixing the random seed. Stricter tests are better as long as we have the resources for them. On the other hand, I think the 0.01 criterion with 40000 samples is practically enough; a stricter test would cost much more. I reproduced the failure with the reported seed and confirmed that the test passes with 40000 samples. This test will fail again someday even with the increased number of samples, but the frequency is very low. What do you think, @reminisce?

@marcoabreu
Contributor Author

I think 10^-6 sounds quite reasonable, and the time consumption is still within acceptable boundaries :)

@reminisce please review the PR as you have more in-depth knowledge
