
Flaky test_random.test_shuffle #10277

Closed
marcoabreu opened this issue Mar 27, 2018 · 12 comments

@marcoabreu
Contributor

marcoabreu commented Mar 27, 2018

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/542/pipeline

======================================================================
FAIL: test_random.test_shuffle
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\envs\py2\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\unittest\common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\unittest\test_random.py", line 625, in test_shuffle
    testSmall(mx.nd.arange(0, 3), 100, 20000)
  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\unittest\test_random.py", line 599, in testSmall
    assert abs(1. * count[str(mx.nd.array(p))] / repeat2 - 1. / math.factorial(data.shape[0])) < 0.01
AssertionError:
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1485655931 to reproduce.
--------------------- >> end captured logging << ---------------------

https://issues.apache.org/jira/browse/MXNET-236

@marcoabreu
Contributor Author

#10048

@reminisce @asitstands

@reminisce
Contributor

reminisce commented Mar 28, 2018

@asitstands Could you verify whether the 0.01 threshold is too strict for checking the uniform distribution?

@asitstands
Contributor

@reminisce With the 0.01 criterion, the probability that the test fails is approximately 0.001 even if shuffle is implemented correctly, assuming the quality of the random numbers is good enough. So, on average, it could fail once per 1000 runs of the test, which is a somewhat high probability. To lower it, we can increase the number of samples used to test the uniformity of the shuffling, currently 20000. If we increase it to 40000, the failure probability drops below 10^-6, but the running time also increases; on my machine it goes from 40s to 55s. If that is not acceptable, another option is to relax the criterion to 0.015, which also brings the failure probability below 10^-6, but then the test becomes less strict. I think increasing the number of samples would be better. If the test still fails frequently with the increased number of samples, then there must be some problem in the implementation of shuffle or in the random number generators. What do you think?
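For reference, a back-of-the-envelope version of this estimate: assuming each permutation's count is Binomial(n, p) with p = 1/6 for the 3-element array (3! = 6 permutations), a normal approximation plus a union bound over the permutations reproduces the quoted failure probabilities. The helper name `failure_prob` is made up for illustration; this is a sketch, not the exact calculation above.

```python
import math

def failure_prob(n, threshold, k=math.factorial(3)):
    """Approximate probability that the uniformity check fails for a correct
    shuffle, i.e. that |count/n - p| >= threshold for any of k permutations."""
    p = 1.0 / k
    sigma = math.sqrt(p * (1.0 - p) / n)            # std. dev. of the empirical frequency
    z = threshold / sigma
    two_sided_tail = math.erfc(z / math.sqrt(2.0))  # P(|Z| >= z) for a standard normal
    return k * two_sided_tail                       # union bound over the k permutations

print(failure_prob(20000, 0.01))   # ~9e-4, roughly one failure per 1000 runs
print(failure_prob(40000, 0.01))   # ~5e-7, below 10^-6
print(failure_prob(20000, 0.015))  # ~8e-8, the relaxed-criterion alternative
```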

@reminisce
Contributor

@asitstands Can you try feeding a fixed seed to the with_seed decorator and see whether the resulting distribution is fixed? If so, we can make that unit test deterministic.
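For anyone following along, a minimal sketch of what pinning the seed would look like. The with_seed decorator lives in tests/python/unittest/common.py; the exact test body is abbreviated here, and the seed value is just the one from the captured log above:

```python
# Hypothetical sketch, assuming with_seed accepts an optional fixed seed
# as it is used elsewhere in the test suite.
from common import with_seed  # tests/python/unittest/common.py
import mxnet as mx

@with_seed(1485655931)  # fixed seed: np/mx/python RNGs are re-seeded identically each run
def test_shuffle():
    data = mx.nd.arange(0, 3)
    shuffled = mx.nd.random.shuffle(data)  # the sampled permutations are now the same every run
    # ... same counting and uniformity checks as before ...
```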

@asitstands
Contributor

@reminisce Seeding with with_seed produces a deterministic result. The test becomes significantly weaker compared to the collective result produced by repeated runs with different seeds. However, if time consumption is important, I think fixing the seed is reasonable. I'll make the change.

@marcoabreu
Contributor Author

Time consumption for tests is not that important, but I think we should ask another question: If the generation of random numbers takes such a significant amount of time, wouldn't this also harm runtime performance in real applications?

@asitstands
Contributor

asitstands commented Mar 29, 2018

@marcoabreu In the case of shuffle, the time consumption comes mainly from the memory copies, not from the generation of random numbers. Random shuffling of n elements needs O(n) swaps (exactly n − 1 for the Fisher-Yates algorithm). The memory copying is inherent to shuffling; we cannot avoid it if we want to shuffle anything. Likewise, testing the statistical validity of a random shuffle needs a large number of samples, which is inherent to the problem; we cannot avoid that without sacrificing something, such as running time or strictness. Strictly speaking, serious statistical testing of shuffling probably should not be part of the unit tests at all and deserves a separate test, but in practice we have to settle for a somewhat less strict check within the unit tests.
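To illustrate the swap-count point, here is a minimal Fisher-Yates sketch in plain Python (not MXNet's implementation): shuffling n elements costs exactly n − 1 swaps, so the memory traffic is inherent to the operation.

```python
import random

def fisher_yates(arr):
    # Walk from the back; each position is swapped into place exactly once,
    # for n - 1 swaps total on an n-element list.
    for i in range(len(arr) - 1, 0, -1):
        j = random.randint(0, i)         # uniform index in [0, i]
        arr[i], arr[j] = arr[j], arr[i]
    return arr

print(fisher_yates(list(range(10))))
```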

@marcoabreu
Contributor Author

marcoabreu commented Mar 29, 2018

I see, thanks for elaborating; this definitely makes sense. I assume we can't improve this with parallelisation, right?

@asitstands
Contributor

asitstands commented Mar 30, 2018

@marcoabreu The implementation of shuffle already uses parallelism. On GPU, it uses a parallel sorting algorithm, and the memory swaps are also parallelized. On CPU, it uses GNU's parallel implementation of std::random_shuffle if the array is 1D. It is much faster on GPU than on CPU, and you can observe the gain from CPU parallelization by controlling OMP_NUM_THREADS.
Only for higher-dimensional arrays on CPU is the implementation sequential. Optimizing that case is not trivial due to the diversity of data shapes and hardware. I have tested a naive parallelization of the memory copy for that case; you can see the result in this comment. You asked about that test, but I forgot to mention you in the comment :)
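A quick, hypothetical way to observe the CPU-side effect of OMP_NUM_THREADS on the 1D case (the script name and array size are made up; the variable must be set before the process starts):

```python
# bench_shuffle.py -- run as, e.g.:
#   OMP_NUM_THREADS=1 python bench_shuffle.py
#   OMP_NUM_THREADS=8 python bench_shuffle.py
import time
import mxnet as mx

data = mx.nd.arange(0, 10**7)                  # large 1D array on CPU
mx.nd.random.shuffle(data).wait_to_read()      # warm-up
start = time.time()
for _ in range(10):
    mx.nd.random.shuffle(data).wait_to_read()  # force completion of the async op
print("avg shuffle time: %.4f s" % ((time.time() - start) / 10))
```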

@marcoabreu
Contributor Author

Thanks for this great explanation and the detailed benchmarks, great job! In that case, I'd propose going with the increase to 40000 samples. We can go even higher if you like: the slaves have 72 vCPUs available (though I can't judge the GPU performance for this use case), and since this task is IO-bound, we shouldn't really see much of an increase in time consumption, as all threads can prepare data in parallel and do other work while waiting on the memory controller. What do you think?

While pinning the seed would make things more comfortable, random seeding has uncovered quite a lot of issues for us in the past.

@asitstands
Contributor

asitstands commented Mar 30, 2018

Since a longer running time is acceptable, I agree that using 40000 samples would be better than fixing the random seed. Stricter tests are better as long as we have the resources for them. On the other hand, I think the 0.01 criterion with 40000 samples is practically enough; a stricter test would cost much more. I reproduced the failure with the reported seed and confirmed that the test passes with 40000 samples. This test will fail again someday even with the increased number of samples, but the frequency is very low. What do you think, @reminisce?

@marcoabreu
Contributor Author

I think 10^-6 sounds quite reasonable, and the time consumption is still within acceptable boundaries :)

@reminisce please review the PR as you have more in-depth knowledge
