Tighten test_parallel bound #3278
Conversation
Thanks for the investigation! IIRC the original bound of 20 was deliberately loose because this test has a history of flakiness. If you say the test is now stable at 2, that sounds good.
Codecov Report
@@            Coverage Diff             @@
##           develop    #3278      +/-   ##
===========================================
+ Coverage    78.91%   79.37%    +0.45%
===========================================
  Files           68       68
  Lines        11781    11781
===========================================
+ Hits          9297     9351       +54
+ Misses        2484     2430       -54
Continue to review full report at Codecov.
All tests passed; let me re-run them for N=2 :) Looks good so far – perhaps this greater stability is due to something we fixed in 4.0.0, @gojomo?
A more rigorous & experimentally-driven tuning of these constraints is a good idea. However, historically it's been tough, on any one developer system or from a small number of runs, to fully predict the range of results that even (presumably) bug-free code might generate, given the many stochastic elements in the code. In particular, for unclear reasons, the CI machines have tended to show greater run-to-run variance than developer machines. So bounds that (as a contrived example inspired by vague memories of prior investigations) fail 0 out of 1000 runs on my developer setup, and thus appear non-flaky, turn out to fail more than 1 out of 10 times on the CI machine. (Perhaps their particular high-tenancy systems have far more thread-scheduling variance caused by other unpredictable loads on the same machine?)

So where these constraints are loose, it's often because of past problems with distracting, but non-informative, near-miss test failures. And we have a number of such stochastic threshold tests – so altogether, even rare failures across all such tests may generate a lot of distraction for developers whose unrelated changes trigger failures.

So: a full vetting of new bounds would ideally involve many trials on the CI machine (& even at a variety of times, rather than in one short tight loop). The evaluation code & results that drive the new choice should themselves be version-controlled/documented. And all such thresholds, not just the few that catch arbitrary notice, should be similarly tuned.
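One possible shape for such a version-controlled evaluation harness, as a hedged sketch (the test id, run count, and output file below are illustrative assumptions, not part of this PR):

```python
# Sketch: repeatedly run the stochastic test on the CI machine and record
# outcomes, so a proposed bound can be judged from many trials rather than
# a single developer-machine loop.
import json
import subprocess

TEST_ID = "gensim/test/test_word2vec.py::TestWord2VecModel::test_parallel"  # illustrative

results = []
for run in range(100):  # ideally spread across CI jobs and times of day
    proc = subprocess.run(["pytest", TEST_ID, "-q"], capture_output=True, text=True)
    results.append({"run": run, "passed": proc.returncode == 0})

with open("bound_tuning_results.json", "w") as f:
    json.dump(results, f, indent=2)  # commit alongside the chosen threshold
```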
Looks good to me. Thank you for the rigorous PR @austereantelope and congrats on your first contribution to the gensim repo!
Hi,

The test `test_parallel` in `test_word2vec.py` has an assertion bound (`self.assertLess(neighbor_rank, 20)`) that is too loose. This means a potential bug in the code could still pass the original test.

To quantify this I conducted some experiments: I generated multiple mutations of the source code under test and ran each mutant, as well as the original code, 100 times to build a distribution of their outputs. I used a KS-test to find mutants that produced a distribution different from the original, and used those mutants as a proxy for bugs that could be introduced. In the graph below I show the distribution of both the original code and the mutants with a differing distribution.
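As a rough sketch, the mutant-selection step looks like this (the function and variable names are illustrative, not the actual experiment code; `ks_2samp` is SciPy's two-sample KS test):

```python
# Sketch: flag mutants whose output distribution differs from the original code's.
from scipy.stats import ks_2samp

def find_divergent_mutants(original_ranks, mutant_ranks, alpha=0.05):
    """original_ranks: 100 neighbor_rank values from the unmodified code.
    mutant_ranks: dict mapping mutant id -> 100 neighbor_rank values from that mutant."""
    divergent = []
    for mutant_id, ranks in mutant_ranks.items():
        statistic, p_value = ks_2samp(original_ranks, ranks)
        if p_value < alpha:  # distribution differs -> keep this mutant as a proxy bug
            divergent.append(mutant_id)
    return divergent
```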
Here we see that the bound of `20` is too loose, since the original distribution (in orange) is much less than `20`. Furthermore, in this graph we can observe that there are many mutants (proxies for bugs) below the bound as well, which is undesirable since the test should aim to catch potential bugs in the code. I quantify the "bug detection" ability of this assertion by varying the bound in the trade-off graph below.

In this graph, I plot the mutant catch rate (the ratio of mutant outputs that fail the test) and the original pass rate (the ratio of original outputs that pass the test). The original bound of `20` (red dotted line) has a catch rate of 0.5.
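For reference, both rates can be computed from the collected distributions along these lines (a sketch with illustrative names, not code from this PR):

```python
# Sketch: evaluate a candidate assertion bound against the collected runs.
def evaluate_bound(bound, original_ranks, divergent_mutant_ranks):
    # Original pass rate: fraction of original-code runs satisfying assertLess(rank, bound).
    pass_rate = sum(r < bound for r in original_ranks) / len(original_ranks)
    # Mutant catch rate: fraction of proxy-bug runs that would fail the assertion.
    mutant_runs = [r for ranks in divergent_mutant_ranks.values() for r in ranks]
    catch_rate = sum(r >= bound for r in mutant_runs) / len(mutant_runs)
    return pass_rate, catch_rate
```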
To improve this test, I propose to tighten the bound to `2` (the blue dotted line). The new bound has a catch rate of 0.8 (a +0.3 increase compared to the original) while still having a 100% pass rate (the test is not flaky: I ran the updated test 500 times with no failures). I think this is a good balance between improving the bug-detection ability of the test and keeping its flakiness low.

Do you guys think this makes sense? Please let me know if this looks good or if you have any other suggestions or questions.
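The proposed change itself is a one-line tightening of the assertion (the surrounding test body is paraphrased here, not copied verbatim from `test_word2vec.py`):

```python
# Inside test_parallel, where neighbor_rank is the rank of the expected
# word among the model's nearest neighbours:
# before: self.assertLess(neighbor_rank, 20)
self.assertLess(neighbor_rank, 2)
```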
My Environment:
my Gensim Experiment SHA: 6e362663f23967f3c1931e2cb18d3d25f92aabb5