This repository has been archived by the owner on Nov 3, 2023. It is now read-only.
Patch description
Hi,
The test `test_xlm` in `test_transformers.py` has an assertion bound (`self.assertLessEqual(test['ppl'], 1.3)`) that is too loose, so potential bugs in the code could still pass the original test.

To quantify this, I ran some experiments: I generated multiple mutations of the source code under test, then ran each mutant and the original code 100 times to build a distribution of their outputs. I used the Kolmogorov-Smirnov (KS) test to find mutants that produced a different distribution from the original, and I use those mutants as a proxy for bugs that could be introduced. The graph below shows the distribution of the original code together with the mutants whose distribution differs.
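For anyone who wants to reproduce the comparison step, here is a minimal sketch of a two-sample KS check in plain Python. It is not the script used for the experiments above: the function names `ks_statistic` and `ks_pvalue` are hypothetical, the p-value uses the standard asymptotic approximation, and the sample values in the demo are synthetic placeholders rather than measured `ppl` outputs.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every value equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value for the KS statistic (an approximation, not exact)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # The alternating series converges slowly here; the p-value is ~1.
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * s))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins for 100 runs of the original code and of one mutant.
    original = [rng.gauss(1.005, 0.003) for _ in range(100)]
    mutant = [rng.gauss(1.05, 0.01) for _ in range(100)]
    d = ks_statistic(original, mutant)
    p = ks_pvalue(d, len(original), len(mutant))
    # A small p-value marks the mutant as producing a different distribution.
    print(f"D={d:.3f}, p={p:.3g}, differs={p < 0.05}")
```

The 0.05 significance threshold in the demo is an assumption; the description above does not state which threshold was used.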
Here we see that the bound of `1.3` is too loose, since the original distribution (in orange) lies well below `1.3`. The graph also shows many mutants (a proxy for bugs) below the bound, which is undesirable because the test should aim to catch potential bugs in the code. I quantify the "bug detection" ability of this assertion by varying the bound in the trade-off graph below.

In this graph, I plot the mutant catch rate (the ratio of mutant outputs that fail the test) and the original pass rate (the ratio of original outputs that pass the test). The original bound of `1.3` (red dotted line) has a catch rate of 0.

To improve this test, I propose tightening the bound to `1.02` (the blue dotted line). The new bound has a catch rate of 0.17 (an increase of 0.17 over the original) while still having a pass rate above 99% (the test is not flaky: I ran the updated test 500 times and observed a pass rate above 99%). I think this is a good balance between improving the bug-detection ability of the test and keeping its flakiness low.

Furthermore, I observed that the asserted value in this test never goes below 1. Would it be helpful to include an additional assertion, for example `assertGreaterEqual(test['ppl'], 1)`? Some mutants can push the value below 1, and the current assertion setup would miss those mutants/bugs. Adding this lower-bound assertion improves the catch rate further, to 0.23.

Do you think this makes sense? Please let me know if this looks good or if you have any other suggestions or questions.
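To make the two metrics concrete, here is a small sketch of how the catch rate and pass rate could be computed for a candidate bound, with an optional lower bound to model the extra `assertGreaterEqual` check. The helper names `pass_rate` and `catch_rate` are hypothetical, and the numbers in the demo are made-up placeholders, not the measured outputs behind the figures quoted above.

```python
def pass_rate(outputs, upper, lower=None):
    """Fraction of outputs that satisfy upper (and optional lower) bound."""
    passed = [x for x in outputs
              if x <= upper and (lower is None or x >= lower)]
    return len(passed) / len(outputs)


def catch_rate(mutant_outputs, upper, lower=None):
    """Fraction of mutant outputs that fail the assertion(s)."""
    return 1.0 - pass_rate(mutant_outputs, upper, lower)


if __name__ == "__main__":
    # Placeholder values only; real inputs would be the recorded ppl outputs.
    original = [1.004, 1.006, 1.003, 1.008, 1.005]
    mutants = [0.97, 1.01, 1.03, 1.005, 1.25, 1.015]

    for upper in (1.3, 1.02):
        print(f"upper={upper}: "
              f"original pass rate={pass_rate(original, upper):.2f}, "
              f"mutant catch rate={catch_rate(mutants, upper):.2f}")

    # Adding the lower bound also catches mutants that fall below 1.
    print("with lower bound 1:",
          f"{catch_rate(mutants, 1.02, lower=1.0):.2f}")
```

Sweeping `upper` over a range of values and plotting both rates would give the kind of trade-off curve described above.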
My Environment:
My ParlAI experiment SHA: `4b1d07d0eeb14f849ad930eeb001327f9bfc2db1`