Statistically verify result #288
Conversation
Force-pushed from 62e34ec to e5ce474
Eum, turned out to be trivial to add support for aborting a test that cannot satisfy coverage. So I added that 🎉
"Test coverage cannot be reached, aborted" | ||
Nothing | ||
[] | ||
|
||
else if tests >= fromIntegral (propertyTestLimit cfg) then | ||
-- we've hit the test limit |
One thing to note here is that just as we can stop early if we'll never reach the desired coverage - we can also stop early if we've already reached it!
I guess that is sort of a design decision - should we stop if we've already reached the desired coverage?
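For illustration, here is a minimal sketch of the two early-stop conditions being discussed, assuming predicates shaped like the `confidenceSuccess`/`confidenceFailure` helpers mentioned later in this thread (the exact signature here is hypothetical, not the PR's actual code):

```haskell
-- Hypothetical shape of the early-termination decision: stop with a
-- verified pass if the coverage is statistically certain to hold, stop
-- with a failure if it is statistically certain never to hold, and
-- otherwise keep generating tests.
earlyStop :: TestCount -> Confidence -> Coverage CoverCount -> Maybe Bool
earlyStop tests confidence coverage
  | confidenceSuccess tests confidence coverage = Just True   -- coverage verified
  | confidenceFailure tests confidence coverage = Just False  -- coverage unreachable
  | otherwise                                   = Nothing     -- keep testing
```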
Actually - turned out that there was a bug hiding if we didn't do this. Fixed it - I'll give this a rest now (oh and the build failure above was spurious due to not being able to pull packages)
Force-pushed from e1d73b9 to 3efa763
```haskell
-- for 1 in 10^9 tests.
newtype Confidence =
  Confidence {
      unConfidence :: Integer
```
Do you envision using this with confidence greater than 10^18? (`Int64`)
QuickCheck gives a pretty good explanation of this in their docs:
> If you are using checkCoverage as part of a test suite, you should be careful not to set certainty too low. If you want, say, a 1% chance of a false positive during a project's lifetime, then certainty should be set to at least 100 * m * n, where m is the number of uses of cover in the test suite, and n is the number of times you expect the test suite to be run during the project's lifetime. The default value is chosen to be big enough for most projects.
So for instance - let's say we're writing a big project with 100 occurrences of `cover`, then we'd get:
100 * 100 * n = confidence
Let's say that we expect to run the project 20 times per day per developer with 5 developers. That would make it:
100 * 100 * (20 * 5 * daysWithoutFalsePositives) = confidence
<=>
1 000 000 * daysWithoutFalsePositives = confidence
Let's say we want 5 years with 1% chance of false positives:
1 000 000 * daysWithoutFalsePositives = confidence
<=>
1 000 000 * 365 * 5 = 1 825 000 000
Which is veeeeeeery roughly 2 * 10^9. So yeah, maybe not. Would you rather use `Int64`?
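As a sanity check on the arithmetic above (the numbers here are just the ones from this example, not anything from the library):

```haskell
-- certainty >= 100 * m * n, per the QuickCheck docs quoted above
exampleCertainty :: Integer
exampleCertainty =
  let
    coverOccurrences = 100          -- m: uses of cover in the suite
    runsPerDay       = 20 * 5       -- 20 runs/day for each of 5 developers
    days             = 365 * 5      -- 5 years without a false positive
  in
    100 * coverOccurrences * (runsPerDay * days)

-- exampleCertainty == 1825000000, i.e. roughly 2 * 10^9,
-- which still fits comfortably in an Int64 (max ~9.2 * 10^18).
```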
Yeah we're talking age of the universe scales at 10^18.
📈 All looks pretty legit, although the math/stats is a bit beyond me. @HuwCampbell, if you're around, would you mind having a look at this?
I'll try and have a look before the weekend.
Sorry, I don't think the statistical reasoning is correct; you appear to have mixed different metrics inappropriately. I personally would not have gone with a Wilson metric (but whatever, really), and would instead have used a Beta distribution prior to approximate the confidence interval. On a user-experience note, I think the conditionals would be better off a different way as well. Your example has this:

```haskell
prop_without_confidence :: Property
prop_without_confidence =
  withConfidence (10^9) . withTests 1000000 . property $ do
    number <- forAll (Gen.int $ Range.linear 1 10)
    cover 60 "number == 1" $ number == 1
```

To me this says: run the test at the very least 1000000 times (because I want to check it that well) while also ensuring confidence of 10^9. But the way you've done the short circuit breaks that minimum test requirement; it could essentially be over in 50 tests, which isn't cool IMO.
When it comes to the number of tests run - we can simply change the implementation to adhere to that minimum. The current implementation simply mimics what QC does for its confidence tests - totally open to changing that.

When it comes to the Wilson score - from what I can tell, it's an improvement on the normal approximation, especially with smaller sample sizes. But I'm no expert. Since it's whatever for you - then I propose we keep it as is.

In the end, I'm not sure what your review boils down to when saying "I don't think the statistical reasoning is correct". What changes would you like to see in order for us to make progress on this? :)
```haskell
-- run tests until confidence has been reached, or fail if it cannot be reached:
prop_p1 =
  withConfidence (10^9) . property $ do
    ...

-- run at least X tests and present confidence:
prop_p2 =
  withTests X . property $ do
    ...

-- run at least X times and present confidence, failing if the test hasn't reached it:
prop_p3 =
  withConfidence (10^9) . withTests X . property $ do
    ...
```

Example behaviors ☝️
You're absolutely right, that wasn't very helpful; my apologies. I'll try to come back to this tonight and give more concrete thoughts.
Sorry about being brusque before.
We seem to have an issue with early stopping, or interim analysis.
It's quite plausible that we'd see the random walk spuriously affecting the results, as we're not taking multiple measurements into account (the metrics include the implicit assumption that we're only testing at the end of the trial).
Most of the literature is through the medical research field, but papers like https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5052936/ might be relevant.
I'll come back to this with some more suggestions later.
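For a rough sense of why interim looks matter: by the union bound, if each of m looks at the accumulating data is itself a test with false-positive probability α, the overall false-positive probability can grow up to m·α:

```latex
P(\text{some look fires spuriously})
  \;\le\; \sum_{i=1}^{m} \alpha \;=\; m\alpha ,
```

so checking the coverage statistic after every single test (m ≈ n looks) can inflate the intended error rate by a factor of roughly n.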
```haskell
(fromIntegral $ unCoverCount labelAnnotation)
(fromIntegral tests)
(1 / fromIntegral (unConfidence confidence))
  >= 0.9 * (unCoverPercentage labelMinimum / 100.0)
```
I'm wary of this 0.9; it seems very strange. Could you please justify this?
This is just a tolerance - we could include that in the confidence data type. That would be clearer but makes it a bit more annoying to deal with. I'll extract it to a hardcoded variable and we can discuss how to make the API for changing the confidence more ergonomic.
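For the sake of discussion, one shape the data type could take if the tolerance lived there instead of in a hardcoded constant - `confidenceTolerance` is an invented field name for illustration, not anything in this PR:

```haskell
-- Hypothetical: Confidence carrying its own tolerance. Note this has to
-- become a plain data type, since a newtype can only hold one field.
data Confidence =
  Confidence {
      unConfidence :: Integer        -- e.g. 10^9 for a 1-in-10^9 false positive
    , confidenceTolerance :: Double  -- e.g. 0.9, the slack factor discussed above
    }
```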
@felixmulder, @HuwCampbell, perhaps there's something off with the above calculation? See test below:
```haskell
prop_with_confidence :: Property
prop_with_confidence =
  verifiedTermination . withConfidence (10^9) . withTests 1000000 . property $ do
    number <- forAll (Gen.int $ Range.constant 1 2)
    cover 50 "number == 1" $ number == 1
    cover 50 "number == 2" $ number == 2
```
This will run forever, even though the actual coverage has been reached.
I understand that this snippet is outdated, but to me it looks like it's around this part that the issue happens: `confidenceSuccess` and `confidenceFailure`.
The "whole" confidence interval must be greater than the desired coverage. Here, running more tests just shrinks the interval, from roughly [49.9%, 50.1%] to [49.99%, 50.01%]. With more tests we just get more 9s.

Put another way, the cover query states that the lower bound must be greater than the desired coverage, which it clearly is not: the 49.99999% never gets above 50%.
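To make that concrete, here's a back-of-envelope computation of the standard Wilson lower bound (z ≈ 6 being roughly the two-sided normal quantile for a 10^-9 acceptance), showing the bound creeping toward, but never crossing, 50% when the observed rate is exactly 50%:

```haskell
-- Wilson lower bound for an observed success rate of exactly 0.5,
-- as a function of the number of tests n and the probit z.
wilsonLowerAtHalf :: Double -> Double -> Double
wilsonLowerAtHalf n z =
  let p = 0.5
  in (p + z * z / (2 * n) - z * sqrt (p * (1 - p) / n + z * z / (4 * n * n)))
       / (1 + z * z / n)

-- wilsonLowerAtHalf 1e6 6 ~ 0.4970
-- wilsonLowerAtHalf 1e8 6 ~ 0.4997
-- More tests only add 9s; the bound never reaches 0.5.
```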
It almost looks as if QuickCheck's `checkCoverage` (when using `cover`) is far less stringent, in order to improve performance.

Here's a side-by-side comparison of Hedgehog and QuickCheck in this regard: if I use the exact same percentage, Hedgehog runs forever, while QuickCheck terminates.

@HuwCampbell, I would be very much interested in your thoughts on this. Thank you.
```haskell
wilsonLowerBound :: Integer -> Integer -> Double -> Double
wilsonLowerBound k n a =
  wilson k n $ invnormcdf (a / 2)
```
I don't think this is the right calling convention for `wilson`. `z` is the probit for the error bounds, and is included in the denominator of wilson as `z^2`. I think you still need to use `1 - a / 2`, but with a negative sign in `wilson`, not a plus.

This is throwing off the lower bound by quite a margin, especially with confidence intervals around 10^9. Comparing with a beta distribution with alpha 5, beta 10, and 10^7, it's 0.045 when it should be more like 0.008.
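For reference, the standard form of the Wilson score interval under discussion, with p̂ = k/n and z the normal quantile; the plus/minus here is exactly the sign the calling convention above gets wrong:

```latex
\mathrm{lower},\,\mathrm{upper}
  = \frac{\hat{p} + \frac{z^2}{2n}
          \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n} + \frac{z^2}{4 n^2}}}
         {1 + \frac{z^2}{n}},
\qquad \hat{p} = \frac{k}{n}, \quad z = \Phi^{-1}\!\left(1 - \tfrac{\alpha}{2}\right)
```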
> Yeah we're talking age of the universe scales at 10^18.
It looks like QuickCheck somewhat sidesteps this by only running the statistical test at tests 200, 400, 800, etc. I had a look at QC's code... I'll think about it harder, but I do believe they've misimplemented their lower bound calculation as well.
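A sketch of that kind of geometric check schedule (an illustration of the idea, not QuickCheck's actual code):

```haskell
import Data.Bits ((.&.))

-- Only run the statistical check when the test count is 200 * 2^i
-- (200, 400, 800, ...), so the number of interim looks grows as
-- O(log n) rather than O(n).
shouldCheckCoverage :: Int -> Bool
shouldCheckCoverage tests =
  tests >= 200 && tests `mod` 200 == 0 && isPow2 (tests `div` 200)
  where
    isPow2 k = k > 0 && k .&. (k - 1) == 0
```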
Ok, I believe I have figured out what's going on. I still think that the way the Wilson function is being called is incorrect (or at least extremely counter-intuitive), but the symmetry of the normal distribution means the numbers come out the same. I would personally like to see the type signature changed to `wilsonBounds :: Integer -> Integer -> Double -> (Double, Double)`, with the plus/minus made explicit. After that, if we tackle the early stopping problem, I think this will be in good shape.
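A sketch of what that could look like, assuming the `invnormcdf` already used in this PR; the Wilson algebra is standard, but treat this as an illustration rather than the final implementation:

```haskell
wilsonBounds :: Integer -> Integer -> Double -> (Double, Double)
wilsonBounds k n a =
  let
    p  = fromIntegral k / fromIntegral n
    n' = fromIntegral n
    z  = invnormcdf (1 - a / 2)   -- positive probit; the sign is explicit below
    midpoint    = p + z * z / (2 * n')
    offset      = z * sqrt (p * (1 - p) / n' + z * z / (4 * n' * n'))
    denominator = 1 + z * z / n'
  in
    ( (midpoint - offset) / denominator   -- lower bound
    , (midpoint + offset) / denominator ) -- upper bound
```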
Cool! Thanks for the comments @HuwCampbell - I've been pretty busy, but I guess we're in no rush. I'll try to get to this some time next week :)
@HuwCampbell: took care of the early stops - now it works as advertised above, i.e. the example behaviors listed earlier.
With regards to changing the signature to `wilsonBounds :: Integer -> Integer -> Double -> (Double, Double)`: I'm trying to grok how you'd like that to look. Would the returned tuple be the lower and upper bound? If so - the implementation doesn't really use them together (hence the split into separate lower and upper functions). However, I made the negative sign explicit :)
Yes, that's what I was thinking. I don't think it matters that they're not used together; Haskell is lazy, so it won't do extra work.
Fair point 😄 Should be gtg now then!

Ping pong @HuwCampbell :)

Would love to see this land! ❤️
This also fixes an issue where tests would give false negatives when coverage was below the specified percentages. What was happening was that we would check whether the result was verified - it would say no, and thus give us an incorrect error message: "Insufficient coverage."
…sts. This isn't technically correct, but it's quite a lot better than not doing it.
Force-pushed from 3d3b794 to edb9fcc
Force-pushed from edb9fcc to 6ae62a0
Alright - CI seems happy now. The functionality now behaves as in the example behaviors above.

I recognize that …
🥇
Oh my 😭 😄 🎉

Yay!
```haskell
-- | Configuration for a property test.
--
data PropertyConfig =
  PropertyConfig {
      propertyTestLimit :: !TestLimit
```
removing this field broke the API between 1.0.1 and 1.0.2 😞
Where's MIMA when you need it 😢
Was it removed by accident? Also—it's time to start adding some more automated tests.
I think everything in `Hedgehog.Internal.Property` is considered internal.
Was it removed for a reason though? (We do try not to break other people's code if possible.)
This PR adds the ability to statistically verify that a result is correct. It does not yet add the ability to cancel running tests if they can never reach the specified coverage.