
Statistically verify result #288

Merged

Conversation

felixmulder
Contributor

This PR adds the ability to statistically verify that a result is correct. It does not yet add the ability to cancel running tests if they can never reach the specified coverage.

@felixmulder felixmulder force-pushed the statistically-verify-result branch from 62e34ec to e5ce474 on May 11, 2019 16:19
@felixmulder
Contributor Author

Um - turned out to be trivial to add support for aborting a test that cannot satisfy coverage. So I added that 🎉

"Test coverage cannot be reached, aborted"
Nothing
[]

else if tests >= fromIntegral (propertyTestLimit cfg) then
-- we've hit the test limit
Contributor Author

One thing to note here is that just as we can stop early if we'll never reach the desired coverage - we can also stop early if we've already reached it!

I guess that is sort of a design decision - should we stop if we've already reached the desired coverage?

Contributor Author

@felixmulder felixmulder May 11, 2019

Actually - it turned out there was a bug hiding if we didn't do this. Fixed it - I'll give this a rest now (oh, and the build failure above was spurious, due to not being able to pull packages).

@felixmulder felixmulder force-pushed the statistically-verify-result branch from e1d73b9 to 3efa763 on May 11, 2019 17:39
-- for 1 in 10^9 tests.
newtype Confidence =
  Confidence {
      unConfidence :: Integer
Member

Do you envision using this with confidence greater than 10^18? (Int64)

Contributor Author

QuickCheck gives a pretty good explanation of this in their docs:

If you are using checkCoverage as part of a test suite, you should be careful not to set certainty too low. If you want, say, a 1% chance of a false positive during a project's lifetime, then certainty should be set to at least 100 * m * n, where m is the number of uses of cover in the test suite, and n is the number of times you expect the test suite to be run during the project's lifetime. The default value is chosen to be big enough for most projects.

So for instance - let's say we're writing a big project with 100 occurrences of cover, then we'd get:

100 * 100 * n = confidence

Let's say that we expect to run the project 20 times per day per developer with 5 developers. That would make it:

100 * 100 * (20 * 5 * daysWithoutFalsePositives) = confidence
<=>
1 000 000 * daysWithoutFalsePositives = confidence

Let's say we want 5 years with 1% chance of false positives:

1 000 000 * daysWithoutFalsePositives = confidence
<=>
1 000 000 * 365 * 5 = 1 825 000 000

Which is veeeeeeery roughly 2*10^9. So yeah, maybe not. Would you rather use Int64?
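For concreteness, a throwaway sketch of that guideline (recommendedConfidence is a hypothetical helper for illustration, not part of this PR):

-- certainty >= 100 * m * n gives roughly a 1% lifetime chance of a
-- false positive, where m is the number of cover uses and n is the
-- expected number of test-suite runs over the project's lifetime.
recommendedConfidence :: Integer -> Integer -> Integer
recommendedConfidence coverUses suiteRuns =
  100 * coverUses * suiteRuns

-- The example above: 100 cover uses, 5 devs * 20 runs/day, 5 years.
-- recommendedConfidence 100 (20 * 5 * 365 * 5) == 1825000000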

Member

Yeah we're talking age of the universe scales at 10^18.

@jacobstanley
Member

📈 All looks pretty legit, although the math/stats is a bit beyond me

@HuwCampbell if you're around, would you mind having a look at this?

@hcampbell-tyro

@HuwCampbell if you're around, would you mind having a look at this?

I'll try and have a look before the weekend.

@HuwCampbell
Member

Sorry, I don't think the statistical reasoning is correct; you appear to have mixed different metrics inappropriately.

I personally would not have gone with a Wilson metric (but whatever, really) - I would have used a Beta distribution prior to approximate the confidence interval.

On a user experience note, I think the conditionals would be better off done a different way as well.

Your example has this

prop_without_confidence :: Property
prop_without_confidence =
  withConfidence (10^9) . withTests 1000000 . property $ do
    number <- forAll (Gen.int $ Range.linear 1 10)
    cover 60 "number == 1" $ number == 1

To me this says, run the test at the very least 1000000 times (because I want to check it that well) while also ensuring confidence of 10^9.

But the way you've done the short circuit breaks that minimum test requirement - it could essentially be over in 50 tests, which isn't cool IMO.

@felixmulder
Contributor Author

When it comes to the number of tests run - we can simply change the implementation to adhere to withTests X and present the certainty after the run. We might also want to have a mode to run "until confidence fulfilled". WDYT?

The current implementation simply mimics what QC does for its confidence tests - totally open to changing that.

When it comes to the Wilson score - from what I can tell, it's an improvement over the normal approximation, especially with smaller sample sizes. But I'm no expert. Since it's "whatever" for you - I propose we keep it as is.

In the end, I'm not sure what your review boils down to when saying "I don't think the statistical reasoning is correct".

What changes would you like to see in order for us to make progress on this? :)

@felixmulder
Contributor Author

-- run tests until confidence has been reached, or fail if cannot be reached:
prop_p1 =
  withConfidence (10^9) . property $ do
    ...

-- run at least X tests and present confidence:
prop_p2 =
  withTests X . property $ do
    ...

-- run at least X times and present confidence, failing if test hasn't reached it:
prop_p3 =
  withConfidence (10^9) . withTests X . property $ do
    ...

Example behaviors ☝️

@HuwCampbell
Member

You're absolutely right, that wasn't very helpful - my apologies. I'll try to come back to this tonight and give more concrete thoughts.

Member

@HuwCampbell HuwCampbell left a comment

Sorry about being brusque before.

We seem to have an issue with early stopping, or interim analysis.

It's quite plausible that we'd see the random walk spuriously affecting the results, as we're not taking multiple measurements into account (the metrics include the implicit assumption that we're only testing at the end of the trial).

Most of the literature is through the medical research field, but papers like https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5052936/ might be relevant.

I'll come back to this with some more suggestions later.

(fromIntegral $ unCoverCount labelAnnotation)
(fromIntegral tests)
(1 / fromIntegral (unConfidence confidence))
>= 0.9 * (unCoverPercentage labelMinimum / 100.0)
Member

I'm wary of this 0.9; it seems very strange. Could you please justify this?

Contributor Author

@felixmulder felixmulder Jun 26, 2019

This is just a tolerance - we could include it in the confidence data type. That would be clearer, but it makes the type a bit more annoying to deal with. I'll extract it to a hardcoded variable and we can discuss how to make the API for changing the confidence more ergonomic.
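For the record, a minimal sketch of that extraction, under hypothetical names (boundsTolerance, coverageReached), with CoverPercentage as in Hedgehog.Internal.Property - not the merged code:

-- Require the observed lower bound to clear 90% of the requested
-- coverage rather than the full amount.
boundsTolerance :: Double
boundsTolerance = 0.9

coverageReached :: Double -> CoverPercentage -> Bool
coverageReached lowerBound requested =
  lowerBound >= boundsTolerance * (unCoverPercentage requested / 100)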

Member

@moodmosaic moodmosaic Aug 8, 2022

@felixmulder, @HuwCampbell, perhaps there's something off with the above calculation? See test below:

prop_with_confidence :: Property
prop_with_confidence =
  verifiedTermination . withConfidence (10^9) . withTests 1000000 . property $ do
    number <- forAll (Gen.int $ Range.constant 1 2)
    cover 50 "number == 1" $ number == 1
    cover 50 "number == 2" $ number == 2

This will run forever, even though the actual coverage has been reached.

Member

I understand that this snippet is outdated, but to me it looks like this is around where the issue happens.

confidenceSuccess and confidenceFailure

Member

The "whole" confidence interval must be greater than the desired coverage. Here, the more tests we run just shrinks the interval from like 49.9% to 50.1% to 49.99% to 50.01%.

With more tests we just get more 9s.

Put another way, the cover query states that the lower bound must be greater than the desired coverage, which it's clearly not. The 49.99999% is never getting above 50%.
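A quick way to see this, assuming the wilsonLowerBound from this PR is in scope (a throwaway sketch, not part of the test suite): with exactly half the samples matching, the lower bound creeps towards 0.5 as the sample size grows but never crosses it.

main :: IO ()
main =
  -- Print the lower bound at a 10^-9 acceptance for growing sample
  -- sizes, always with an empirical rate of exactly 50%.
  mapM_ (\n -> print (wilsonLowerBound (n `div` 2) n (1 / 10^9)))
    [10^3, 10^4, 10^5, 10^6]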

Member

It almost looks as if QuickCheck's checkCoverage (when using cover) is far less stringent, in order to improve performance.

Here's a side-by-side comparison of Hedgehog and QuickCheck with regards to this. If I use the exact same percentage, Hedgehog runs forever, while QuickCheck terminates.

@HuwCampbell, I would be very much interested in your thoughts on this. Thank you.


wilsonLowerBound :: Integer -> Integer -> Double -> Double
wilsonLowerBound k n a =
  wilson k n $ invnormcdf (a / 2)
Member

I don't think this is the right calling convention for wilson. z is the probit for the error bounds, and is included in the denominator of wilson as z^2. I think you still need to use 1 - a / 2, but with a negative sign, in wilson, not a plus.

This is throwing off the lower bound by quite a margin, especially with confidence intervals around 10^9.

Comparing with a beta distribution with alpha 5, beta 10, and 10^7, it's 0.045 when it should be more like 0.008.


@HuwCampbell
Member

It's quite plausible that we'd see the random walk spuriously affecting the results, as we're not taking multiple measurements into account (the metrics include the implicit assumption that we're only testing at the end of the trial).

It looks like QuickCheck somewhat sidesteps this by only running the statistical test after tests 200, 400, 800, etc.

I had a look at QC's code... I'll think about it harder but I do believe they've misimplemented their lower bound calculation as well.

@HuwCampbell
Member

Ok, I believe I have figured out what's going on.

I still think the way the Wilson function is being called is incorrect (or at least extremely counter-intuitive), but the symmetry of invnormcdf around 0.5 cancels it out completely.

I would personally like to see the type signature changed to

wilsonBounds :: Integer -> Integer -> Double -> (Double, Double)

with the plus/minus made explicit.
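For reference, a minimal sketch of that signature - the textbook Wilson score interval, reusing the invnormcdf already in this PR; illustrative rather than the merged code:

wilsonBounds :: Integer -> Integer -> Double -> (Double, Double)
wilsonBounds k n a =
  let
    p = fromIntegral k / fromIntegral n
    n' = fromIntegral n
    -- z is the probit of the (symmetric) error bound.
    z = invnormcdf (1 - a / 2)
    midpoint = p + z * z / (2 * n')
    offset = z * sqrt (p * (1 - p) / n' + z * z / (4 * n' * n'))
    denominator = 1 + z * z / n'
  in
    ((midpoint - offset) / denominator, (midpoint + offset) / denominator)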

After that if we tackle the early stopping problem I think this will be in good shape.

@felixmulder
Contributor Author

Cool! Thanks for the comments @HuwCampbell - I've been pretty busy, but I guess we're in no rush. I'll try to get to this some time next week :)

@felixmulder
Contributor Author

@HuwCampbell: took care of the early stops; now it works as advertised above. I.e.:

  • If only withTests n is set, exactly n tests will be run
  • If only withConfidence c is set, tests will be run until the confidence is met or known to never be met
  • If both withTests n and withConfidence c are used, at least n tests will be run, but the suite will continue until the confidence is reached or deemed to be unreachable

With regards to changing the signature to:

wilsonBounds :: Integer -> Integer -> Double -> (Double, Double)

I'm trying to grok how you'd like that to look - would the returned tuple be the lower and upper bounds? If so, the implementation doesn't really use them together (hence the split into wilsonLowerBound and wilsonUpperBound).

However, I made the negative sign explicit :)

@HuwCampbell
Member

I'm trying to grok how you'd like that to look - would the returned tuple be the lower and upper bounds? If so, the implementation doesn't really use them together (hence the split into wilsonLowerBound and wilsonUpperBound).

Yes, that's what I was thinking. I don't think it matters that they're not used together - Haskell is lazy, so it won't do extra work.

@felixmulder
Contributor Author

Haskell is lazy, so it won't do extra work.

Fair point 😄

Should be gtg now then!

@felixmulder
Contributor Author

Ping pong @HuwCampbell :)

@patrickt
Contributor

patrickt commented Aug 6, 2019

Would love to see this land! ❤️

@felixmulder felixmulder force-pushed the statistically-verify-result branch from 3d3b794 to edb9fcc on September 26, 2019 20:05
@felixmulder felixmulder force-pushed the statistically-verify-result branch from edb9fcc to 6ae62a0 on September 26, 2019 20:09
@felixmulder
Contributor Author

felixmulder commented Sep 26, 2019

Alright - CI seems happy now.

The functionality now behaves like this:

  • withConfidence adds confidence checking to the result of a test run (but obeys the minimum test count set by withTests)
  • verifiedTermination will run at least 100 tests and then check, every 100 tests, whether success or failure has been verified; this means it might run fewer tests than specified with withTests, but at least 100 for sure

I recognize that verifiedTermination might not be the best name. But if you want to change it - please just amend the PR and merge it 💃
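For illustration, a hypothetical usage of the final behavior (the generator and the coverage target are made up; verifiedTermination, withConfidence and cover are the combinators from this PR):

{-# LANGUAGE OverloadedStrings #-}
import Hedgehog
import qualified Hedgehog.Gen as Gen
import qualified Hedgehog.Range as Range

-- Runs in batches of 100 tests and stops once the ~20% true rate of
-- "number <= 2" is verified to exceed the requested 10% with
-- confidence 10^9 (or aborts if that can never happen).
prop_verified :: Property
prop_verified =
  verifiedTermination . withConfidence (10^9) . property $ do
    number <- forAll (Gen.int $ Range.constant 1 10)
    cover 10 "number <= 2" $ number <= 2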

Member

@jacobstanley jacobstanley left a comment

🥇

@jacobstanley jacobstanley merged commit 05a99d2 into hedgehogqa:master Oct 10, 2019
@felixmulder
Contributor Author

Oh my 😭 😄 🎉

@felixmulder felixmulder deleted the statistically-verify-result branch October 10, 2019 18:58
@patrickt
Contributor

Yay!

-- | Configuration for a property test.
--
data PropertyConfig =
  PropertyConfig {
      propertyTestLimit :: !TestLimit
Contributor

removing this field broke the API between 1.0.1 and 1.0.2 😞

Contributor Author

Where's MIMA when you need it 😢

Member

@moodmosaic moodmosaic Mar 17, 2020

Was it removed by accident? Also—it's time to start adding some more automated tests.

Member

I think everything in Hedgehog.Internal.Property is considered internal.

Member

Was it removed for a reason though? (We do try not to break other people's code if possible.)
