
Statistically verify result #288

Merged

Conversation

felixmulder
Contributor

This PR adds the ability to statistically verify that a result is correct. It does not yet add the ability to cancel running tests if they can never reach the specified coverage.

@felixmulder felixmulder force-pushed the statistically-verify-result branch from 62e34ec to e5ce474 on May 11, 2019 16:19
@felixmulder
Contributor Author

Um - turned out to be trivial to add support for aborting a test that cannot satisfy coverage. So I added that 🎉

"Test coverage cannot be reached, aborted"
Nothing
[]

else if tests >= fromIntegral (propertyTestLimit cfg) then
-- we've hit the test limit
Contributor Author

One thing to note here is that just as we can stop early if we'll never reach the desired coverage - we can also stop early if we've already reached it!

I guess that is sort of a design decision - should we stop if we've already reached the desired coverage?

Contributor Author

@felixmulder felixmulder May 11, 2019

Actually - it turned out there was a bug hiding if we didn't do this. Fixed it - I'll give this a rest now (oh, and the build failure above was spurious, due to not being able to pull packages).

@felixmulder felixmulder force-pushed the statistically-verify-result branch from e1d73b9 to 3efa763 on May 11, 2019 17:39
-- for 1 in 10^9 tests.
newtype Confidence =
  Confidence {
      unConfidence :: Integer
Member

Do you envision using this with confidence greater than 10^18? (Int64)

Contributor Author

QuickCheck gives a pretty good explanation of this in their docs:

If you are using checkCoverage as part of a test suite, you should be careful not to set certainty too low. If you want, say, a 1% chance of a false positive during a project's lifetime, then certainty should be set to at least 100 * m * n, where m is the number of uses of cover in the test suite, and n is the number of times you expect the test suite to be run during the project's lifetime. The default value is chosen to be big enough for most projects.

So for instance - let's say we're writing a big project with 100 occurrences of cover, then we'd get:

100 * 100 * n = confidence

Let's say that we expect to run the project 20 times per day per developer with 5 developers. That would make it:

100 * 100 * (20 * 5 * daysWithoutFalsePositives) = confidence
<=>
1 000 000 * daysWithoutFalsePositives = confidence

Let's say we want 5 years with 1% chance of false positives:

1 000 000 * daysWithoutFalsePositives = confidence
<=>
1 000 000 * 365 * 5 = 1 825 000 000

Which is veeeeeeery roughly 2*10^9. So yeah, maybe not. Would you rather use Int64?
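For concreteness, a throwaway sketch of that guideline (recommendedConfidence is a hypothetical helper for illustration, not part of this PR):

-- certainty >= 100 * m * n gives roughly a 1% lifetime chance of a
-- false positive, where m is the number of cover uses and n is the
-- expected number of test-suite runs over the project's lifetime.
recommendedConfidence :: Integer -> Integer -> Integer
recommendedConfidence coverUses suiteRuns =
  100 * coverUses * suiteRuns

-- The example above: 100 cover uses, 5 devs * 20 runs/day, 5 years.
-- recommendedConfidence 100 (20 * 5 * 365 * 5) == 1825000000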

Member

Yeah we're talking age of the universe scales at 10^18.

@jacobstanley
Member

📈 All looks pretty legit, although the math/stats is a bit beyond me

@HuwCampbell if you're around, would you mind having a look at this?

@hcampbell-tyro

@HuwCampbell if you're around, would you mind having a look at this?

I'll try and have a look before the weekend.

@HuwCampbell
Member

Sorry, I don't think the statistical reasoning is correct; you appear to have mixed different metrics inappropriately.

I personally would not have gone with a Wilson metric (but whatever, really) - I would have used a Beta distribution prior to approximate the confidence interval.

On a user experience note, I think the conditionals would be better off done a different way as well.

Your example has this

prop_without_confidence :: Property
prop_without_confidence =
  withConfidence (10^9) . withTests 1000000 . property $ do
    number <- forAll (Gen.int $ Range.linear 1 10)
    cover 60 "number == 1" $ number == 1

To me this says, run the test at the very least 1000000 times (because I want to check it that well) while also ensuring confidence of 10^9.

But the way you've done the short circuit breaks that minimum test requirement - it could essentially be over in 50 tests, which isn't cool IMO.

@felixmulder
Contributor Author

When it comes to the number of tests run - we can simply change the implementation to adhere to withTests X and present the certainty after the run. We might also want to have a mode to run "until confidence fulfilled". WDYT?

The current implementation simply mimics what QC does for its confidence tests - totally open to changing that.

When it comes to the Wilson score - from what I can tell, it's an improvement over the normal approximation, especially with smaller sample sizes. But I'm no expert. Since it's "whatever" for you - I propose we keep it as is.

In the end, I'm not sure what your review boils down to when saying "I don't think the statistical reasoning is correct".

What changes would you like to see in order for us to make progress on this? :)

@felixmulder
Contributor Author

-- run tests until confidence has been reached, or fail if cannot be reached:
prop_p1 =
  withConfidence (10^9) . property $ do
    ...

-- run at least X tests and present confidence:
prop_p2 =
  withTests X . property $ do
    ...

-- run at least X times and present confidence, failing if test hasn't reached it:
prop_p3 =
  withConfidence (10^9) . withTests X . property $ do
    ...

Example behaviors ☝️

@HuwCampbell
Member

You're absolutely right, that wasn't very helpful - my apologies. I'll try to come back to this tonight and give more concrete thoughts.

Member

@HuwCampbell HuwCampbell left a comment

Sorry about being brusque before.

We seem to have an issue with early stopping, or interim analysis.

It's quite plausible that we'd see the random walk spuriously affecting the results, as we're not taking multiple measurements into account (the metrics include the implicit assumption that we're only testing at the end of the trial).

Most of the literature is through the medical research field, but papers like https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5052936/ might be relevant.

I'll come back to this with some more suggestions later.

(fromIntegral $ unCoverCount labelAnnotation)
(fromIntegral tests)
(1 / fromIntegral (unConfidence confidence))
>= 0.9 * (unCoverPercentage labelMinimum / 100.0)
Member

I'm wary of this 0.9; it seems very strange. Could you please justify this?

Contributor Author

@felixmulder felixmulder Jun 26, 2019

This is just a tolerance - we could include it in the confidence data type. That would be clearer, but it makes the type a bit more annoying to deal with. I'll extract it to a hardcoded variable and we can discuss how to make the API for changing the confidence more ergonomic.
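For the record, a minimal sketch of that extraction, under hypothetical names (boundsTolerance, coverageReached), with CoverPercentage as in Hedgehog.Internal.Property - not the merged code:

-- Require the observed lower bound to clear 90% of the requested
-- coverage rather than the full amount.
boundsTolerance :: Double
boundsTolerance = 0.9

coverageReached :: Double -> CoverPercentage -> Bool
coverageReached lowerBound requested =
  lowerBound >= boundsTolerance * (unCoverPercentage requested / 100)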

Member

@moodmosaic moodmosaic Aug 8, 2022

@felixmulder, @HuwCampbell, perhaps there's something off with the above calculation? See test below:

prop_with_confidence :: Property
prop_with_confidence =
  verifiedTermination . withConfidence (10^9) . withTests 1000000 . property $ do
    number <- forAll (Gen.int $ Range.constant 1 2)
    cover 50 "number == 1" $ number == 1
    cover 50 "number == 2" $ number == 2

This will run forever, even though the actual coverage has been reached.

Member

I understand that this snippet is outdated, but to me it looks like this is around where the issue happens.

confidenceSuccess and confidenceFailure

Member

The "whole" confidence interval must be greater than the desired coverage. Here, the more tests we run just shrinks the interval from like 49.9% to 50.1% to 49.99% to 50.01%.

With more tests we just get more 9s.

Put another way, the cover query states that the lower bound must be greater than the desired coverage, which it's clearly not. The 49.99999% is never getting above 50%.
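A quick way to see this, assuming the wilsonLowerBound from this PR is in scope (a throwaway sketch, not part of the test suite): with exactly half the samples matching, the lower bound creeps towards 0.5 as the sample size grows but never crosses it.

main :: IO ()
main =
  -- Print the lower bound at a 10^-9 acceptance for growing sample
  -- sizes, always with an empirical rate of exactly 50%.
  mapM_ (\n -> print (wilsonLowerBound (n `div` 2) n (1 / 10^9)))
    [10^3, 10^4, 10^5, 10^6]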

Member

It almost looks as if QuickCheck's checkCoverage (when using cover) is far less stringent, in order to improve performance.

Here's a side-by-side comparison of Hedgehog and QuickCheck with regards to this. If I use the exact same percentage, Hedgehog runs forever, while QuickCheck terminates.

@HuwCampbell, I would be very much interested in your thoughts on this. Thank you.


wilsonLowerBound :: Integer -> Integer -> Double -> Double
wilsonLowerBound k n a =
  wilson k n $ invnormcdf (a / 2)
Member

I don't think this is the right calling convention for wilson. z is the probit for the error bounds, and is included in the denominator of wilson as z^2. I think you still need to use 1 - a / 2, but with a negative sign, in wilson, not a plus.

This is throwing off the lower bound by quite a margin, especially with confidence intervals around 10^9.

Comparing with a beta distribution with alpha 5, beta 10, and 10^7, it's 0.045 when it should be more like 0.008.


@HuwCampbell
Member

It's quite plausible that we'd see the random walk spuriously affecting the results, as we're not taking multiple measurements into account (the metrics include the implicit assumption that we're only testing at the end of the trial).

It looks like QuickCheck somewhat sidesteps this by only running the statistical test after tests 200, 400, 800, etc.

I had a look at QC's code... I'll think about it harder but I do believe they've misimplemented their lower bound calculation as well.

@HuwCampbell
Member

Ok, I believe I have figured out what's going on.

I still think the way the Wilson function is being called is incorrect (or at least extremely counter-intuitive), but the symmetry of invnormcdf around 0.5 cancels it out completely.

I would personally like to see the type signature changed to

wilsonBounds :: Integer -> Integer -> Double -> (Double, Double)

with the plus/minus made explicit.
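For reference, a minimal sketch of that signature - the textbook Wilson score interval, reusing the invnormcdf already in this PR; illustrative rather than the merged code:

wilsonBounds :: Integer -> Integer -> Double -> (Double, Double)
wilsonBounds k n a =
  let
    p = fromIntegral k / fromIntegral n
    n' = fromIntegral n
    -- z is the probit of the (symmetric) error bound.
    z = invnormcdf (1 - a / 2)
    midpoint = p + z * z / (2 * n')
    offset = z * sqrt (p * (1 - p) / n' + z * z / (4 * n' * n'))
    denominator = 1 + z * z / n'
  in
    ((midpoint - offset) / denominator, (midpoint + offset) / denominator)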

After that if we tackle the early stopping problem I think this will be in good shape.

@felixmulder
Contributor Author

Cool! Thanks for the comments @HuwCampbell - I've been pretty busy, but I guess we're in no rush. I'll try to get to this some time next week :)

@felixmulder
Contributor Author

@HuwCampbell: took care of the early stops; now it works as advertised above. I.e.:

  • If only withTests n is set, exactly n tests will be run
  • If only withConfidence c is set, tests will be run until the confidence is met or known to never be met
  • If both withTests n and withConfidence c are used, at least n tests will be run, but the suite will continue until the confidence is reached or deemed to be unreachable

With regards to changing the signature to:

wilsonBounds :: Integer -> Integer -> Double -> (Double, Double)

I'm trying to grok how you'd like that to look - would the returned tuple be the lower and upper bounds? If so, the implementation doesn't really use them together (hence the split into wilsonLowerBound and wilsonUpperBound).

However, I made the negative sign explicit :)

@HuwCampbell
Member

I'm trying to grok how you'd like that to look - would the returned tuple be the lower and upper bounds? If so, the implementation doesn't really use them together (hence the split into wilsonLowerBound and wilsonUpperBound).

Yes, that's what I was thinking. I don't think it matters that they're not used together - Haskell is lazy, so it won't do extra work.

@felixmulder
Contributor Author

Haskell is lazy, so it won't do extra work.

Fair point 😄

Should be gtg now then!

@felixmulder
Contributor Author

Ping pong @HuwCampbell :)

@patrickt
Contributor

patrickt commented Aug 6, 2019

Would love to see this land! ❤️

@felixmulder felixmulder force-pushed the statistically-verify-result branch from 3d3b794 to edb9fcc on September 26, 2019 20:05
@felixmulder felixmulder force-pushed the statistically-verify-result branch from edb9fcc to 6ae62a0 on September 26, 2019 20:09
@felixmulder
Contributor Author

felixmulder commented Sep 26, 2019

Alright - CI seems happy now.

The functionality now behaves like this:

  • withConfidence adds confidence checking to the result of a test run (but obeys the minimum test count set by withTests)
  • verifiedTermination will run at least 100 tests and then check, every 100 tests, whether success or failure has been verified; this means it might run fewer tests than specified with withTests, but at least 100 for sure

I recognize that verifiedTermination might not be the best name. But if you want to change it - please just amend the PR and merge it 💃
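For illustration, a hypothetical usage of the final behavior (the generator and the coverage target are made up; verifiedTermination, withConfidence and cover are the combinators from this PR):

{-# LANGUAGE OverloadedStrings #-}
import Hedgehog
import qualified Hedgehog.Gen as Gen
import qualified Hedgehog.Range as Range

-- Runs in batches of 100 tests and stops once the ~20% true rate of
-- "number <= 2" is verified to exceed the requested 10% with
-- confidence 10^9 (or aborts if that can never happen).
prop_verified :: Property
prop_verified =
  verifiedTermination . withConfidence (10^9) . property $ do
    number <- forAll (Gen.int $ Range.constant 1 10)
    cover 10 "number <= 2" $ number <= 2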

Member

@jacobstanley jacobstanley left a comment

🥇

@jacobstanley jacobstanley merged commit 05a99d2 into hedgehogqa:master Oct 10, 2019
@felixmulder
Contributor Author

Oh my 😭 😄 🎉

@felixmulder felixmulder deleted the statistically-verify-result branch October 10, 2019 18:58
@patrickt
Contributor

Yay!

-- | Configuration for a property test.
--
data PropertyConfig =
  PropertyConfig {
      propertyTestLimit :: !TestLimit
Contributor

removing this field broke the API between 1.0.1 and 1.0.2 😞

Contributor Author

Where's MIMA when you need it 😢

Member

@moodmosaic moodmosaic Mar 17, 2020

Was it removed by accident? Also—it's time to start adding some more automated tests.

Member

I think everything in Hedgehog.Internal.Property is considered internal.

Member

Was it removed for a reason though? (We do try not to break other people's code if possible.)
