
Distribution of floats() has regressed its ability to find bugs since 1.11.0 #469

Closed
alexwlchan opened this issue Feb 19, 2017 · 7 comments
Labels
enhancement it's not broken, but we want it to be better

Comments

@alexwlchan
Contributor

Via @pjdelport in IRC:

There’s an example in an old PyCon UK talk that would reliably find a failing test case in Hypothesis 1.11.0:

import math
from hypothesis import given, assume, example
from hypothesis.strategies import lists, floats


def mean(xs):
    return sum(x / len(xs) for x in xs)

@given(lists(floats(), min_size=1))
def test_mean(xs):
    assume(not any(math.isnan(x) or math.isinf(x) for x in xs))
    assert min(xs) <= mean(xs) <= max(xs)
xs = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

    @example([1.0] * 6)
    @given(lists(floats(), min_size=1))
    def test_mean(xs):
        assume(not any(math.isnan(x) or math.isinf(x) for x in xs))
>       assert min(xs) <= mean(xs) <= max(xs)
E       assert 1.0 <= 0.9999999999999999
E        +  where 1.0 = min([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
E        +  and   0.9999999999999999 = mean([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

means.py:17: AssertionError
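The arithmetic behind that traceback can be checked directly: each `1.0 / 6` term carries a rounding error, and six additions leave the total one ulp below 1.0 (a standalone sketch of the `mean()` above):

```python
def mean(xs):
    # Same divide-then-sum definition as in the test above.
    return sum(x / len(xs) for x in xs)

xs = [1.0] * 6
m = mean(xs)
# Each 1.0/6 is rounded, and six additions leave the sum just below 1.0,
# so min(xs) <= m fails.
assert m == 0.9999999999999999
assert not (min(xs) <= m <= max(xs))
```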

If you run this example on Hypothesis 3.6.1, most of the time it won’t trigger a failure - and when it does, the example isn’t as nice (xs=[5e-324, 5e-324] has been found on three independent systems).

According to the changelog, the distribution of floats() changed between 1.11.0 and 1.11.1 – and I can’t trigger the example in 1.11.1.

It would be good to understand what caused this example to fall off the radar – and in particular, whether there are other floating-point bugs that are no longer being caught.

@Zac-HD
Member

Zac-HD commented Mar 23, 2017

I think you're actually identifying two separate issues:

  1. The simplified example is harder to understand

Minimisation is hard, etc. - but I actually think the xs=[5e-324, 5e-324] example is clearer. With ones, you have to understand a buildup of imprecision; with large values you have to recognise overflow to infinity. Basically, I think the 'shorter lists are better' heuristic is a good one.

E       assert inf <= 5e-324
E        +  where 5e-324 = min([5e-324, 5e-324])
E        +  and   inf = mean([5e-324, 5e-324])

This doesn't look like an issue to me.
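Incidentally, with the divide-then-sum `mean()` from the original report, this particular two-element pair fails through underflow rather than overflow: 5e-324 is the smallest positive subnormal double, and halving it rounds to zero (a quick sketch, assuming that `mean()` definition):

```python
def mean(xs):
    # Divide-then-sum, as in the original report.
    return sum(x / len(xs) for x in xs)

xs = [5e-324, 5e-324]  # 5e-324 is the smallest positive subnormal double
# 5e-324 / 2 lies exactly halfway between 0.0 and 5e-324;
# round-half-to-even picks 0.0, so each term underflows.
assert 5e-324 / 2 == 0.0
assert mean(xs) == 0.0
assert not (min(xs) <= mean(xs) <= max(xs))
```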

  2. Failing examples are much less common than in earlier versions

While we obviously want failing examples, I think the effectiveness of the distribution of examples is influenced more by the test function than the distribution. If the new distribution is better at finding logic bugs but worse at finding floating-point edge cases, IMO that's a net win. Of course, it would be best to do both, so the real question is 'what parts of the floating-point distribution are not being explored any more?'

@DRMacIver
Member

Yeah, now that I think about it (@alexwlchan and I had talked about this before filing), the failing test case presented is actually absolutely correct by modern Hypothesis's heuristic that less data is always better. It was correct under the old heuristics as well, but for historical reasons old Hypothesis would never have found that shrink.

RE the distribution: it would not surprise me to learn that the distribution of floating-point numbers has got worse in some manner, but the problem is that we don't currently have any good empirical data about what sorts of floating-point bugs people actually care about or how to trigger them, so the distribution is largely guesswork. I'm definitely happy to treat finding this bug less reliably as at least suggestive of an error, if not concrete proof.

@Zac-HD
Member

Zac-HD commented Sep 8, 2017

Closed by #816?

@DRMacIver
Member

The minimization-quality part is. I don't think the bug-finding part is, though it might be - integer-valued floats are now significantly more likely, which I've only just realised I should have highlighted - but I don't know if that's enough.

@Zac-HD
Member

Zac-HD commented Oct 4, 2017

from hypothesis import given
from hypothesis.strategies import floats, lists


@given(lists(floats(allow_infinity=False, allow_nan=False, min_value=1), min_size=1))
def test_mean(xs):
    mean = sum(x / len(xs) for x in xs)
    assert min(xs) <= mean <= max(xs), mean

With Hypothesis 3.31.2, this code always finds the [1.0] * 6 counterexample - but if the min_value=1 is removed, it doesn't. Setting min_value=0.9 finds a list of 0.9s as the falsifying example; setting it to zero finds no falsifying example at all.

@Zac-HD
Member

Zac-HD commented Oct 19, 2018

I haven't done a full sensitivity analysis, but in practice to trigger this assertion we have to generate a list of length >=6 consisting of a single repeated floating-point value. The following test reliably finds the [1.0] * 6 example as of Hypothesis 3.78.0, and is numerically equivalent:

from hypothesis import given
from hypothesis.strategies import floats, integers


@given(integers(1, 100), floats(allow_infinity=False, allow_nan=False, min_value=1))
def test_mean_effective(n, x):
    mean = sum(x / n for _ in range(n))
    assert x == mean

So implementing swarm testing (#1637) will fix this, but anything less probably won't.

@Zac-HD Zac-HD added the enhancement it's not broken, but we want it to be better label Oct 19, 2018
@Zac-HD
Member

Zac-HD commented Jul 9, 2019

Closing this issue because I don't think there's a good fix short of swarm testing, and we have #1637 for that.
