constrain generation better to filter less? #7

Open
carljm opened this issue May 19, 2023 · 2 comments

Comments

@carljm (Owner) commented May 19, 2023:

We still filter out a lot of generated samples as non-compilable or not containing a listcomp. If we could constrain generation better and filter less (without excluding interesting examples), it would make the fuzzer run a lot faster.
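As a generic illustration of that trade-off (not the fuzzer's actual strategies): a filtered strategy throws away whole draws after generation, while a constrained one only ever produces valid values.

```python
from hypothesis import strategies as st

# Generic illustration only, not the fuzzer's real strategies.
# Filtering: roughly half the draws are rejected after the fact.
evens_by_filter = st.integers().filter(lambda n: n % 2 == 0)

# Constraining: every draw is already valid, so nothing is discarded.
evens_by_construction = st.integers().map(lambda n: n * 2)
```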

@JelleZijlstra (Collaborator) commented:

I looked at this a little bit (by running it and printing out all the rejected code). We actually don't generate that much code that doesn't compile; there's occasionally some with an invalid nonlocal or walrus. However, we generate a lot of code that doesn't contain listcomps, often because it contains lots of global and nonlocal statements instead.
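For reference, here are constructed examples (not actual fuzzer output) of those two rejection reasons; each one raises SyntaxError at compile time:

```python
# Constructed illustrations, not actual fuzzer output.
for src in (
    "def f():\n    nonlocal x\n",    # SyntaxError: no binding for nonlocal 'x' found
    "[i := 0 for i in range(3)]",    # SyntaxError: cannot rebind comprehension iteration variable
):
    try:
        compile(src, "<fuzz>", "exec")
    except SyntaxError as exc:
        print(exc)
```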

I tried locally to improve things by constraining the presence of nonlocal and global more (so there are fewer statements like nonlocal __a, __b), but it seems to have made things worse.
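One hedged sketch of what constraining (rather than banning) global/nonlocal could look like, using placeholder statement pools instead of the fuzzer's real strategies:

```python
from hypothesis import strategies as st

# Placeholder statement pools, standing in for the fuzzer's real strategies.
scope_statements = st.sampled_from(["global __a", "nonlocal __a"])
other_statements = st.sampled_from(["x = 1", "y = [i for i in range(3)]"])

@st.composite
def statement(draw):
    # Roughly one draw in ten is a global/nonlocal statement, so they still
    # show up occasionally but no longer crowd out listcomps.
    if draw(st.integers(0, 9)) == 0:
        return draw(scope_statements)
    return draw(other_statements)
```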

@carljm (Owner, Author) commented May 20, 2023:

I think I've identified the problem here, but haven't identified the solution yet. I made some changes to try to generate fewer invalid examples (see #8), but I still get hypothesis stats looking like this (on a run of just max_examples=100):

test_fuzz_comps.py::test_comprehension:

  - during reuse phase (0.64 seconds):
    - Typical runtimes: ~ 25-101 ms, of which ~ 18-78 ms in data generation
    - 3 passing examples, 0 failing examples, 7 invalid examples

  - during generate phase (117.75 seconds):
    - Typical runtimes: ~ 27-219 ms, of which ~ 28-221 ms in data generation
    - 8 passing examples, 0 failing examples, 982 invalid examples
    - Events:
      * 0.20%, Retried draw from builds(Module, ListStrategy(one_of(classes(), functions()), min_size=1, max_size=1)).filter(compilable) to satisfy filter
      * 0.20%, not compilable

  - Highest target scores:
                   0  (label='(modules) number of lambdas')
                   1  (label='(modules) number of classes')
                   1  (label='(modules) number of functions')
                   3  (label='(modules) number of listcomps')
  - Stopped because settings.max_examples=100, but < 10% of examples satisfied assumptions
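For context, the "Highest target scores" section of a report like this comes from Hypothesis's target() hook, and the "Events" section from event() plus some automatically recorded events (such as the retried-draw one); the report itself can be printed with pytest --hypothesis-show-statistics. A minimal sketch of that kind of instrumentation, with compilable and count_listcomps as placeholder helpers rather than the fuzzer's real ones:

```python
import ast
from hypothesis import event, target

def compilable(source: str) -> bool:
    # Placeholder helper: does the source compile at all?
    try:
        compile(source, "<fuzz>", "exec")
        return True
    except SyntaxError:
        return False

def count_listcomps(source: str) -> int:
    # Placeholder helper: count ListComp nodes in the parsed source.
    return sum(isinstance(node, ast.ListComp) for node in ast.walk(ast.parse(source)))

def record_stats(source: str) -> None:
    # Called from inside the Hypothesis test: event() feeds the "Events"
    # section of the report, target() feeds "Highest target scores".
    if not compilable(source):
        event("not compilable")
        return
    target(float(count_listcomps(source)), label="(modules) number of listcomps")
```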

You can see from the events (and I confirmed by printing uncompilable examples) that we now hardly filter anything due to incompilability (and we filter out nothing due to missing listcomps), and yet we still have ~98% invalid examples. Why?

I turned on debug logging in Hypothesis (pytest --hypothesis-verbosity=debug -s), and during the times when we aren't generating any valid samples, I see tons of debug output lines like this:

109 bytes [[0, [0, [[1, 1], 0], [1, [2, [1, [[1, 0], [1, 0], 0], [[1, 1], 0], [1, [3, [0, [[1, 1], 0], [[[3, 2, 1], [1, [5, 0]]], [[3, 2, 1], [3, [1, [[1, 1], 0], [[3, 0], [0, [[[1, [0, 0]], 0], [3, [1, [6, [2, [[3, 1], [[3, 2], [[4, [[0, [13, 3]], [[[2, 1], 1], [[2, 1], 1], 0]]], [1, [[0, 0], [1, [1, [[[3, 0], 1], [6, [[0, 4], [1, [[0, 1], [1, [1, [[3, 2], [[2, [[1, [[[3, 0], 0], [1, 0]]], [0, [11, 6]]]], [1, [[0, 1], [3, [[4, [[1, 1], [3, 2, 0]]], [1, [1, [[0, 1], [[7, 2], [2, [7, 3]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] -> Status.INVALID,

I dug into hypothesis's source code, and some extra debug prints confirmed that each of those lines corresponds to hitting this case: https://github.com/HypothesisWorks/hypothesis/blob/master/hypothesis-python/src/hypothesis/internal/conjecture/data.py#L941

So it seems like internally, hypothesis has a hard-coded max depth of 100 (https://github.com/HypothesisWorks/hypothesis/blob/master/hypothesis-python/src/hypothesis/internal/conjecture/data.py#L729), and when you hit that max depth, it starts silently discarding samples, in a way that's effectively invisible in the stats.

I expect the reason for this is that our strategies are too deeply recursive, but I'm not sure how to fix it. Maybe using strategies.recursive in some places would help, although I'm not sure how to apply it to the mutual recursion of two different strategies. I might just want to increase the max depth and let it generate deeper trees, but it seems that would require patching hypothesis.
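As a rough sketch (with made-up dataclasses standing in for the fuzzer's real node types): strategies.recursive bounds overall size via max_leaves for a self-recursive strategy, and strategies.deferred lets two mutually recursive strategies refer to each other lazily, although it does not bound depth by itself:

```python
from dataclasses import dataclass
from hypothesis import strategies as st

# Made-up node types standing in for the fuzzer's real ones.
@dataclass
class Function:
    body: list

@dataclass
class Class:
    body: list

# st.recursive: max_leaves caps the overall size of a self-recursive strategy.
expressions = st.recursive(
    st.integers(),                                    # base case
    lambda children: st.lists(children, max_size=3),  # recursive case
    max_leaves=20,
)

# st.deferred: the lambdas are only evaluated when drawing, so the two
# strategies can reference each other. Depth is still only limited by the
# list sizes (and the chance of drawing an empty or "pass"-only body).
classes = st.deferred(
    lambda: st.builds(Class, body=st.lists(functions | st.just("pass"), max_size=2))
)
functions = st.deferred(
    lambda: st.builds(Function, body=st.lists(classes | st.just("pass"), max_size=2))
)
```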
