constrain generation better to filter less? #7

Open
carljm opened this issue May 19, 2023 · 2 comments

Comments

@carljm (Owner) commented May 19, 2023:

We still filter out a lot of generated samples as non-compilable or not containing a listcomp. If we could constrain generation better and filter less (without excluding interesting examples), it would make the fuzzer run a lot faster.
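As a generic illustration of that trade-off (not the fuzzer's actual strategies): a filtered strategy throws away whole draws after generation, while a constrained one only ever produces valid values.

```python
from hypothesis import strategies as st

# Generic illustration only, not the fuzzer's real strategies.
# Filtering: roughly half the draws are rejected after the fact.
evens_by_filter = st.integers().filter(lambda n: n % 2 == 0)

# Constraining: every draw is already valid, so nothing is discarded.
evens_by_construction = st.integers().map(lambda n: n * 2)
```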

@JelleZijlstra (Collaborator) commented:

I looked at this a little bit (by running it and printing out all the rejected code). We actually don't generate that much code that doesn't compile; there's occasionally some with an invalid nonlocal or walrus. However, we generate a lot of code that doesn't contain listcomps, often because it contains lots of global and nonlocal statements instead.
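For reference, here are constructed examples (not actual fuzzer output) of those two rejection reasons; each one raises SyntaxError at compile time:

```python
# Constructed illustrations, not actual fuzzer output.
for src in (
    "def f():\n    nonlocal x\n",    # SyntaxError: no binding for nonlocal 'x' found
    "[i := 0 for i in range(3)]",    # SyntaxError: cannot rebind comprehension iteration variable
):
    try:
        compile(src, "<fuzz>", "exec")
    except SyntaxError as exc:
        print(exc)
```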

I tried locally to improve things by constraining the presence of nonlocal and global more (so there are fewer statements like nonlocal __a, __b), but it seems to have made things worse.
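One hedged sketch of what constraining (rather than banning) global/nonlocal could look like, using placeholder statement pools instead of the fuzzer's real strategies:

```python
from hypothesis import strategies as st

# Placeholder statement pools, standing in for the fuzzer's real strategies.
scope_statements = st.sampled_from(["global __a", "nonlocal __a"])
other_statements = st.sampled_from(["x = 1", "y = [i for i in range(3)]"])

@st.composite
def statement(draw):
    # Roughly one draw in ten is a global/nonlocal statement, so they still
    # show up occasionally but no longer crowd out listcomps.
    if draw(st.integers(0, 9)) == 0:
        return draw(scope_statements)
    return draw(other_statements)
```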

@carljm (Owner, Author) commented May 20, 2023:

I think I've identified the problem here, but haven't identified the solution yet. I made some changes to try to generate fewer invalid examples (see #8), but I still get hypothesis stats looking like this (on a run of just max_examples=100):

test_fuzz_comps.py::test_comprehension:

  - during reuse phase (0.64 seconds):
    - Typical runtimes: ~ 25-101 ms, of which ~ 18-78 ms in data generation
    - 3 passing examples, 0 failing examples, 7 invalid examples

  - during generate phase (117.75 seconds):
    - Typical runtimes: ~ 27-219 ms, of which ~ 28-221 ms in data generation
    - 8 passing examples, 0 failing examples, 982 invalid examples
    - Events:
      * 0.20%, Retried draw from builds(Module, ListStrategy(one_of(classes(), functions()), min_size=1, max_size=1)).filter(compilable) to satisfy filter
      * 0.20%, not compilable

  - Highest target scores:
                   0  (label='(modules) number of lambdas')
                   1  (label='(modules) number of classes')
                   1  (label='(modules) number of functions')
                   3  (label='(modules) number of listcomps')
  - Stopped because settings.max_examples=100, but < 10% of examples satisfied assumptions
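For context, the "Highest target scores" section of a report like this comes from Hypothesis's target() hook, and the "Events" section from event() plus some automatically recorded events (such as the retried-draw one); the report itself can be printed with pytest --hypothesis-show-statistics. A minimal sketch of that kind of instrumentation, with compilable and count_listcomps as placeholder helpers rather than the fuzzer's real ones:

```python
import ast
from hypothesis import event, target

def compilable(source: str) -> bool:
    # Placeholder helper: does the source compile at all?
    try:
        compile(source, "<fuzz>", "exec")
        return True
    except SyntaxError:
        return False

def count_listcomps(source: str) -> int:
    # Placeholder helper: count ListComp nodes in the parsed source.
    return sum(isinstance(node, ast.ListComp) for node in ast.walk(ast.parse(source)))

def record_stats(source: str) -> None:
    # Called from inside the Hypothesis test: event() feeds the "Events"
    # section of the report, target() feeds "Highest target scores".
    if not compilable(source):
        event("not compilable")
        return
    target(float(count_listcomps(source)), label="(modules) number of listcomps")
```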

You can see from the events (and I confirmed by printing uncompilable examples) that we now hardly filter anything due to incompilability (and we filter out nothing due to missing listcomps), and yet we still have ~98% invalid examples. Why?

I turned on debug logging in Hypothesis (pytest --hypothesis-verbosity=debug -s), and during the times when we aren't generating any valid samples, I see tons of debug output lines like this:

109 bytes [[0, [0, [[1, 1], 0], [1, [2, [1, [[1, 0], [1, 0], 0], [[1, 1], 0], [1, [3, [0, [[1, 1], 0], [[[3, 2, 1], [1, [5, 0]]], [[3, 2, 1], [3, [1, [[1, 1], 0], [[3, 0], [0, [[[1, [0, 0]], 0], [3, [1, [6, [2, [[3, 1], [[3, 2], [[4, [[0, [13, 3]], [[[2, 1], 1], [[2, 1], 1], 0]]], [1, [[0, 0], [1, [1, [[[3, 0], 1], [6, [[0, 4], [1, [[0, 1], [1, [1, [[3, 2], [[2, [[1, [[[3, 0], 0], [1, 0]]], [0, [11, 6]]]], [1, [[0, 1], [3, [[4, [[1, 1], [3, 2, 0]]], [1, [1, [[0, 1], [[7, 2], [2, [7, 3]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] -> Status.INVALID,

I dug into hypothesis's source code, and some extra debug prints confirmed that each of those lines corresponds to hitting this case: https://github.com/HypothesisWorks/hypothesis/blob/master/hypothesis-python/src/hypothesis/internal/conjecture/data.py#L941

So it seems like internally, hypothesis has a hard-coded max depth of 100 (https://github.com/HypothesisWorks/hypothesis/blob/master/hypothesis-python/src/hypothesis/internal/conjecture/data.py#L729), and when you hit that max depth, it starts silently discarding samples, in a way that's effectively invisible in the stats.

I expect the reason for this is that our strategies are too deeply recursive, but I'm not sure how to fix it. Maybe using strategies.recursive in some places would help, although I'm not sure how to apply it to the mutual recursion of two different strategies. I might just want to increase the max depth and let it generate deeper trees, but it seems that would require patching hypothesis.
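As a rough sketch (with made-up dataclasses standing in for the fuzzer's real node types): strategies.recursive bounds overall size via max_leaves for a self-recursive strategy, and strategies.deferred lets two mutually recursive strategies refer to each other lazily, although it does not bound depth by itself:

```python
from dataclasses import dataclass
from hypothesis import strategies as st

# Made-up node types standing in for the fuzzer's real ones.
@dataclass
class Function:
    body: list

@dataclass
class Class:
    body: list

# st.recursive: max_leaves caps the overall size of a self-recursive strategy.
expressions = st.recursive(
    st.integers(),                                    # base case
    lambda children: st.lists(children, max_size=3),  # recursive case
    max_leaves=20,
)

# st.deferred: the lambdas are only evaluated when drawing, so the two
# strategies can reference each other. Depth is still only limited by the
# list sizes (and the chance of drawing an empty or "pass"-only body).
classes = st.deferred(
    lambda: st.builds(Class, body=st.lists(functions | st.just("pass"), max_size=2))
)
functions = st.deferred(
    lambda: st.builds(Function, body=st.lists(classes | st.just("pass"), max_size=2))
)
```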
