
poor observability when hypothesis rejects data due to MAX_DEPTH #3643

Closed
carljm opened this issue May 20, 2023 · 6 comments
Labels: legibility (make errors helpful and Hypothesis grokable)

Comments

carljm (Contributor) commented May 20, 2023

I wrote a "comprehension fuzzer" using hypothesis to fuzz the CPython compiler, specifically on list comprehensions: https://github.com/carljm/compgenerator

This is part of debugging and stabilizing the implementation of PEP 709. The fuzzer uses strategies (inspired by Hypothesmith, but not using it) to generate a valid, constrained Python AST containing comprehensions, and then tests its compilability and output in both a control build and a test build of CPython.
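
For a rough idea of the shape of such strategies, here is a minimal, hypothetical sketch (not the actual compgenerator code) of a recursive Hypothesis strategy that builds list-comprehension ASTs:

```python
import ast

from hypothesis import strategies as st

# Hypothetical, heavily simplified sketch; compgenerator's real strategies
# are much more constrained (functions, classes, lambdas, nested scopes, ...).
names = st.sampled_from(["__a", "__b"]).map(
    lambda n: ast.Name(id=n, ctx=ast.Load())
)

# Leaves are bare names; each inner node wraps drawn expressions in a list
# comprehension, so generated trees can nest arbitrarily deep.
exprs = st.recursive(
    names,
    lambda inner: st.builds(
        lambda elt, it: ast.ListComp(
            elt=elt,
            generators=[
                ast.comprehension(
                    target=ast.Name(id="__a", ctx=ast.Store()),
                    iter=it,
                    ifs=[],
                    is_async=0,
                )
            ],
        ),
        inner,
        inner,
    ),
    max_leaves=10,
)

modules = st.builds(
    lambda e: ast.Module(body=[ast.Expr(value=e)], type_ignores=[]),
    exprs,
)
```

A drawn module can then be compiled in each interpreter under test with something like `compile(ast.fix_missing_locations(m), "<fuzz>", "exec")`.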

I observed that the fuzzer is very slow because it rejects something like 98% of the samples it generates. I've done some work in carljm/compgenerator#8 to try to eliminate the visible reasons for rejection (my own uses of .filter(...)), but I still end up with stats looking like this:

test_fuzz_comps.py::test_comprehension:

  - during reuse phase (1.00 seconds):
    - Typical runtimes: ~ 24-113 ms, of which ~ 13-54 ms in data generation
    - 10 passing examples, 0 failing examples, 0 invalid examples

  - during generate phase (115.09 seconds):
    - Typical runtimes: ~ 28-214 ms, of which ~ 28-215 ms in data generation
    - 3 passing examples, 0 failing examples, 987 invalid examples
    - Events:
      * 0.10%, Retried draw from builds(Module, ListStrategy(one_of(classes(), functions()), min_size=1, max_size=1)).filter(compilable) to satisfy filter
      * 0.10%, not compilable

  - Highest target scores:
                   1  (label='(modules) number of functions')
                   2  (label='(modules) number of classes')
                   2  (label='(modules) number of lambdas')
                   3  (label='(modules) number of listcomps')
  - Stopped because settings.max_examples=100, but < 10% of examples satisfied assumptions

(If it's useful to reproduce this, use carljm/compgenerator#8 and, after `pip install -r frozen.txt` in a Python 3.11 venv, run `pytest --hypothesis-show-statistics --hypothesis-verbosity=debug -s --run-evalserver`. Note that this does evaluate fuzzed AST samples, but the samples are quite constrained: no imports, no calls, no use of any identifier other than `__a` and `__b`, which means no references to builtins, so it should be reasonably safe.)

I found this stats output very confusing, because I still have ~99% sample rejection, but basically none of it is coming from any reported events or filtering.

After some debugging (via hypothesis "debug" level verbosity and code spelunking in hypothesis internals), I eventually realized that all of the rejections are coming from https://github.com/HypothesisWorks/hypothesis/blob/master/hypothesis-python/src/hypothesis/internal/conjecture/data.py#L941. It seems the MAX_DEPTH value is hardcoded (https://github.com/HypothesisWorks/hypothesis/blob/master/hypothesis-python/src/hypothesis/internal/conjecture/data.py#L729), and by default there is zero visibility into sample rejection due to reaching MAX_DEPTH.
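
For the record, the behaviour I ran into boils down to something like this (a paraphrased, heavily simplified sketch of what I found, not the actual Hypothesis source):

```python
# Paraphrased, simplified sketch of the internal behaviour; names and
# structure only approximate hypothesis.internal.conjecture.data.
MAX_DEPTH = 100  # hardcoded module-level constant, not user-configurable


class StopTest(Exception):
    pass


class ConjectureDataSketch:
    def __init__(self):
        self.depth = 0

    def mark_invalid(self):
        # The test case is discarded here with no event recorded, so the
        # rejection never shows up in --hypothesis-show-statistics.
        raise StopTest()

    def draw(self, strategy):
        if self.depth >= MAX_DEPTH:
            self.mark_invalid()
        self.depth += 1
        try:
            return strategy.do_draw(self)
        finally:
            self.depth -= 1
```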

This issue is not about helping me fix my strategies to avoid these rejections (though of course I'd welcome any suggestions in that direction over on carljm/compgenerator#7).

The issue I'm reporting here is that it would have been nice if hypothesis' stats output had been more helpful in identifying the problem, without making me resort to debugging internals. This could maybe be as simple as recording an event when MAX_DEPTH is hit?

It might also be nice to make MAX_DEPTH configurable instead of hardcoded to 100, since in my particular case I'm OK with samples that have deeper trees. I could file a separate issue for that, if the suggestion makes any sense.

carljm changed the title from "poor observability when hypothesis filters samples due to MAX_DEPTH" to "poor observability when hypothesis rejects data due to MAX_DEPTH" on May 20, 2023
Zac-HD added the legibility label (make errors helpful and Hypothesis grokable) on May 22, 2023
Zac-HD (Member) commented May 22, 2023

The issue I'm reporting here is that it would have been nice if hypothesis' stats output had been more helpful in identifying the problem, without making me resort to debugging internals. This could maybe be as simple as recording an event when MAX_DEPTH is hit?

Sounds great! Grepping through our uses of data.mark_invalid(), I think it might make sense to add a why argument to that method and call data.note_event(why) internally - it makes sense to ensure that we always record an event when we reject a test case. Want to open a PR?
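
Roughly, I'm imagining something like this (a sketch only, not an actual diff; the details would be up to the PR):

```python
# Sketch of the proposed change to ConjectureData, not the real data.py:
class ConjectureData:
    def mark_invalid(self, why=None):
        # New optional argument: record *why* this test case is rejected,
        # so the reason shows up under "Events" in the statistics output.
        if why is not None:
            self.note_event(why)
        self.conclude_test(Status.INVALID)  # existing rejection path, simplified
```

Internal callers such as the MAX_DEPTH check would then pass a reason, e.g. `data.mark_invalid("max depth exceeded")`.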

It might also be nice to make MAX_DEPTH configurable instead of hardcoded to 100, since in my particular case I'm OK with samples that have deeper trees.

Since the depth in question doesn't correspond to anything in our user-facing API - in particular, it's not exactly the same as the nesting depth of the produced examples - I'd prefer to keep this as private internals, and work out a better value or how to auto-configure it without needing user intervention.

Zac-HD (Member) commented May 22, 2023

Related issues: Zac-HD/hypothesmith#28, Zac-HD/hypothesmith#30 (comment).

My limited free time permitting, I'm keen to help get better tools for source-code generation working - there are clearly ongoing uses for them 😀

carljm (Contributor, Author) commented May 23, 2023

Want to open a PR?

Yes! Will do.

carljm (Contributor, Author) commented May 25, 2023

Opened the PR for more events on invalidation.

Since the depth in question doesn't correspond to anything in our user-facing API - in particular, it's not exactly the same as the nesting depth of the produced examples - I'd prefer to keep this as private internals, and work out a better value or how to auto-configure it without needing user intervention.

I'd like to continue this discussion with some more data from my comprehension generator. Would you prefer that we continue in this issue (to keep all the above context), or that I open a new issue so this one stays just about the observability?

Zac-HD (Member) commented May 25, 2023

Let's go with a new issue, so that this one can be closed by your PR. (for which, thanks!)

Zac-HD (Member) commented May 26, 2023

Closed by #3654 ❤️🎉

Zac-HD closed this as completed on May 26, 2023