
poor observability when hypothesis rejects data due to MAX_DEPTH #3643

Closed
carljm opened this issue May 20, 2023 · 6 comments
Labels: legibility (make errors helpful and Hypothesis grokable)

Comments

carljm (Contributor) commented May 20, 2023

I wrote a "comprehension fuzzer" using hypothesis to fuzz the CPython compiler, specifically on list comprehensions: https://github.com/carljm/compgenerator

This is part of debugging and stabilizing the implementation of PEP 709. The fuzzer uses strategies (inspired by Hypothesmith, but not using it) to generate a valid, constrained Python AST containing comprehensions, and then tests its compilability and output in both a control build and a test build of CPython.
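
For a rough idea of the shape of such strategies, here is a minimal, hypothetical sketch (not the actual compgenerator code) of a recursive Hypothesis strategy that builds list-comprehension ASTs:

```python
import ast

from hypothesis import strategies as st

# Hypothetical, heavily simplified sketch; compgenerator's real strategies
# are much more constrained (functions, classes, lambdas, nested scopes, ...).
names = st.sampled_from(["__a", "__b"]).map(
    lambda n: ast.Name(id=n, ctx=ast.Load())
)

# Leaves are bare names; each inner node wraps drawn expressions in a list
# comprehension, so generated trees can nest arbitrarily deep.
exprs = st.recursive(
    names,
    lambda inner: st.builds(
        lambda elt, it: ast.ListComp(
            elt=elt,
            generators=[
                ast.comprehension(
                    target=ast.Name(id="__a", ctx=ast.Store()),
                    iter=it,
                    ifs=[],
                    is_async=0,
                )
            ],
        ),
        inner,
        inner,
    ),
    max_leaves=10,
)

modules = st.builds(
    lambda e: ast.Module(body=[ast.Expr(value=e)], type_ignores=[]),
    exprs,
)
```

A drawn module can then be compiled in each interpreter under test with something like `compile(ast.fix_missing_locations(m), "<fuzz>", "exec")`.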

I observed that the fuzzer is very slow because it rejects something like 98% of the samples it generates. I've done some work in carljm/compgenerator#8 to try to eliminate the visible reasons for rejection (my own uses of .filter(...)), but I still end up with stats looking like this:

test_fuzz_comps.py::test_comprehension:

  - during reuse phase (1.00 seconds):
    - Typical runtimes: ~ 24-113 ms, of which ~ 13-54 ms in data generation
    - 10 passing examples, 0 failing examples, 0 invalid examples

  - during generate phase (115.09 seconds):
    - Typical runtimes: ~ 28-214 ms, of which ~ 28-215 ms in data generation
    - 3 passing examples, 0 failing examples, 987 invalid examples
    - Events:
      * 0.10%, Retried draw from builds(Module, ListStrategy(one_of(classes(), functions()), min_size=1, max_size=1)).filter(compilable) to satisfy filter
      * 0.10%, not compilable

  - Highest target scores:
                   1  (label='(modules) number of functions')
                   2  (label='(modules) number of classes')
                   2  (label='(modules) number of lambdas')
                   3  (label='(modules) number of listcomps')
  - Stopped because settings.max_examples=100, but < 10% of examples satisfied assumptions

(If it's useful to reproduce this, use carljm/compgenerator#8 and, after `pip install -r frozen.txt` in a Python 3.11 venv, run `pytest --hypothesis-show-statistics --hypothesis-verbosity=debug -s --run-evalserver`. Note that this does evaluate fuzzed AST samples, but the samples are quite constrained: no imports, no calls, no use of any identifier other than `__a` and `__b`, which means no references to builtins, so it should be reasonably safe.)

I found this stats output very confusing, because I still have ~99% sample rejection, but basically none of it is coming from any reported events or filtering.

After some debugging (via hypothesis "debug" level verbosity and code spelunking in hypothesis internals), I eventually realized that all of the rejections are coming from https://github.com/HypothesisWorks/hypothesis/blob/master/hypothesis-python/src/hypothesis/internal/conjecture/data.py#L941. It seems the MAX_DEPTH value is hardcoded (https://github.com/HypothesisWorks/hypothesis/blob/master/hypothesis-python/src/hypothesis/internal/conjecture/data.py#L729), and by default there is zero visibility into sample rejection due to reaching MAX_DEPTH.
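
For the record, the behaviour I ran into boils down to something like this (a paraphrased, heavily simplified sketch of what I found, not the actual Hypothesis source):

```python
# Paraphrased, simplified sketch of the internal behaviour; names and
# structure only approximate hypothesis.internal.conjecture.data.
MAX_DEPTH = 100  # hardcoded module-level constant, not user-configurable


class StopTest(Exception):
    pass


class ConjectureDataSketch:
    def __init__(self):
        self.depth = 0

    def mark_invalid(self):
        # The test case is discarded here with no event recorded, so the
        # rejection never shows up in --hypothesis-show-statistics.
        raise StopTest()

    def draw(self, strategy):
        if self.depth >= MAX_DEPTH:
            self.mark_invalid()
        self.depth += 1
        try:
            return strategy.do_draw(self)
        finally:
            self.depth -= 1
```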

This issue is not about helping me fix my strategies to avoid these rejections (though of course I'd welcome any suggestions in that direction over on carljm/compgenerator#7).

The issue I'm reporting here is that it would have been nice if hypothesis' stats output had been more helpful in identifying the problem, without making me resort to debugging internals. This could maybe be as simple as recording an event when MAX_DEPTH is hit?

It might also be nice to make MAX_DEPTH configurable instead of hardcoded to 100, since in my particular case I'm OK with samples that have deeper trees. I could file a separate issue for that, if the suggestion makes any sense.

carljm changed the title from "poor observability when hypothesis filters samples due to MAX_DEPTH" to "poor observability when hypothesis rejects data due to MAX_DEPTH" on May 20, 2023
Zac-HD added the legibility label (make errors helpful and Hypothesis grokable) on May 22, 2023
Zac-HD (Member) commented May 22, 2023

The issue I'm reporting here is that it would have been nice if hypothesis' stats output had been more helpful in identifying the problem, without making me resort to debugging internals. This could maybe be as simple as recording an event when MAX_DEPTH is hit?

Sounds great! Grepping through our uses of data.mark_invalid(), I think it might make sense to add a why argument to that method and call data.note_event(why) internally - it makes sense to ensure that we always record an event when we reject a test case. Want to open a PR?
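
Roughly, I'm imagining something like this (a sketch only, not an actual diff; the details would be up to the PR):

```python
# Sketch of the proposed change to ConjectureData, not the real data.py:
class ConjectureData:
    def mark_invalid(self, why=None):
        # New optional argument: record *why* this test case is rejected,
        # so the reason shows up under "Events" in the statistics output.
        if why is not None:
            self.note_event(why)
        self.conclude_test(Status.INVALID)  # existing rejection path, simplified
```

Internal callers such as the MAX_DEPTH check would then pass a reason, e.g. `data.mark_invalid("max depth exceeded")`.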

It might also be nice to make MAX_DEPTH configurable instead of hardcoded to 100, since in my particular case I'm OK with samples that have deeper trees.

Since the depth in question doesn't correspond to anything in our user-facing API - in particular, it's not exactly the same as the nesting depth of the produced examples - I'd prefer to keep this as private internals, and work out a better value or how to auto-configure it without needing user intervention.

Zac-HD (Member) commented May 22, 2023

Related issues: Zac-HD/hypothesmith#28, Zac-HD/hypothesmith#30 (comment).

My limited free time permitting, I'm keen to help get better tools for source-code generation working - there are clearly ongoing uses for them 😀

carljm (Contributor, Author) commented May 23, 2023

Want to open a PR?

Yes! Will do.

carljm (Contributor, Author) commented May 25, 2023

Opened the PR for more events on invalidation.

Since the depth in question doesn't correspond to anything in our user-facing API - in particular, it's not exactly the same as the nesting depth of the produced examples - I'd prefer to keep this as private internals, and work out a better value or how to auto-configure it without needing user intervention.

I'd like to continue this discussion with some more data from my comprehension generator. Would you prefer that we continue in this issue (to keep all the above context), or that I open a new issue so this one stays just about the observability?

Zac-HD (Member) commented May 25, 2023

Let's go with a new issue, so that this one can be closed by your PR. (for which, thanks!)

Zac-HD (Member) commented May 26, 2023

Closed by #3654 ❤️🎉

Zac-HD closed this as completed on May 26, 2023