Use hypothesis for testing the standard library, falling back to stubs #86275
Following up on this year's language summit, I would like to propose that we start integrating property tests into the standard library. Zac Hatfield-Dodds (along with myself and Carl Friedrich Bolz-Tereick) has put together this repository of tests that we run against the standard library as-is: https://github.com/Zac-HD/stdlib-property-tests Here is the blog post covering the proposal from the language summit: https://pyfound.blogspot.com/2020/05/property-based-testing-for-python.html The biggest challenges here are logistical:
To allay these concerns, I propose that we implement a compatibility interface for hypothesis that uses the third-party module when it's installed, but otherwise falls back to stubs. The way I see the stubs working is that tests with explicit examples run as simple parameterized tests, while tests without examples are skipped. At least at first, we won't attempt to add a mandatory PR job that runs with hypothesis installed. Instead, I'd like to run either an optional job on PR or one or more buildbots that run the hypothesis tests. I would also like to suggest a policy of including at least one example in each property test, so that on PR at least some of the inputs are tested. |
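A minimal sketch of what such a test could look like under that scheme (the property and example values here are illustrative, not taken from an actual stdlib test):

```python
from hypothesis import example, given, strategies as st

# With hypothesis installed this runs the explicit examples plus many
# generated inputs; with the fallback stubs only the @example cases run,
# so on a plain PR job it degrades to an ordinary parameterized test.
@given(text=st.text())
@example(text="")                       # the classic empty edge case
@example(text="caf\u00e9 \N{SNOWMAN}")  # non-ASCII round-trips too
def test_utf8_roundtrip(text):
    assert text.encode("utf-8").decode("utf-8") == text
```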
I would like to have a more thorough discussion about the desirability of using Hypothesis first, since I feel that there is a rather hard "sell" going on. I brought this up in the SC tracker (python/steering-council#65) but I don't want to have the discussion there. Copying some quoted text from there: [me]
[Paul]
I don't even know what property-based testing means (for Python) other than Hypothesis. What other frameworks support that? In any case, nobody has been promoting anything else, but Zac has been on our case for over a year. :-) [me]
[Paul]
I meant in the public API. I don't doubt that for zoneinfo this worked well, but that doesn't prove it's any better than any other framework (maybe you would have been ecstatic if you could have used pytest as well :-). Fundamentally, I am a bit suspicious of Hypothesis' origins, Haskell, which is a language and community that just have a different approach to programming than Python. Don't get me wrong, I don't think there's anything wrong with Haskell (except that I've noticed it's particularly popular with the IQ >= 150 crowd :-). It's just different than Python, and just as we don't want to encourage writing "Fortran or Java in Python", I don't think it's a good idea to recommend writing "Haskell in Python".
Ah, here we get to the crux of the matter. What's a "strategy"? Maybe I would be more inclined to support Hypothesis if it was clearer what it does. I like libraries that do stuff for me that I know how to do myself. With Hypothesis, the deeper I get into the docs, the more I get the impression that there is deep magic going on that a poor sap like myself isn't supposed to understand. When I write @given(st.lists(st.integers)) what does that do? I haven't the foggiest idea, and from the docs I get the impression I'm not supposed to worry about it, and *that* worries me. Clearly it can't be generating all lists of 0 integers followed by all lists of 1 integer followed by all lists of 2 integers ... So in what order does it generate test data? The docs are pretty vague about that. (I'm not playing dumb here. I really have no idea how to go about that, and it seems that this is at the heart of Hypothesis and its Haskell ancestors. Presumably Ph.D. theses went into making this sort of thing work. Perhaps more effort could be expended explaining this part to laypeople?)
I have never tried that sort of thing, so I'll take your word for it. But there are other strategies. It seems that a lot of the strategies are answers to a series of questions, along the lines of "how do I generate valid URLs"; and then "email addresses"; and then "IP addresses"; and then suddenly we find a strategy that generates *functions*, and then we get "recursive things", and it keeps going. Timezone keys?! No wonder you found it easy to use. :-) I think I've heard that there's a strategy somewhere that generates random Python programs. How would that even work? Would it help me find corner cases in the grammar? I suppose we could actually use something like that for the "faster CPython" project, to validate that the optimizer doesn't break things. But how? Do I just write

    @given(st.python_programs(), st.values())
    def test_optimizer(func, arg):
        expected = func(arg)
        optimized = optimizer(func)(arg)
        assert optimized == expected

??? Methinks that the hard work then is writing the python_programs() strategy, so I'd want to understand that before trusting it. I suspect I've repeated myself enough times at this point :-), so I hope I'll get some sort of answer. Somehow it seems that in its zeal to "sell" itself as super-easy to use in all sorts of circumstances, Hypothesis has buried the lede -- how do you write effective strategies? |
(I think structuring this as a high-level explanation is clearer than point-by-point replies - it covers most of them, but the logic is hopefully easier to follow.)

The core idea of Hypothesis is that making random choices is equivalent to parsing a random bitstring. Hypothesis is not the only property-based testing library for Python (there's pytest-quickcheck), but in the same sense that [1]

So a "strategy" is an object (implementation detail: a parser combinator, an instance of SearchStrategy) which tells Hypothesis how to convert a random bytestring into whatever object you've asked for. This is a little unusual, but I haven't found a better way which still makes it easy to mix the strategy primitives with arbitrary user code such as "pick a username, ensure it's in the database, and then pass it to my test method".

I share your distaste for "Haskell in Python"; I don't think Hypothesis introduces more of that flavor than Numpy would introduce Fortran idioms - there's occasionally a resemblance, but they're both Python-first libraries and I'd attribute it mostly to the problem being solved. FWIW Hypothesis deliberately rejected most of the design choices of QuickCheck, and implementation-wise has more in common with unixy fuzzing tools like AFL.

The strategies in our API can generally be divided into useful primitives (e.g. integers, floats, lists, tuples, sampled_from, one_of) and pre-composed strategies (booleans = sampled_from, dictionaries = map + lists-of-kv-tuples, builds = tuples + fixed_dictionaries + map).

We _are_ deliberately vague about the order in which we generate test data, because the distribution is full of heuristics and adapts itself to the behaviour of the test - including over multiple runs, thanks to the database. The exact details are also considered an implementation detail, and subject to change as we discover heuristics and sampling techniques which find bugs faster. The upside here is that writing effective strategies is straightforward.

I'll take my Hypothesmith [3] project for random Python programs as an example of a complex strategy. It was inspired by CSmith [4], but is considerably less sophisticated. CSmith constructs semantically valid C programs; Hypothesmith (for now!) constructs a subset of syntactically valid Python. The basic idea is to "invert" the grammar: start with the root node, recursively construct a valid parse tree, and then serialise it out to a parsable input string. Doing a better job would require more time than I had for a proof-of-concept, but no new insights; and the same applies to defining a strategy for semantically valid code. Being based on the grammar, I'd guess it's unlikely to find corner cases in the grammar, but it might do so in the parser, or via e.g.

Your distrust of the python_programs() strategy is entirely reasonable, and I'd also want to understand it. I'd just add that it doesn't need to be elegant, or general, or express all inputs... so long as it can find valid inputs that reveal a bug, more easily or cheaply than users or other testing techniques :-)

To test compiler optimizations, yes, once you had the strategy that exact test body would work. An even more effective option is to use "equivalence modulo inputs" [5] to derive

[1] https://github.com/DRMacIver/minithesis
[3] https://pypi.org/project/hypothesmith/ ; in action at https://github.com/psf/black/blob/main/fuzz.py

I hope that helps; in particular Minithesis should take most of the magic out of it. |
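To make that primitive/composed split concrete, a rough sketch using Hypothesis's public strategies API (the equivalences are approximate and for illustration only):

```python
from collections import namedtuple
from hypothesis import strategies as st

# Primitives: draw directly from the underlying choice sequence.
ints = st.integers()
kv_pairs = st.tuples(st.text(), st.integers())

# Pre-composed strategies are (roughly) assembled from those primitives:
booleans = st.sampled_from([False, True])               # ~ st.booleans()
small_dicts = st.lists(kv_pairs, max_size=5).map(dict)  # ~ st.dictionaries(...)

# builds() is roughly tuples + fixed_dictionaries + map:
Point = namedtuple("Point", "x y")
points = st.builds(Point, x=ints, y=ints)
```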
Thanks, I will look into that before stirring up more dust. |
This issue confuses two different testing issues. First is selection of test data before the test versus generation while testing (usually 'randomly'). Second is how the function is tested. All tests test some property of the inputs and function, but it is useful to distinguish between testing that a function produces a specific expected output versus testing something more abstract or defined in terms of the inputs. Example 1: 2 + 3 = 5 (a + b = (known) c) versus 2 + 3 = 3 + 2 (a + b = b + a) and (2 + 3) + 4 = 2 + (3 + 4). Example 2: sorted([3,5,1,4]) = [1,3,4,5] versus is_sorted(sorted([random list])). The latter are what hypothesis people mean by 'property tests'. Such property tests can be done with either fixed or random pairs in unittest module tests.

The test that a module compiles is a weak property test, but better than nothing. A test that all lines of code run at least once without an unexpected exception is a stronger property test. I consider it roughly on a par with min(s) <= mean(s) <= max(s).

The connection between randomized input testing, such as with hypothesis, and property tests is that with random inputs, one cannot test f(input) against an 'expected' value unless one has an alternate means of producing the expected value. It is often easier to test a parameterized equation or other properties.

The problem with random input tests is not that they are 'flaky', but that they are useless unless someone is going to pay attention to failures and try to find the cause. This touches on the difference between regression testing and bug-finding tests. CPython CI is the former, and marred at that by buggy randomly failing tests.

My conclusion: bug testing would likely be a good idea, but should be done separately from the CI test suite. Such testing should only be done for modules with an active maintainer who would welcome failure reports. |
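For instance, such property tests can be written with plain unittest and fixed inputs (a sketch with illustrative names, not an existing stdlib test):

```python
import statistics
import unittest

class PropertyStyleTests(unittest.TestCase):
    def test_sorted_output_is_ordered(self):
        # Property: the output of sorted() is non-decreasing,
        # checked against fixed rather than random inputs.
        for data in ([3, 5, 1, 4], [], [2, 2, 2], list(range(10, 0, -1))):
            with self.subTest(data=data):
                out = sorted(data)
                self.assertTrue(all(a <= b for a, b in zip(out, out[1:])))

    def test_mean_between_min_and_max(self):
        # Property: min(s) <= mean(s) <= max(s).
        for data in ([1], [1, 2, 3], [-5, 0, 5]):
            with self.subTest(data=data):
                self.assertLessEqual(min(data), statistics.mean(data))
                self.assertLessEqual(statistics.mean(data), max(data))
```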
Are you saying that random input tests are flaky, but that that is not the big problem? In my experience using hypothesis, in practice it is not the case that you get tests that fail randomly. The majority of the time, if your code violates one of the properties, the tests fail the first time you run the test suite (this is particularly true for strategies where hypothesis deliberately makes it more likely that you'll get a "nasty" input by biasing the random selection algorithm in that direction). In a smaller number of cases, I see failures that happen on the second, third or fourth run. That said, if it were a concern that every run of the tests is using different inputs (and thus you might see a bug that only appears once in every 20 runs), it is possible to run hypothesis in a configuration where you specify the seed, making it so that hypothesis always runs the same set of inputs for the same tests. We can disable that on a separate non-CI run for hypothesis "fuzzing" that would run the test suite for longer (or indefinitely) looking for long-tail violations of these properties. I feel that if we don't at least run some form of the hypothesis tests in CI, there will likely be bit rot and the tests will decay in usefulness. Consider the case where someone accidentally breaks an edge case that makes it so that
If we run it in CI, the process would be:
Note that in the non-CI process, we need TWO core dev reviews, we need TWO PRs (people are not always super motivated to fix bugs that don't affect them that they caused when fixing a bug that does affect them), and we need time and effort from the buildbot maintainers (note the same applies even if the "buildbot" is actually a separate process run by Zac out of a github repo). Even if the bug only appears in one out of every 4 CI runs, it's highly likely that it will be found and fixed before it makes it into production, or at least much more quickly, considering that most PRs go through a few edit cycles, and a good fraction of them are backported to 2-3 branches, all with separate CI runs. It's a much quicker feedback loop. I think there's an argument to be made that incorporating more third-party libraries (in general) into our CI build might cause headaches, but I think that is not a problem specific to hypothesis, and I think it's one where we can find a reasonable balance that allows us to use hypothesis in one form or another in the standard library. |
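A sketch of that kind of split between a deterministic CI run and a longer fuzzing run (the profile names and environment variable are made up; the settings machinery is Hypothesis's):

```python
import os
from hypothesis import given, settings, strategies as st

# Deterministic profile for CI: same inputs every run, modest example count.
settings.register_profile("ci", derandomize=True, max_examples=50)
# Long-running profile for a separate, non-CI "fuzzing" job.
settings.register_profile("fuzz", max_examples=100_000)
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "ci"))

@given(st.lists(st.integers()))
def test_sorted_is_idempotent(xs):
    assert sorted(sorted(xs)) == sorted(xs)
```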
Okay, well, I'm trying to understand minithesis.py, but I am despairing. The shrinking code (really everything in class TestingState) is too much to grok. You can tell from the amount of comments that this is the tour-de-force of the code, but the comments don't really help -- they seem to be mostly notes for people trying to reimplement this in another language. (Plus some snide remarks about QuickCheck.) I suppose I have to read the paper at https://drmaciver.github.io/papers/reduction-via-generation-preview.pdf ... At least that has some drawings. I am still unclear on some basics too, but presumably it's all related. PS. What does "shortlex" mean? |
Shrinking is probably not important to understand, since it's *not* particularly related to anything but implementation details. The abstract of that paper explains why the design is important - integrated shrinking means we can report minimal (easier-to-understand) failing examples with zero user effort, and guarantee that all shrunk examples could have been generated. TestingState is, roughly, the

The important concept is that input examples are generated by a Possibility (i.e. Strategy), which internally makes a sequence of choices; and that sequence is sourced from a maybe-empty known prefix (e.g. replaying failures, mutating previous inputs) with the suffix determined by a PRNG. I feel like I'm explaining this badly, but there isn't a simple underlying thing to explain - Hypothesis works well because the user API is relatively simple, and then under the hood we have this many-layered stack of implementation details. The synergies and coverage of otherwise-pathological cases this delivers are excellent, but "multiple years and PhDs worth of detailed synergistic work" is just intrinsically complicated.
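A toy illustration of that choice-sequence idea (this is not minithesis or Hypothesis code, just the shape of the concept):

```python
import random

class ChoiceSource:
    """Draw choices from a known prefix first, then from a PRNG."""
    def __init__(self, prefix=(), rng=None):
        self.prefix = list(prefix)
        self.rng = rng or random.Random()
        self.record = []            # everything drawn, kept for replay/mutation

    def choice(self, n):
        value = self.prefix.pop(0) if self.prefix else self.rng.randrange(n)
        self.record.append(value)
        return value

def lists_of_small_ints(source, max_size=10):
    # A "Possibility": parse a list of ints out of the choice sequence.
    size = source.choice(max_size + 1)
    return [source.choice(1000) for _ in range(size)]

src = ChoiceSource()
value = lists_of_small_ints(src)
# Replaying the recorded choices regenerates exactly the same value:
assert lists_of_small_ints(ChoiceSource(prefix=src.record)) == value
```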
Sort by length, and among equal-length strings sort in the normal (alphabetical) order. For example: A < XY < ABC, etc. |
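In code terms, shortlex order is just a tuple key (a trivial sketch):

```python
words = ["ABC", "XY", "A", "B"]
# Shortlex: compare by length first, then alphabetically within a length.
print(sorted(words, key=lambda s: (len(s), s)))  # ['A', 'B', 'XY', 'ABC']
```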
Few things incite me quite as much as being told that I'm not supposed to understand something, so now I'm actually making an effort to follow the ECOOP paper. Honestly, it's not too bad (a lot more readable than type theory papers :-). I'll get back to you once I'm more comfortable in mapping between that and minithesis.py. |
I do not want to dissuade you from figuring out how minithesis / hypothesis works (far from it), but I'm wondering if the question of how shrinking works is germane to the issue at hand, which is whether or not hypothesis / property-based-testing is suitable for testing the standard library. I almost don't think it matters *if* shrinking works, much less *how*. Shrinking is something of a UI nicety in the sense that when hypothesis finds an example that violates a property, it will try to take whatever example it has chosen and find something like a "minimal working example" to show to the end user. So if we have something like:
Hypothesis will presumably tell us that

Anyway, do not let me disrupt your process, I just thought it might be worth making the point that some of these specific details might be nice to know, but don't seem like they should be blockers for hypothesis' adoption in the standard library. |
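An illustrative example of that behaviour (a deliberately false property; the exact reported counterexample may vary):

```python
from hypothesis import given, strategies as st

# A deliberately wrong property: "every list of integers has a positive sum".
# Hypothesis will find some failing list and then shrink it, so the reported
# counterexample is a minimal one such as xs=[] or xs=[0], rather than the
# large random list that first triggered the failure.
@given(st.lists(st.integers()))
def test_sum_is_positive(xs):
    assert sum(xs) > 0
```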
It is important to me, because I don't like having a dependency on |
Having read the parts of the paper that explain shortening, things are making more sense now. I have also successfully contributed type annotations to minithesis! :-) |
I withdraw my objection against Hypothesis per se. Others should debate |
Here's my two cents worth on the subject.
TL;DR I really like Hypothesis but think it doesn't belong in standard library tests. |
The suggestion I've made here is that we use @example decorators to take the hypothesis tests you would have written already and turn them into what are essentially parameterized tests. For anyone who doesn't want to explicitly run the hypothesis test suite, the tests you are apparently already writing would simply turn into normal tests for just the edge cases. One added benefit of keeping the tests around in the form of property tests is that you can run these same tests through hypothesis to find regressions in bugfixes that are implemented after landing (e.g. "Oh we can add a fast path here", which introduces a new edge case). The segfault bug from bpo-34454, for example, would have been found if I had been able to carry over the hypothesis-based tests I was using during the initial implementation of fromisoformat into later stages of the development. (Either because I didn't run it long enough to hit that particular edge case or because that edge case didn't come up until after I had moved the development locus into the CPython repo, I'm not sure.) Another benefit of keeping them around is that they become fuzz targets, meaning people like oss-fuzz or anyone who wants to throw some fuzzing resources at CPython have an existing body of tests that are expected to pass on *any* input, to find especially obscure bugs.
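For example, a property test of the sort described might look like this (the explicit examples are illustrative; this is not the actual datetime or zoneinfo test suite):

```python
import datetime
from hypothesis import example, given, strategies as st

# Round-trip property for isoformat()/fromisoformat(); the @example cases
# keep exercising known edge cases even when only the stubs are available.
@given(dt=st.datetimes())
@example(dt=datetime.datetime.min)
@example(dt=datetime.datetime(2000, 2, 29, 23, 59, 59, 999999))
def test_fromisoformat_roundtrip(dt):
    assert datetime.datetime.fromisoformat(dt.isoformat()) == dt
```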
I don't really think it's a good argument to say that it hasn't turned up useful bugs. Most of the bugs in a bit of code will be found during development or during the early stages of adoption, and we have very wide adoption. I've found a number of bugs in zoneinfo using hypothesis tests, and I'd love to continue using them in CPython rather than throwing them away or maintaining them in a separate repo. I also think it is very useful for us to write tests about the properties of our systems for re-use in PyPy (which does use hypothesis, by the way) and other implementations of Python. This kind of "define the contract and maintain tests to enforce that" is very helpful for alternate implementations.
I do not think that we need to make hypothesis tests mandatory. They can be used when someone finds them useful.
Even if this were the only time that hypothesis were useful (I don't think it is), some of these round-trips can be among the trickiest and most important code to test, even if it's a small fraction of the tests. We have a bunch of functions that are basically "Load this file format" and "Dump this file format", usually implemented in C, which are a magnet for CVEs and often the target for fuzz testing for that reason. Having a small library of maintained tests for round-tripping file formats seems like it would be very useful for people who want to donate compute time to fuzz test CPython (or other implementations!)
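A sketch of what such a maintained round-trip test could look like, using json purely as an example of a load/dump pair:

```python
import json
from hypothesis import given, strategies as st

# Strategy for JSON-representable values (NaN excluded, since NaN != NaN).
json_values = st.recursive(
    st.none() | st.booleans() | st.integers()
    | st.floats(allow_nan=False) | st.text(),
    lambda inner: st.lists(inner) | st.dictionaries(st.text(), inner),
)

@given(json_values)
def test_json_roundtrip(value):
    assert json.loads(json.dumps(value)) == value
```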
As mentioned in the initial ticket, the current plan I'm suggesting is to have fallback stubs which turn your property tests into parameterized tests when hypothesis is not installed. If we're good about adding

My suggestion is:
With the stubs, these tests won't be any slower than other tests, and I think a single "hypothesis-only" CI job done in parallel with the other jobs won't break the compute bank. Notably, we can almost certainly disable coverage detection on the edge-case-only job as well, which I believe adds quite a bit of overhead.
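For concreteness, a rough sketch of how such a fallback could behave (this is not the actual test.support.hypothesis_helper code, just an assumed shape):

```python
import functools
import unittest

try:
    from hypothesis import example, given  # the real library, if installed
except ImportError:
    def example(*args, **kwargs):
        # Record the explicit example on the test function.
        def decorator(test):
            test._examples = getattr(test, "_examples", []) + [(args, kwargs)]
            return test
        return decorator

    def given(*_given_args, **_given_kwargs):
        # Without hypothesis, run only the recorded @example cases as
        # subtests; tests with no explicit examples are skipped.
        def decorator(test):
            @functools.wraps(test)
            def wrapper(self):
                examples = getattr(test, "_examples", [])
                if not examples:
                    raise unittest.SkipTest("hypothesis not installed")
                for ex_args, ex_kwargs in examples:
                    with self.subTest(args=ex_args, kwargs=ex_kwargs):
                        test(self, *ex_args, **ex_kwargs)
            return wrapper
        return decorator
```

A real helper also needs a stand-in for the strategies namespace, since the arguments to @given must still import when hypothesis is absent; that is part of what the proposed helper module would have to provide.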
Again I don't think hypothesis tests are *mandatory* or that we plan to replace our existing testing framework with hypothesis tests. The median newcomer won't need to know anything about hypothesis testing any more than they need to know the details of the grammar or of the ceval loop to make a meaningful contribution. I don't even expect them to make up a large fraction of our tests. I'm a pretty big fan of hypothesis and used it extensively for zoneinfo, and yet only 13% of the tests (by line count) are property tests:

    $ wc -l tests/test_zoneinfo*
      325 tests/test_zoneinfo_property.py
     2138 tests/test_zoneinfo.py
     2463 total

I will not deny that there *are* costs to bringing this into the standard library, but I think that they can largely be mitigated. The main concern is really ensuring that we can bring it on board without making our CI system more fragile and without shifting much (if any) concern about fixing bitrot associated with the hypothesis tests onto the buildbot maintainers and release managers. |
In order to understand what Paul is concretely proposing, I read and partly reviewed his PR. I thought about maintainability.

Part A adds test.support.hypothesis_helper, similar in purpose to other specialized xyz_helper modules. Hypothesis_helper imports either hypothesis itself or a replacement 'as hypothesis'. The replacement is currently called '_hypothesis_stub', but its name is irrelevant as it is normally not seen by the user.

Part B adds a second zoneinfo test module, one that uses hypothesis. Besides further testing zoneinfo, it serves to test hypothesis_helper and partly justify its addition. It starts with 'from hypothesis_helper import hypothesis'. The new test module has tests such as
@given always (Paul?) runs the examples as subtests. When the replacement is imported, no randomized examples are added. If some year an example were to fail, could a hypothesis-ignorant zoneinfo-knowing person deal with the failure? I believe so, knowing that the above is equivalent to a test with "for dt in dt_list:\n with self.subTest(...):\n <body of function>"

Why not require such a rewriting? Some reasons against: Rewriting by hand can lead to errors. test_utc would have to be split into test_utc_ran(dom) and test_utc_pre(set). Code would have to be duplicated unless factored into a third function. For and with together add two indents, which sometimes squeezes assert onto more lines. I believe that when hypothesis is present, there are advantages to including preset examples with given-generated examples.

Paul would like the PR to be a 'camel's nose' in the CPython tent. This cuts both ways. I think this PR should be judged on its own merits. Opening possibilities can be a merit as long as not seen as pre-approval.

Our CI testing and buildbots are already configured to blame and annoy the wrong people for random failures. I don't want more unless failure notices are sent to someone responsible for the failing code. |
I've recently had [1] reported, which makes https://bugs.python.org/issue45738 the *third* parser bug [2] that Hypothesmith caught after release, and the second in a stable version - so I'll be maintaining a workaround for some time. I remain happy to help run these tests upstream where they would impact fewer people.

[1] Zac-HD/hypothesmith#16 and psf/black#2592 (comment) |
GH-86275: Implementation of hypothesis stubs for property-based tests, with zoneinfo tests (#22863)

These are stubs to be used for adding hypothesis (https://hypothesis.readthedocs.io/en/latest/) tests to the standard library. When the tests are run in an environment where `hypothesis` and its various dependencies are not installed, the stubs will turn any tests with examples into simple parameterized tests and skip any tests without examples. It also adds hypothesis tests for the `zoneinfo` module, and a GitHub Actions workflow to run the hypothesis tests as a non-required CI job. The full hypothesis interface is not stubbed out — missing stubs can be added as necessary. Co-authored-by: Zac Hatfield-Dodds <zac.hatfield.dodds@gmail.com>
Run hypothesis tests in parallel
Refs #107862 |
Is there need for this issue to stay open, or can the follow-up be handled in #107862? |
I think we can close this :) |