
Property-based testing for the parser #91

Closed
Zac-HD opened this issue Apr 18, 2020 · 8 comments

@Zac-HD

Zac-HD commented Apr 18, 2020

Hi everyone!

I gave a talk about property-based testing at the Language Summit yesterday, and people seemed interested. Since it's particularly useful when you have two implementations that can be compared, I thought I should drop by and offer to help apply it to testing the new parser 🙂

We've tested against the stdlib and many popular packages - why might this find additional bugs?

Popular and well-written code may contain zero examples of many weird syntactic constructs, since it's typically designed for clarity and compatibility. In my demo here, I discovered several strings that can be compiled but not handled by tokenize.tokenize - involving odd uses of newlines, backslashes, non-ASCII characters, etc.
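
A minimal sketch of that kind of round-trip check (illustrative, not the exact test from the demo - the test name is mine):

import io
import tokenize

import hypothesis
import hypothesmith

@hypothesis.given(hypothesmith.from_grammar())
def test_compilable_source_also_tokenizes(source):
    # hypothesmith only emits strings accepted by compile(), so the property
    # here is that the tokenize module can handle them too.
    compile(source, "<string>", "exec")
    list(tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline))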

It's also nice to automatically get minimal examples of any failing inputs.

Say we decide to use this. What's the catch?

I'll assume here that you're familiar with the basic idea of Hypothesis (e.g. from my talk) - if not, the example above should be helpful, and I've collected all the links you could want here: https://github.com/Zac-HD/stdlib-property-tests#further-reading

I've already written a tool to generate Python source code from 3.8's grammar, called hypothesmith - it's a standard Hypothesis strategy that generates strings which can be passed to the compile builtin (a convenient predicate for "is this actually valid source code").

hypothesmith is implemented by taking a reasonably good EBNF grammar and "inverting" it, using Hypothesis' built-in support for such grammars (and the st.from_regex() strategy for terminals). The catch is that, for exactly the same reasons PEP 617 exists, I have a bunch of hacks where I filter substrings using the compile builtin - and so there's no way to generate anything not accepted by that function. Pragmatically, I think it's still worth using, since it's pretty quick to set up!

Extensions after vanilla hypothesmith is running against the parser

First, updating hypothesmith to treat the PEG grammar as authoritative is a no-brainer - I'm aiming to finish that in the next week or two. This would remove the "compile is trusted" limitation entirely. It also shouldn't be too hard - strategies are literally parser combinators, and to generate a string we just "run it backwards": start at the root, and choose random productions at each branch or generate random terminals, writing out the string that would drive such transitions.
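
As a toy illustration of "running a grammar backwards" (a made-up two-production grammar, not Python's):

import hypothesis.strategies as st

# Toy grammar:  expr := NAME | NAME "(" expr ")"
name = st.from_regex(r"[a-z_][a-z0-9_]{0,10}", fullmatch=True)
expr = st.deferred(lambda: name | st.builds("{}({})".format, name, expr))

# expr.example() might produce e.g. "foo" or "f(g(x))"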

Second, Hypothesis tests can be used as fuzz targets for coverage-guided fuzzers such as AFL or libfuzzer. Since we can also generate a wide variety of inputs from scratch, this can be a really powerful technique for ratcheting up our coverage of really weird edge cases. Or just hypothesis.target(...) things like the length of the generated code, the number of ast nodes, number of node types, nodes-per-length, etc.
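
For instance (a sketch only - Hypothesis exposes a fuzz_one_input hook on @given tests for exactly this; the driver, module, and test names here are placeholders):

import sys

import atheris  # or python-afl etc. - any coverage-guided driver works similarly

from parser_property_tests import test_parser  # hypothetical module with a @given test

# Hand the Hypothesis test to the fuzzer as a libFuzzer-style target.
atheris.Setup(sys.argv, test_parser.hypothesis.fuzz_one_input)
atheris.Fuzz()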

Please be very precise about what you suggest

  1. Install Hypothesis and Hypothesmith.
  2. Use @hypothesis.given(hypothesmith.from_grammar()) to generate inputs to a copy of your existing tests which run against the stdlib / PyPI / other Python modules (a minimal sketch follows this list).
  3. Fix any bugs you find, and tell me what worked and/or what didn't.
  4. Something I didn't think of.
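
A minimal sketch of step 2, with a placeholder for however the new parser is invoked (substitute the project's real entry point):

import ast

import hypothesis
import hypothesmith

def parse_with_new_parser(source):
    # Placeholder - wire this up to the PEG parser's actual entry point.
    return ast.parse(source)

@hypothesis.given(hypothesmith.from_grammar("file_input"))
def test_new_parser_agrees_with_old(source):
    # Differential property: both parsers should produce equivalent ASTs.
    assert ast.dump(parse_with_new_parser(source)) == ast.dump(ast.parse(source))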

You probably have more questions, and I hope I haven't been too presumptuous opening this issue. Please ask, and I'm happy to answer and help out (including reviewing and/or writing code) however I can, subject to other commitments and my Australian timezone!

Finally, thanks for all your work on this project - I'm really looking forward to what a precise and complete grammar will do for the core and wider Python community.

@pablogsal

I did some initial investigation and I may be missing something :( From what I can see, the generated examples don't really cover many possibilities in the grammar space that exercise the parser, as the strategy mostly produces empty module objects. For instance, I tried this very simple test, which fails if the generated code contains an ast.With node:

import hypothesis
import hypothesmith
from hypothesis import settings, HealthCheck
import ast

settings.register_profile("slow", deadline=None, suppress_health_check=HealthCheck.all(), max_examples=100000)

@hypothesis.given(hypothesmith.from_grammar("file_input").filter(str.strip))
@hypothesis.settings.get_profile("slow")
def test_parser(code):
    x = ast.dump(ast.parse(code))
    assert "withitem" not in x

and after several minutes the test had not found anything. Inspecting the generated AST for all the examples, I can see that almost all of them are an empty module:

Module(body=[], type_ignores=[])

Any idea what is happening? Is this expected?

@Zac-HD

Zac-HD commented Apr 18, 2020

Hmm. Turning up the verbosity setting will make it print each example it tries, which might make things clearer.

The first thing I'd try is just adding hypothesis.target(float(len(code)), label="characters in source code"), so Hypothesis tries longer examples. It's possible this will just go for e.g. really long string literals, in which case targeting number of ast nodes as well or instead might be better.

The second is that I might need to rework hypothesmith - it's really at the proof-of-concept stage at the moment and deserves its 0.0.x version number! This could be as simple as sprinkling a bunch of filters through it to reject empty productions (inelegant and slower, but it works), or I might have to write that custom translation of the PEG grammar.

@Zac-HD

Zac-HD commented Apr 19, 2020

import ast
import dis

import hypothesis
import hypothesmith
from hypothesis import assume, target

# Uses the "slow" settings profile registered in the earlier snippet.
@hypothesis.given(hypothesmith.from_grammar("file_input").filter(str.strip))
@hypothesis.settings.get_profile("slow")
def test_parser(code):
    x = ast.dump(ast.parse(code))

    # assume (in-test filter) away any functionally empty inputs
    empty = ast.dump(ast.parse(""))
    assume(x != empty)
    n_instructions = float(len(list(dis.Bytecode(compile(code, "<string>", "exec")))))
    assume(n_instructions > 0)
    # target larger inputs - the Hypothesis engine will do a rough multi-objective
    # hill-climbing search using these scores to generate 'better' examples.
    target(n_instructions, label="number of instructions in bytecode")
    target(float(len(x) - len(empty)), label="length of dumped ast body")

    assert "withitem" not in x

Just adding some well-chosen target() calls improves things a lot, though it's still far from perfect - notably the no-with-nodes test still passes. Even so, I think there's a decent chance of finding some offset issues, e.g. from non-ASCII identifiers and really weird strings in comments.

You don't need to change the test though, as I've just released hypothesmith==0.0.7 with that targeting logic built in. The effect is typically clear by around ~100 examples and plateaus around ~1000.

Longer term, we'll still need a new strategy based on a CST (e.g. the prototype at Zac-HD/hypothesmith#2) to get much further 😞

@Zac-HD

Zac-HD commented Apr 24, 2020

@pablogsal - I've just released Hypothesmith 0.1, with a new CST-based from_node() strategy. This test now fails within a few seconds:

import ast

import hypothesis
import hypothesmith

@hypothesis.given(hypothesmith.from_node())
@hypothesis.settings.get_profile("slow")
def test_parser(code):
    x = ast.dump(ast.parse(code))
    assert "withitem" not in x

Personally I'd use hypothesmith.from_grammar() | hypothesmith.from_node() to ensure that we catch bugs produced by either approach to code generation, but either way there have been substantial upgrades since the version you tested against above 😄
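
For concreteness, a sketch of that combined strategy (reusing the test shape above):

code_strategy = hypothesmith.from_grammar() | hypothesmith.from_node()

@hypothesis.given(code_strategy)
@hypothesis.settings.get_profile("slow")
def test_parser(code):
    # `|` unions the two strategies, so examples are drawn from both generators.
    assert "withitem" not in ast.dump(ast.parse(code))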

@Zac-HD

Zac-HD commented May 4, 2020

@pablogsal - I just saw this Twitter thread and wanted to clarify here (I don't have Twitter) - the main reason hypothesmith.from_grammar() is underwhelming is precisely because the old grammar is imprecise.

Longer and more interesting strings are correspondingly more likely to include something which is permitted by the grammar but not actually valid syntax; and the engine therefore ends up avoiding those areas of the input space because there are few valid examples to try variations on. Unfortunately, this means it was spending most of the time exploring trivial examples like comments and whitespace.

The auto_target feature in 0.1.0 compensates for this somewhat, by focussing attention back on larger and more complex productions, but the underlying problem is still there. As far as I can tell LibCST doesn't have any such problems with imprecision, since it's much more like the new parser, and happily we can still boost its performance with auto_target.

@pablogsal

pablogsal commented May 4, 2020

Thanks for the clarification @Zac-HD! I am very excited about the latest improvements in hypothesmith. Sadly I have only had a few hours to play with the new solution, so I could not yet gather enough experience to give actionable feedback, but as you say it seems that some underlying problems are still there even with the new improvements.

Longer and more interesting strings are correspondingly more likely to include something which is permitted by the grammar but not actually valid syntax; and the engine therefore ends up avoiding those areas of the input space because there are few valid examples to try variations on. Unfortunately, this means it was spending most of the time exploring trivial examples like comments and whitespace.

There is also the problem that, although the set of valid strings has measure zero, it is still an incredibly big and complex space to explore. For instance, some of the errors we found were in nested f-strings, and when I played with the LibCST-based version it never generated one of them, even after several hours.

Sadly the new PEG grammar won't help much with this, because PEG grammars are not generative (the grammar is not written in a way that makes it easy to generate valid strings, as opposed to CFGs) but analytical (the grammar is written in a way that makes it easy to check whether a given string is part of the language). It may still be possible, but certainly not as easy.

Another thing to consider is that testing almost-correct strings is very important, because we also want to make sure that our parser doesn't accept constructs we don't want. Right now the only way to detect these cases is to stare at the grammar for long enough or to try to break it on purpose.

The auto_target feature in 0.1.0 compensates for this somewhat, by focussing attention back on larger and more complex productions, but the underlying problem is still there. As far as I can tell LibCST doesn't have any such problems with imprecision, since it's much more like the new parser, and happily we can still boost its performance with auto_target.

When I have more time I will give the new version another go, but I fear that it will not be as much help as Hypothesis normally is, given the aforementioned limitations. I think hypothesmith is very promising and I am very excited about it, but I think the tooling still needs to evolve a bit more before it can find the same kind of bugs that Hypothesis normally finds.

@Zac-HD

Zac-HD commented May 4, 2020

Thanks for the feedback - it sounds like we're on the same page. Personally I'd still leave Hypothesmith tests running somewhere - while the expected and ideal case is that they don't find anything, CPU time is pretty cheap!

Some notes-to-self, comments from anyone welcome:

  • It would be great for Hypothesis to provide a nice public API for 'swarm testing' (see HypothesisWorks/hypothesis#1637, "Support automatic 'swarm testing' for example selection"), and then use it for productions in the grammar. I should get around to writing up that issue! We failed to generate nested f-strings precisely because "test features compete for space in each test, limiting the depth to which logic driven by features can be explored".
  • The PEG grammar couldn't be automatically converted anyway as Lark only supports an EBNF variant, but it should be fairly easy if tedious to translate productions into Hypothesis strategies by hand. Looking at the grammar, a rote translation into deferred and one_of strategies should get most of the way there.
  • Generating almost-correct strings typically requires starting with a valid input and corrupting it somehow. Mutating fuzzers like AFL give you this for free; with Hypothesis I'd just compose a valid-source strategy with e.g. choosing a span to delete, duplicate, or reverse (a sketch follows this list). There's bound to be a literature on nice corrupting operations for source code...
  • Literally just pointing AFL or similar at this could also be a useful way to spend some CPU time.
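
A sketch of the "corrupt a valid input" idea from the third bullet above (the deletion operator and span bounds are arbitrary choices):

import hypothesis.strategies as st
import hypothesmith

@st.composite
def almost_valid_source(draw):
    # Start from a valid program, then delete a small random span, so the
    # result is usually *not* valid - useful for exercising error handling.
    source = draw(hypothesmith.from_grammar())
    start = draw(st.integers(min_value=0, max_value=max(len(source) - 1, 0)))
    length = draw(st.integers(min_value=1, max_value=10))
    return source[:start] + source[start + length:]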

@gvanrossum

I don't believe there are concrete action items for the CPython parser here. I have added "play with Hypothesis" on my personal TO-DO list, but it's a looong list...
