Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AST-based program generation #2

Closed
Zac-HD opened this issue Aug 6, 2019 · 3 comments
Closed

AST-based program generation #2

Zac-HD opened this issue Aug 6, 2019 · 3 comments

Comments

@Zac-HD
Copy link
Owner

Zac-HD commented Aug 6, 2019

Grammar-based generation works, and gives us syntactically valid source code.

The next step is to get semantically valid source code! The clear best approach for this is to generate a syntax tree, and "unparse" it into source code. Based on experiments at the PyCon Australia sprints the best AST to use is probably from lib2to3 - and that will give us the unparsing for free via black.

After that, I'd like to go to a concrete syntax tree where we draw formatting information at the same time as the node. This would massively improve our usefulness for black, but it's a lot of extra work.

@Zac-HD
Copy link
Owner Author

Zac-HD commented Sep 11, 2019

I have a WIP based on libCST, which does indeed generate source code... though it's also very rough around the edges.

To take this further, I think a better approach would be to use typed_ast to generate a (probably) semantically-valid program, then convert this to a libCST version and unparse that. Done concurrently it should shrink well too!

Development strategy: instead of starting with everything (as above) and then fixing each node, I think a more productive approach would be to start with a minimal subset (e.g. integer arithmetic) and then gradually expand. Keeping everything well-behaved while adding a single node type should be tractable, and that means incremental development works again 😄

@Zac-HD
Copy link
Owner Author

Zac-HD commented Apr 24, 2020

The simplest possible way to generate CST nodes is to start with from_type(Module), and then register just enough patches that we can actually instantiate each node. This is exactly how Hypothesmith 0.0.8 actually works - do the easy thing first! - and it was good enough to find two way to generate invalid code with LibCST.

However, this approach uses default values for all node parameters which have defaults, and that makes us much less likely to find any issues related to e.g. whitespace and parentheses. Supporting those makes us responsible for balancing parentheses though, which is non-trivial.

I'd also like to support swarm testing over the generated productions. This isn't too hard conceptually, but hypothesis FeatureFlags is not (yet) public API, and an ideal implementation would have some way of avoiding 'dead ends' where we generate something that the feature config doesn't allow us to complete.

Finally, there are many nodes dependent on other nodes - which makes registering strategies for them a little tricky. We therefore make very liberal use of the deferred(lambda: from_type(...)) idiom. This could be reduced a great deal - though not eliminated! - by partitioning node types into topologically ordered groups and using deferred only where there really are cycles in the dependency graph. Maybe after the MVP is out the marginal gains to introspection and performace will seem worth it, but they're not at the moment.

@Zac-HD
Copy link
Owner Author

Zac-HD commented May 8, 2021

This is basically working now - registering more specific strategies for individual node types would be a substantial performance boost, but I'm happy with the current design for now.

@Zac-HD Zac-HD closed this as completed May 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant