Explore alternatives to RustPython for Python AST parsing #286

charliermarsh · 2022-09-29T19:38:15Z

RustPython has been a good foundation, but Ruff is currently limited by what the RustPython parser does and does not support. See: #282, #54, #245.

Since we only need the parser (and RustPython is more ambitious -- they're trying to build an entire runtime / interpreter), there may be better options. RustPython is also going to be limited by their use of an LL(1) parser.

We could consider using the following:

charliermarsh · 2022-09-30T13:01:45Z

Alternatively, could we use the CPython AST parser directly? It's written in C, AFAIK it should be very fast (perhaps even faster than the RustPython parser).

charliermarsh · 2022-09-30T13:34:01Z

Alternatively, we could try to continue merging improvements back to RustPython.

charliermarsh · 2022-09-30T19:50:19Z

Alternatively, we could revisit the use of LibCST.

charliermarsh · 2022-09-30T22:18:19Z

LibCST will also likely be required for auto-formatting.

charliermarsh · 2022-10-01T21:02:29Z

tree-sitter is an interesting option (demo in #295). It produces a CST, so we could use it to power auto-formatting. It's slower than RustPython but still quite fast (~350ms to iterate over the CPython codebase vs. ~230ms with RustPython). Because it's based on S-expressions, it has a whole pattern-matching syntax built-in, which we could use to power plugins...

I think the ergonomics on our end won't be great, because it just exposes a single Node type with a kind: &str field, and so we'd have to do all pattern-matching on string identifiers rather than encoding the AST in a type system. It also may not map 1:1 with the abstract grammar, I'm not certain.

Going to explore a few other options, but I'm intrigued...

charliermarsh · 2022-10-03T21:26:06Z

There's also rust-python-parser, which is based on nom.

charliermarsh · 2022-10-03T22:27:19Z

There's also rust-sitter, which sits on top of tree-sitter and lets you define the grammar on the Rust side via macros. In return, you get semantically meaningful structs when you generate the parse tree. The downside is that we'd have to re-write the Python grammar ourselves.

charliermarsh · 2022-10-03T22:28:07Z

If we were to use tree-sitter, we may want to write code to transform the generated tree into AST types on the Rust side, similar to the RustPython AST. (That would have to be closer to a CST, though.)

charliermarsh · 2022-10-05T19:54:20Z

rust-python-parser takes ~480ms (vs. 280ms for RustPython's parser) and fails on these files:

resources/test/cpython/Lib/_compression.py
resources/test/cpython/Lib/ast.py
resources/test/cpython/Lib/contextlib.py
resources/test/cpython/Lib/distutils/tests/test_sysconfig.py
resources/test/cpython/Lib/idlelib/colorizer.py
resources/test/cpython/Lib/idlelib/idle_test/mock_tk.py
resources/test/cpython/Lib/idlelib/pyparse.py
resources/test/cpython/Lib/idlelib/replace.py
resources/test/cpython/Lib/idlelib/run.py
resources/test/cpython/Lib/importlib/_bootstrap.py
resources/test/cpython/Lib/lib2to3/tests/data/py3_test_grammar.py
resources/test/cpython/Lib/linecache.py
resources/test/cpython/Lib/logging/__init__.py
resources/test/cpython/Lib/test/coding20731.py
resources/test/cpython/Lib/test/test_asyncio/test_locks.py
resources/test/cpython/Lib/test/test_asyncio/test_pep492.py
resources/test/cpython/Lib/test/test_asyncio/test_server.py
resources/test/cpython/Lib/test/test_contextlib_async.py
resources/test/cpython/Lib/test/test_coroutines.py
resources/test/cpython/Lib/test/test_dis.py
resources/test/cpython/Lib/test/test_format.py
resources/test/cpython/Lib/test/test_pkgutil.py
resources/test/cpython/Lib/test/test_sys_settrace.py
resources/test/cpython/Lib/test/test_types.py
resources/test/cpython/Lib/test/test_typing.py
resources/test/cpython/Lib/unittest/test/testmock/testasync.py
resources/test/cpython/Lib/zoneinfo/_zoneinfo.py
resources/test/cpython/Tools/c-analyzer/c_parser/preprocessor/__init__.py
resources/test/cpython/Tools/scripts/stable_abi.py
resources/test/cpython/Tools/unicode/makeunicodedata.py

charliermarsh · 2022-11-22T01:37:44Z

Ok, time for me to revive this thread and start thinking on the right long-term parser strategy. We need to support the 3.10 constructs!

charliermarsh · 2022-11-22T02:32:30Z

I'm strongly considering writing a new parser atop rust-peg or nom that adheres to the RustPython AST interface.

charliermarsh · 2022-11-22T03:33:15Z

@Seamooo - Do you have any thoughts on or interest in working on this too?

Seamooo · 2022-11-22T04:29:48Z

Definitely have some interest in this area. A couple of comments as I've been going down various rabbit holes of python parser implementations:

The parser should be built from the CPython PEG, directly transpiled into either an equivalent grammar or used via proc_macro to generate the parser
- Run into a lot of grammar specifications with weird FIXMEs about multiple edge cases because their grammar attempted to be a superset of CPython's, that didn't quite reduce down to the same grammar.
proc_macro gives the ability to generate a perfect hash function for a set of keywords at compile time
rust-peg seems to be an excellent compromise between exploiting the lower-bound time complexity for Packrat parsers without explicitly requiring caching every match, allowing multiple matches rather than lookups for manual performance tuning
rust-peg also has the necessary extensions to the formalised PEG grammar such that implementing the python grammar is possible

Seamooo · 2022-11-22T04:32:27Z

Ideally when #742 becomes mergeable, implementing those traits for the IR output by the parser would make it immediately useable.

charliermarsh · 2022-11-22T04:52:28Z

Yeah rust-peg does look good, and I believe that's what LibCST uses. My only hesitation there is that the LibCST parser is really slow compared to the RustPython parser? But I don't know how much, if any, of that is due to rust-peg. This is just based on the fixtures in the LibCST codebase:

rust_python_parse/all   time:   [289.30 µs 289.96 µs 290.64 µs]
                        change: [-4.8537% -4.5331% -4.2286%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 23 outliers among 100 measurements (23.00%)
  11 (11.00%) low mild
  2 (2.00%) high mild
  10 (10.00%) high severe

parse/all               time:   [1.7676 ms 1.7953 ms 1.8408 ms]
                        change: [-7.9448% -7.2284% -6.1601%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

parse_into_cst/all      time:   [2.2262 ms 2.2477 ms 2.2723 ms]
                        change: [-2.7483% -0.8482% +0.8620%] (p = 0.40 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

(rust_python_parse/all is the RustPython parser, parse/all is the LibCST parser without inflation, and parse_into_cst/all is the LibCST parser with inflation.)

charliermarsh · 2022-11-22T04:52:35Z

(And agreed on #742.)

charliermarsh · 2022-11-22T15:52:19Z

Perhaps we could extend pegen with a Rust target based on rust-peg.

isidentical · 2022-11-22T17:53:02Z

CC: @davidhalter (I think you were working on a Rust-based Fast Python Parser [?])

charliermarsh · 2022-11-22T19:15:41Z

I'm going to start playing around with a straight port of pegen to Rust. (Or, more specifically, modifying pegen to output a Rust target, which IIUC will also require rewriting the Python code from data/python.gram in Rust.)

davidhalter · 2022-11-22T23:26:07Z

Thanks for bringing this up @isidentical.

I have indeed written a Python parser in Rust. But I do not feel comfortable sharing it yet. While it's faster than any other Python parser I have seen, it's a bit rough around error recovery and few other things, but generally parses all valid 3.10 Python programs (AFAIK). So if I ever release it, I'm happy to let you guys know.

charliermarsh · 2022-11-22T23:43:33Z

Thanks @davidhalter! Super cool. Would love to see it if you ever release it.

If you don't mind sharing: did you write it from scratch? Or is it built atop a parser generator?

charliermarsh · 2022-11-23T04:53:59Z

(I started on this in earnest tonight, I don't know if it will prove to be the right approach but the lower bound is that I learn a lot.)

davidhalter · 2022-11-23T18:11:14Z

@charliermarsh Yes I wrote it from scratch, it is a kind of weird mixture of LL and PEG, it's also a parser generator. I essentially pass it a slightly modified version of the official EBNF grammar.

I personally love writing parser generators, so when the opportunity presented itself, I took it :)

charliermarsh · 2022-11-26T15:03:33Z

@Seamooo - Do you have any interest in collaborating w/ me on the Rust port of pegen? It's a private repo right now but would be happy to add you and talk about how I'm iterating on it.

Seamooo · 2022-11-27T03:00:17Z

Definitely would be interested

charliermarsh · 2022-11-27T03:34:34Z

@Seamooo - Added you, repo is rough but your help is very welcome!

ljodal · 2022-12-03T19:25:22Z

I'd love (read) access as well, if possible. I don't think I'm much help in contributing, but I've been looking at rust-peg to generate a Python-compatible ast. My rust knowledge isn't there yet though, so I'm finding it a bit hard to get going (I have a working POC with a few manually defined nodes, but I don't think that'll tell us anything useful about the performance hit).

charliermarsh · 2022-12-03T19:30:32Z

@ljodal - Added! Would love to have any help / feedback.

(Anyone is welcome to be added, I'm just avoiding making it totally public while it's still in such an incomplete state.)

DimitrisJim · 2022-12-12T12:22:30Z

Anyone is welcome to be added

I'd like to take you up on that invitation if the attempt is still ongoing :-). We've recently been thinking of doing the same with RustPython's parser, porting pegen would just reduce the differences between grammar files between it and CPython, making syncing with it much easier. Not sure how much time I have lying around but I'd help as I could.

charliermarsh · 2022-12-12T13:42:43Z

@DimitrisJim - Added! You'll be able to tell from the contribution dates, but I haven't been able to push on it as much as I'd like lately (just other Ruff work taking priority). Any / all help or feedback welcome.

While you're here: to be clear, my preference would be to continue using the RustPython parser :) The goal of this task is, implicitly, to remove the parser as the bottleneck for what Ruff can support, i.e., to get us to a point where the parser can support all current language features.

fanninpm · 2022-12-12T21:26:21Z

Rust port of pegen

I was thinking of adding a rust_generator.py file to CPython's copy of pegen, but I didn't quite know where to begin.

charliermarsh · 2022-12-12T21:28:04Z

@fanninpm - I added you to my rust-pegen project. Feel free to collaborate there or use it as knowledge / inspiration or whatever is most useful. It adds a rust_generator.py.

charliermarsh · 2022-12-12T21:29:27Z

There's also a PEG grammar (https://github.com/charliermarsh/rust-pegen/blob/main/pegen/data/rust.gram) (has to differ from the Python grammar because the PEG grammars include actual code). The generated code is here.

fanninpm · 2023-01-06T12:12:07Z

@qingshi163 what do you think?

charliermarsh · 2023-01-06T12:42:02Z

As an update: my current hope here is to continue to use RustPython, especially since there's been a lot of good progress in the parser recently, although structural pattern matching in some form is still the biggest hurdle.

LucianU · 2023-01-27T07:29:39Z

What would be the downside to using the CPython AST parser directly, since that is guaranteed to always be up-to-date?

flying-sheep · 2023-01-31T18:33:25Z

The big one I think is that CPython’s AST doesn’t seem to have all the formatting information (whitespace, comments) stored. That’s why CST parsers exist, like RustPython’s and LibCST (which is written in Rust, but as said above, seems to be slow). I’m pretty sure Ruff needs this information to operate on comments and to do autofixes without mangling comments and whitespace. (Actually, I think the old Parser created an intermediate CST, but the new one doesn’t anymore)

If a well maintained, fast CST library written in C existed, and we rephrased the question to “why not use that C library”, I think there would be two ways to do it, each with their downside:

When directly using its data structures, it would be difficult to safely manipulate on the CST (necessary for auto fixes).
When converting its data structures to Rust, there’s Rust code to maintain again (not a big problem, the data structures shouldn’t change too much), and there’s a performance hit (probably not too big either)

charliermarsh · 2023-01-31T19:06:01Z

Interestingly, RustPython's AST doesn't include any CST-like information -- it's very similar to the CPython AST. We get comments and other non-AST tokens from the token stream directly (comments are included in the token stream, but not in the AST).

charliermarsh · 2023-01-31T19:10:04Z

To me the downsides to using the CPython AST parser are as follows:

We need to figure out how to extract it from CPython as a useable standalone module (presumedly solvable but not something I've really investigated).
We need to figure out how to do the FFI between Rust and C (also presumedly solvable, but also not my strength).
It makes it much more difficult for us to ever use a specialized parser or AST representation for Ruff. RustPython doesn't necessarily solve this either, but I suspect that Ruff could benefit a lot from having its own parser that was written specifically Ruff (and similar use-cases) -- perhaps something that does produce a CST, or something that's optimized for the kinds of operations we perform.

LucianU · 2023-02-01T08:39:19Z

So, as far as I understand, there are several things here. Ruff needs a CST representation of Python.
CPython doesn't have one. RustPython doesn't have one either.

The existing solution is LibCST which is slow. Could the reason for its slowness be the fact that it does more work?
My understanding is that a CST contains more information than an AST.

So, I see a new question:
Which solution is more appropriate:

the current one that uses an AST parser and takes non-AST tokens from the stream OR
a proper CST parser?

charliermarsh · 2023-02-28T22:50:27Z

We've continued to improve the RustPython parser and it's working well for us! Some of these questions are coming up again with the autoformatter, where I have to map the AST to a CST by enriching it with information derived from the token stream.

In that context, I'm mostly focused right now on the CST representation itself and not the process by which it's parsed / extracted, which I'm punting until later.

charliermarsh · 2023-05-18T15:55:44Z

I'm going to close this for now, as we're continuing to use and contribute to the RustPython parser. Nothing actionable here.

charliermarsh added the enhancement label Sep 29, 2022

charliermarsh changed the title ~~Replace RustPython for Python AST parsing~~ Explore alternatives to RustPython for Python AST parsing Oct 18, 2022

andersk mentioned this issue Oct 26, 2022

Implement W605, invalid escape sequence #474

Closed

charliermarsh mentioned this issue Oct 30, 2022

Using ast-grep as a library ast-grep/ast-grep#80

Closed

g-as mentioned this issue Nov 21, 2022

Parenthesized context managers raise E999 #847

Closed

andersk mentioned this issue Dec 12, 2022

Parse Python 3.9+ parenthesized context managers RustPython/RustPython#4329

Merged

youknowone mentioned this issue Dec 12, 2022

Recent changes to CPython's grammar and PEG parser, PEP-617 RustPython/RustPython#2759

Open

charliermarsh added core Related to core functionality and removed enhancement labels Dec 31, 2022

DimitrisJim mentioned this issue Jan 5, 2023

[DRAFT] PEG Parser via rust-peg RustPython/RustPython#4423

Draft

charliermarsh closed this as completed May 18, 2023

Explore alternatives to RustPython for Python AST parsing #286

Explore alternatives to RustPython for Python AST parsing #286

Comments

charliermarsh commented Sep 29, 2022

charliermarsh commented Sep 30, 2022

charliermarsh commented Sep 30, 2022

charliermarsh commented Sep 30, 2022

charliermarsh commented Sep 30, 2022

charliermarsh commented Oct 1, 2022

charliermarsh commented Oct 3, 2022

charliermarsh commented Oct 3, 2022 • edited Loading

charliermarsh commented Oct 3, 2022

charliermarsh commented Oct 5, 2022

charliermarsh commented Nov 22, 2022

charliermarsh commented Nov 22, 2022

charliermarsh commented Nov 22, 2022

Seamooo commented Nov 22, 2022

Seamooo commented Nov 22, 2022

charliermarsh commented Nov 22, 2022

charliermarsh commented Nov 22, 2022

charliermarsh commented Nov 22, 2022

isidentical commented Nov 22, 2022

charliermarsh commented Nov 22, 2022

davidhalter commented Nov 22, 2022 • edited Loading

charliermarsh commented Nov 22, 2022

charliermarsh commented Nov 23, 2022

davidhalter commented Nov 23, 2022

charliermarsh commented Nov 26, 2022

Seamooo commented Nov 27, 2022

charliermarsh commented Nov 27, 2022

ljodal commented Dec 3, 2022

charliermarsh commented Dec 3, 2022

DimitrisJim commented Dec 12, 2022

charliermarsh commented Dec 12, 2022

fanninpm commented Dec 12, 2022

charliermarsh commented Dec 12, 2022 • edited Loading

charliermarsh commented Dec 12, 2022

fanninpm commented Jan 6, 2023

charliermarsh commented Jan 6, 2023

LucianU commented Jan 27, 2023

flying-sheep commented Jan 31, 2023 • edited Loading

charliermarsh commented Jan 31, 2023

charliermarsh commented Jan 31, 2023

LucianU commented Feb 1, 2023

charliermarsh commented Feb 28, 2023

charliermarsh commented May 18, 2023

charliermarsh commented Oct 3, 2022 •

edited

Loading

davidhalter commented Nov 22, 2022 •

edited

Loading

charliermarsh commented Dec 12, 2022 •

edited

Loading

flying-sheep commented Jan 31, 2023 •

edited

Loading