parser_stream: Produce green tree traversal rather than token ranges #560
Conversation
Force-pushed from 11cb8fd to 88cd5c1.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
```
@@            Coverage Diff             @@
##             main     #560      +/-   ##
==========================================
- Coverage   96.16%   94.99%   -1.17%
==========================================
  Files          13       14       +1
  Lines        4115     4236     +121
==========================================
+ Hits         3957     4024      +67
- Misses        158      212      +54
```
## Background

I've written about 5 parsers that use the general red tree/green tree pattern. Now that we're using JuliaSyntax in base, I'd like to replace some of them by a version based on JuliaSyntax, so that I can avoid having to maintain multiple copies of similar infrastructure. As a result, I'm taking a close look at some of the internals of JuliaSyntax.

## Current Design

One thing that I really like about JuliaSyntax is that the parser basically produces a flat output buffer (well, two in the current design, after #19). In essence, the output is a post-order depth-first traversal of the parse tree, each node annotated with the byte range it covers. From there, it is possible to recover the parse tree without re-parsing by partitioning the token list according to the ranges of the non-terminal tokens. One particular application of this is to re-build a pointer-y green tree structure that stores relative byte ranges and serves the same incremental parsing purpose as green tree representations in other systems.

The single-output-buffer design is a great innovation over the pointer-y system. It's much easier to handle and it also enforces important invariants by construction (or at least makes them easy to check). However, I think the whole post-parse tree construction logic is reducing its value significantly. In particular, green trees are supposed to be able to serve as compact, persistent representations of the parse tree. However, here the compact, persistent representation (the output memory buffer) is not usable as a green tree. We do have the pointer-y `GreenNode` tree, but this has all the same downsides that the single-buffer system was supposed to avoid: it uses explicit vectors in every node, and even constructing it from the parser output allocates a nontrivial amount of memory to recover the tree structure.

## Proposed design

This PR proposes to change the parser output to be directly usable as a green tree in-situ, by changing the post-order DFS traversal to instead produce (byte, node) spans (note that this is the same data as in the current `GreenNode`, except that the node span is implicit in the length of the vector and the children are implicit in their position in the output). This does essentially mean semantically reverting #19, but the representation proposed here is more compact than both main and the pre-#19 representation. In particular, the output is now a sequence of:

```julia
struct RawGreenNode
    head::SyntaxHead      # Kind, flags
    byte_span::UInt32     # Number of bytes covered by this range
    # If NON_TERMINAL_FLAG is set, this is the total number of child nodes.
    # Otherwise this is a terminal node (i.e. a token) and this is orig_kind.
    node_span_or_orig_kind::UInt32
end
```

The structure is used for both terminals and non-terminals, with the interpretation of the last field differing between them. This is marginally more compact than the current token list representation on current `main`, because we do not store the `next_byte` pointer (which would instead have to be recovered from the green tree using the usual `O(log n)` algorithm). However, because we store `node_span`, this data structure provides linear-time traversal (in reverse order) over the children of the current node. In particular, this means that the tree structure is manifest and does not require the allocation of temporary stacks to recover the tree structure. As a result, the output buffer can now be used as an efficient, persistent, green tree representation.
I think the primary weird thing about this design is that iteration over the children must happen in reverse order. The current GreenNode design has constant-time access to all children. Of course, a lookup table for this can be computed in linear time with a smaller memory footprint than the GreenNode design, but it's important to point out this limitation. That said, for transformation use cases (e.g. to `Expr` or `SyntaxNode`), constant-time access to the children is not really required (although the children are being produced backwards, which looks a little funny). To avoid any disruption to downstream users, the `GreenNode` design itself is not changed to use this faster alternative; we can consider doing so in a later PR.

## Benchmark

The motivation for this change is not performance, but rather representational cleanliness. That said, it's of course imperative that this not degrade performance. Fortunately, the benchmarks show that this is in fact marginally faster for `Expr` construction, largely because we get to avoid the additional memory allocation traffic from having the tree structure explicitly represented. Parse time itself is essentially unchanged (which is unsurprising, since we're primarily changing what's being put into the output, although the parser does a few lookback-style operations in a few places).
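To make the reverse-order child iteration described above concrete, here is a self-contained toy sketch (not JuliaSyntax's actual types or buffer layout). It assumes, for illustration only, that each node records the number of nodes in its subtree excluding itself, and it folds whitespace trivia into adjacent tokens to keep the example small:

```julia
# Toy flat green node: a stand-in for the RawGreenNode described above.
struct FlatNode
    kind::Symbol        # stand-in for SyntaxHead
    byte_span::UInt32   # bytes covered by this node
    node_span::UInt32   # nodes in this subtree, excluding the node itself (0 for tokens)
end

# Post-order buffer for `a*b + c` (trivia folded into the `+` token for brevity):
buf = [
    FlatNode(:Identifier, 1, 0),   # a
    FlatNode(:Operator,   1, 0),   # *
    FlatNode(:Identifier, 1, 0),   # b
    FlatNode(:call,       3, 3),   # a*b
    FlatNode(:Operator,   3, 0),   # " + "
    FlatNode(:Identifier, 1, 0),   # c
    FlatNode(:call,       7, 6),   # a*b + c  (root)
]

# Visit the children of buf[parent] from last to first, with no extra allocation.
function foreach_child_reverse(f, buf::Vector{FlatNode}, parent::Int)
    remaining = Int(buf[parent].node_span)
    i = parent - 1                        # the last child sits directly before its parent
    while remaining > 0
        f(i)
        skip = Int(buf[i].node_span) + 1  # jump over the child's entire subtree
        remaining -= skip
        i -= skip
    end
end

foreach_child_reverse(i -> println(i, " => ", buf[i].kind), buf, length(buf))
# 6 => Identifier   (c)
# 5 => Operator     (+)
# 4 => call         (a*b)
```

Because each child's subtree occupies the contiguous slots just before it, skipping `node_span + 1` entries lands exactly on the previous sibling, which is why no temporary stack is needed to recover the tree structure.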
Force-pushed from 88cd5c1 to dab8423.
Something appears to have regressed during final cleanup - probably a type instability somewhere. Will take a look.

Looks like I need to apologize to copilot here. While the O(n^2) assertion is false of course,

I think that particular mystery can be solved another day. The best thing to do is just to avoid using GreenNode at all and iterating directly. This PR is done from my perspective.

@c42f not sure if you're around currently, but if you are, your input would be appreciated.
"(call-i (call-i a::Identifier *::Identifier b::Identifier) +::Identifier c::Identifier)" | ||
|
||
@test sprint(highlight, t[1][3]) == "a*b + c\n# ╙" | ||
@test sprint(highlight, t.source, t.raw, 1, 3) == "a*b + c\n# ╙" |
why is this test removed?
Reaches into the internals of SyntaxNode which changed. The line prior shows that the SyntaxNode itself can still be highlighted with the same result.
This builds on top of #560 and replaces the use of `SyntaxNode` in hooks.jl by the new lower-level cursor APIs. This avoids allocating two completely separate representations of the syntax tree. As a result, the end-to-end parse time for error-containing code is between 1.5x (if the error is the first token) and 2x (if the error is the last token) faster than current master. However, the main motivation here is just to reduce coupling between the Expr-producing and SyntaxNode-producing parts of the code.
I'm pretty happy with this. I think my biggest concern is that the reverse-children API is a little awkward, though it's not terrible. I think we could either
I think 1. would be best, but I don't know if there are ever any places where we re-use a mark, which wouldn't work in this scheme.
To answer my own question, collecting a few places:
As I mentioned in #560, and as contemplated in #536, I'd like to try re-using JuliaParser infrastructure to replace parsers I've written for some other languages. This takes the first step to do so by moving various files into directories depending on whether they are language-dependent or not. Right now there is still some coupling and of course, there are no actual abstractions between these pieces. The idea would be to introduce those over time. For now, if we put in this refactoring, the way to use this would be to copy the appropriate pieces (at least `core/`) into your downstream parser and then rewrite it to those APIs. I'm planning to do that with a parser or two to see if I hit any big API issues and see what it would take to actually make the re-use happen.
I like the new data structures!
> reverse the output in place after parse
The current ordering (post-order DFS, children in source order) makes more sense in my head. Reversing the list (pre-order DFS, children reversed) wouldn't change the need to iterate through children in reverse, though we would be moving forward through the list.
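As a small illustration of the two orderings being compared (toy notation, not the exact node kinds JuliaSyntax emits):

```julia
# Buffer layouts for `a*b + c`:
#
#   post-order DFS (current output; children in source order, parent last):
#       a, *, b, (call a*b), +, c, (call (call a*b) + c)
#
#   reversed buffer (= pre-order DFS with children reversed; parent first):
#       (call (call a*b) + c), c, +, (call a*b), b, *, a
#
# In both layouts, enumerating a node's children still proceeds from the last
# child to the first; reversing only changes whether the scan moves forward or
# backward through the buffer.
```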
```julia
end

# Get Julia value of leaf node as it would be represented in `Expr` form
function _expr_leaf_val(node::SyntaxNode)
```
JuliaLowering assumes this is here (it's unused there too, so could probably be deleted.)
""" | ||
GreenTreeCursor | ||
Represents a cursors into a ParseStream output buffer that makes it easy to |
```diff
- Represents a cursors into a ParseStream output buffer that makes it easy to
+ Represents a cursor into a ParseStream output buffer that makes it easy to
```
Though this comment may not be necessary
```julia
An AST node with a similar layout to `Expr`. Typically constructed from source
text by calling one of the parser API functions such as [`parseall`](@ref)
"""
const SyntaxNode = TreeNode{SyntaxData}
```
Field types in the docstring need updating (github doesn't let me comment on those lines though)
src/syntax_tree.jl (Outdated)

```julia
raw::RawGreenNode
byte_end::UInt32
```
Would it make sense to replace this with a tree cursor? REPL, JuliaLowering, maybe others assume the raw children are accessible from node.raw, though I'm not certain that information wouldn't be redundant given that we will probably always have the SyntaxNode on hand in cases like these.
That depends a bit on whether these are traversal trees or modification trees. We're a bit inconsistent with it at the moment. If they're modification trees, then we can't necessarily assume that we still have the flat buffer of raw nodes to index into. My recommendation would be to remove .raw accesses and define appropriate accessors on the syntax node for everything that is required. That way we can adjust the data structure to its actual needs later.
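A toy sketch of that accessor pattern (hypothetical types and names, not the package's actual API): callers go through functions instead of reaching into `.raw`, so the backing representation can change later without touching them.

```julia
# Hypothetical stand-ins for the real types, for illustration only.
struct ToyRawNode
    head::Symbol
    byte_span::UInt32
end

struct ToySyntaxNode
    raw::ToyRawNode     # today this happens to be a raw green node...
    byte_end::UInt32
end

# ...but clients only ever call these accessors:
node_head(n::ToySyntaxNode)      = n.raw.head
node_byte_span(n::ToySyntaxNode) = n.raw.byte_span

# If `raw` is later replaced by a cursor or a (buffer, index) pair, only the two
# accessor methods above need to change.
```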
> depends a bit on whether these are traversal trees or modification trees
If I understand correctly, you're saying that the SyntaxNode and raw green tree buffer would need to stay in sync, and there would be no good or sane way to modify them—isn't this currently the case, too?
> define appropriate accessors on the syntax node for everything that is required
Looking through a SyntaxNode's children will only give us the RawGreenNodes that have associated SyntaxNodes, so we would lose some information.
So there are two possible designs. One is that SyntaxNode is just a better traversal cursor that provides O(1) access to children and parents (like RedTreeCursor, but storing the indices; possibly I misnamed the red tree cursor, since it traditionally has those algorithmics in addition to computing absolute byte positions). The other is that it's a full concrete syntax tree. The difference is what happens when you want to modify the tree. If it's a traversal tree, you modify the underlying buffer and throw away all cursors because they got invalidated (and you need to update raw green nodes going up the stack). On the other hand, if it's a modification tree, then the way you update is that you change the `val` inside the tree, and then you could print source by iterating over the leaves and printing their contents. Right now, `SyntaxNode` is a bit of a hybrid, because it's mutable and explicitly stores `val`, but then it also has `position`, which you'd generally only do for a traversal tree. My suggestion would be that we separate these explicitly. Anything that only needs traversal gets the immutable traversal data structure (i.e. RedTreeCursor + parent/child indices). And then we have a separate data structure for clients like formatters that actually want to change the code.
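A toy contrast of the two designs discussed here (hypothetical types, purely to illustrate the distinction):

```julia
# Traversal tree: an immutable view into the flat parser output. Cheap to create,
# but it must be discarded whenever the underlying buffer is edited.
struct ToyTraversalNode{Buf}
    buffer::Buf      # the flat green-node buffer produced by the parser
    index::Int       # this node's position in that buffer
end

# Modification tree: owns its value and children. Edits happen on the tree itself,
# and source text is reproduced by walking the leaves.
mutable struct ToyModifiableNode
    kind::Symbol
    val::Any
    children::Vector{ToyModifiableNode}
end
```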
That sounds like a reasonable design. I'll still argue that we get more (and more backwards-compatible) information for no additional bytes per node with a vector reference and index instead of the RawGreenNode, but I'm also biased by the things I've worked on so far (where everything wants SyntaxNodes and nothing wants to modify them). I think you would have a better idea of usage in the ecosystem.
JuliaFormatter is not a consideration until it stops being stuck on JuliaSyntax 0.4 :) (this is supposed to be on my to-do list for JETLS...)
> but then it also has `position`, which you'd generally only do for a traversal tree

Is `byte_end` different?
> vector reference and index instead of the RawGreenNode
Well, but you need to keep the vector, which you don't at the moment.
> Is `byte_end` different?
No, that's basically just a rename (it changed from the start to the end location to be consistent with the underlying structure, but otherwise semantically identical).
> Well, but you need to keep the vector, which you don't at the moment
You're right, I was assuming the parse stream was already being held onto somewhere, which isn't necessarily true.
I didn't mean a literal reversal, but rather a tree-ordered reversal that changes the traversal order. It's a relatively easy thing to do, since it's possible to directly compute the position that a node would have in a pre-order traversal. That said, it is probably the case that most people will be creating appropriate traversal trees anyway, so it may not matter as much.
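For what it's worth, one way the "directly compute the pre-order position" step could work, sketched under the assumption that each entry records the number of proper descendants in its subtree (helper name and convention are hypothetical, not code from this PR):

```julia
# Given subtree sizes in post-order, compute the index each node would have in a
# pre-order traversal, using pre(v) = post(v) - descendants(v) + depth(v),
# in a single backward pass over the buffer.
function preorder_positions(node_spans::Vector{Int})
    n = length(node_spans)
    pre = zeros(Int, n)
    starts = Int[]                   # subtree start positions of currently open ancestors
    for post in n:-1:1
        while !isempty(starts) && starts[end] > post
            pop!(starts)             # we've left that ancestor's subtree
        end
        depth = length(starts)
        pre[post] = post - node_spans[post] + depth
        if node_spans[post] > 0
            push!(starts, post - node_spans[post])
        end
    end
    return pre
end

# `a*b + c` in post-order: a, *, b, (call a*b), +, c, (call ... c)
preorder_positions([0, 0, 0, 3, 0, 0, 6])   # == [3, 4, 5, 2, 6, 7, 1]
```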
I think we can put

OK with me. I think it should be doable to use the non-pointery green trees in
More packages than I thought are relying on `raw` being a GreenNode. In general, as discussed on the PR, we'd probably like to do more work on traversal trees anyway, so this keeps things consistent for downstream while we do that; then, when we have something better, downstream can switch.
Alright, restored `raw` to be a `GreenNode`.
Co-authored-by: Em Chu <61633163+mlechu@users.noreply.github.com>
I think I'll need to adapt
@c42f Feel free to chime in or put things back whenever you get back; I'm not certain where some of the more generic code should live

Alright, I'm gonna call that good enough. Juggling all my branches is starting to be annoying. @c42f Still very interested in your thoughts - so whenever you're back, please do chime in if you have thoughts.
As I mentioned in #560, and as contemplated in #536, I'd like to try re-using JuliaParser infrastructure to replace parsers I've written for some other languages. This takes the first step to do so by moving various files into directories depending on whether they are language-dependent or not. Right now there is still some coupling and of course, there are no actual abstractions between these pieces. The idea would be to introduce those over time. For now, if we put in this refactoring, the way to use this would be to copy the appropriate pieces (at least `core/`) into your downstream parser and then rewrite it to those APIs. I'm planning to do that with a parser or two to see if I hit any big API issues and see what it would take to actually make the re-use happen.

- core: Core functionality for parsing
- julia: Core functionality for parsing *julia*
- integration: Integration code to use as the parser for base
- porcelain: Other syntax tree types for external users of the package

The `integration` and `porcelain` components should not depend on each other. Otherwise it's layered as expected. This is just the reorganization. Additional work is required to actually separate the abstractions.
Changes are needed to work with the latest JuliaSyntax (particularly JuliaLang/JuliaSyntax.jl#560), but that isn't in Base yet, and more changes are likely to come, so I'm holding off on updating JuliaLowering at the moment.
Heya, I'm back. It looks like this PR addresses two of the vague TODO's which were in my head for a while!
So that's great :) It's a nice surprise that the tree structure can be more manifest inside

I think the benchmarks show this data layout is generally a bit slower (except for

My main reservation is the reverse-order iteration of children, which feels very awkward even for a low-level API. I don't see how we can avoid this at the lowest level - Ok, so we want a post-processing step? We do already have more than one post-processing step in the system; primarily that is
The combination of this branch and JuliaLang/julia#58674 is faster than the version before my changes. I considered that good enough, although there were some perf opportunities left on the table. I don't think the representation is fundamentally slower for any reason.
My immediate plan was to add a tree cursor that caches the O(1) indices on construction and see if that's good enough. I think one of the primary requirements of the parsed representation is compactness, and the O(1) information is redundant. I also think that if we push the O(1) behavior into the cursor API layer for now, that will let us change the immutable data structure under the hood later based on actual measurement results for different clients (and perhaps different clients may actually want different tradeoffs).
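A sketch of the kind of cache such a cursor could build on construction (hypothetical; it assumes, as in the sketches above, that each entry stores the number of proper descendants in its subtree): a single linear pass over the post-order buffer collects every node's child indices in source order, giving O(1) child access afterwards.

```julia
# Build, for every node in a post-order buffer, the buffer indices of its
# children in source order. Each node is visited exactly once as a child,
# so the whole pass is linear in the buffer length.
function children_table(node_spans::Vector{Int})
    n = length(node_spans)
    children = [Int[] for _ in 1:n]
    for p in 1:n
        remaining = node_spans[p]
        i = p - 1
        while remaining > 0
            push!(children[p], i)     # collected last-to-first...
            skip = node_spans[i] + 1
            remaining -= skip
            i -= skip
        end
        reverse!(children[p])         # ...then flipped back to source order
    end
    return children
end

# `a*b + c` in post-order: a, *, b, (call a*b), +, c, (call ... c)
children_table([0, 0, 0, 3, 0, 0, 6])
# [[], [], [], [1, 2, 3], [], [], [4, 5, 6]]
```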
Sure, having a cursor which caches whatever setup for forward traversal and an

I don't think there's much functional difference between that and a hypothetical "new version of GreenNode". There is a conceptual difference: calling these things cursors suggests to the user that they should be temporary. If compactness is key (unknown, but plausible), that's the right signal to send to users.