@Keno Keno commented Jun 9, 2025

Background

I've written about 5 parsers that use the general red tree/green tree pattern. Now that we're using JuliaSyntax in Base, I'd like to replace some of them with a version based on JuliaSyntax, so that I can avoid having to maintain multiple copies of similar infrastructure. As a result, I'm taking a close look at some of the internals of JuliaSyntax.

Current Design

One thing that I really like about JuliaSyntax is that the parser basically produces a flat output buffer (well, two in the current design, after #19). In essence, the output is a post-order depth-first traversal of the parse tree, with each node annotated with the range of bytes it covers.

From there, it is possible to recover the parse tree without re-parsing by partitioning the token list according to the ranges of the non-terminal tokens. One particular application of this is to re-build a pointer-y green tree structure that stores relative byte ranges and serves the same incremental-parsing purpose as green tree representations in other systems.
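
To illustrate why this reconstruction allocates, here is a minimal sketch (hypothetical names, not JuliaSyntax's actual implementation) of rebuilding a pointer-y tree from a post-order (kind, byte span) list, assuming each non-terminal's byte span is exactly the sum of its children's spans: it needs a temporary stack plus one child vector per node.

```julia
# Hypothetical pointer-y tree: one child vector allocated per node.
struct TreeNode
    kind::Symbol
    byte_span::Int
    children::Vector{TreeNode}
end

# Rebuild from post-order events (kind, byte_span, is_terminal): completed
# children are popped off a temporary stack until their spans cover the parent.
function rebuild(events)
    stack = TreeNode[]
    for (kind, span, terminal) in events
        if terminal
            push!(stack, TreeNode(kind, span, TreeNode[]))
        else
            children = TreeNode[]
            covered = 0
            while covered < span
                child = pop!(stack)
                covered += child.byte_span
                pushfirst!(children, child)
            end
            push!(stack, TreeNode(kind, span, children))
        end
    end
    only(stack)
end

# Post-order events for `a+b`:
t = rebuild([(:Identifier, 1, true), (:Operator, 1, true),
             (:Identifier, 1, true), (:call, 3, false)])
```

Every node allocates a fresh `Vector`, and the stack itself grows with tree depth; this is the overhead discussed below.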

The single-output-buffer design is a great innovation over the pointer-y system. It's much easier to handle and it also enforces important invariants by construction (or at least makes them easy to check). However, I think the whole post-parse tree-construction logic significantly reduces its value. In particular, green trees are supposed to be able to serve as compact, persistent representations of the parse tree. However, here the compact, persistent representation (the output memory buffer) is not usable as a green tree. We do have the pointer-y GreenNode tree, but this has all the same downsides that the single-buffer system was supposed to avoid. It uses explicit vectors in every node, and even constructing it from the parser output allocates a nontrivial amount of memory to recover the tree structure.

Proposed design

This PR proposes to change the parser output to be directly usable as a green tree in situ, by changing the post-order DFS traversal to instead produce (byte, node) spans (note that this is the same data as in the current GreenNode, except that here the node span is implicit in the length of the vector and the children are implicit in their position in the output).

This does essentially mean semantically reverting #19, but the representation proposed here is more compact than both main and the pre-#19 representation. In particular, the output is now a sequence of:

```julia
struct RawGreenNode
    head::SyntaxHead                  # Kind, flags
    byte_span::UInt32                 # Number of bytes covered by this node
    # If NON_TERMINAL_FLAG is set, this is the total number of child nodes
    # Otherwise this is a terminal node (i.e. a token) and this is orig_kind
    node_span_or_orig_kind::UInt32
end
```

The structure is used for both terminals and non-terminals, with the interpretation of the last field differing between them. This is marginally more compact than the token list representation on current `main`, because we do not store the `next_byte` pointer (which would instead have to be recovered from the green tree using the usual O(log n) algorithm).

However, because we store `node_span`, this data structure provides linear-time traversal (in reverse order) over the children of the current node. In particular, this means that the tree structure is manifest and does not require the allocation of temporary stacks to recover. As a result, the output buffer can now be used as an efficient, persistent green tree representation.

I think the primary weird thing about this design is that iteration over the children must happen in reverse order. The current GreenNode design has constant-time access to all children. Of course, a lookup table for this can be computed in linear time using less memory than the GreenNode design, but it's important to point out this limitation. That said, for transformation use cases (e.g. to Expr or SyntaxNode), constant-time access to the children is not really required (although the children are being produced backwards, which looks a little funny). To avoid any disruption to downstream users, the `GreenNode` design itself is not changed to use this faster alternative. We can consider doing so in a later PR.
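
To make the reverse iteration concrete, here is a minimal sketch (a hypothetical `FlatNode` type, not the actual `RawGreenNode`; it assumes the node span counts all nodes in a non-terminal's subtree, which is what makes subtree skipping possible):

```julia
struct FlatNode
    kind::Symbol
    byte_span::UInt32
    node_span::UInt32  # 0 for terminals; total descendant node count otherwise
end

# Post-order buffer for `a+b*c`, i.e. (call-i a + (call-i b * c)):
buf = [
    FlatNode(:Identifier, 1, 0),  # a
    FlatNode(:Operator,   1, 0),  # +
    FlatNode(:Identifier, 1, 0),  # b
    FlatNode(:Operator,   1, 0),  # *
    FlatNode(:Identifier, 1, 0),  # c
    FlatNode(:call,       3, 3),  # b*c: 3 bytes, 3 descendant nodes
    FlatNode(:call,       5, 6),  # a+b*c: 5 bytes, 6 descendant nodes
]

# Indices of the children of buf[i], last child first. Each step skips the
# child's whole subtree, so no temporary stack or child vector is needed.
function reverse_children(buf, i)
    out = Int[]
    child = i - 1                     # last child sits just before its parent
    stop = i - Int(buf[i].node_span)  # first index inside the subtree
    while child >= stop
        push!(out, child)
        child -= 1 + Int(buf[child].node_span)
    end
    out
end
```

Here `reverse_children(buf, 7)` visits the inner call, `+`, then `a` (indices 6, 2, 1), mirroring the reverse-order traversal described above.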

Benchmark

The motivation for this change is not performance, but rather representational cleanliness. That said, it's of course imperative that this not degrade performance. Fortunately, the benchmarks show that this is in fact marginally faster for Expr construction, largely because we avoid the additional memory-allocation traffic from representing the tree structure explicitly. Parse time itself is essentially unchanged (which is unsurprising, since we're primarily changing what gets put into the output, although the parser does perform lookback-style operations in a few places).

Benchmark Results (not including https://github.com/JuliaLang/julia/pull/58674)
keno@demeter3:~/JuliaSyntax.jl$ git checkout main
Already on 'main'
Your branch is up to date with 'origin/main'.
keno@demeter3:~/JuliaSyntax.jl$ julia +1.11 --project=. test/benchmark.jl 
┌ Info: Benchmarks
│   ParseStream =
│    BenchmarkTools.Trial: 24 samples with 1 evaluation per sample.
│     Range (min … max):  198.340 ms … 280.989 ms  ┊ GC (min … max): 0.20% … 26.48%
│     Time  (median):     205.381 ms               ┊ GC (median):    0.55%
│     Time  (mean ± σ):   217.751 ms ±  24.016 ms  ┊ GC (mean ± σ):  7.44% ±  8.87%
│    
│      █                                                              
│      █▁▆▄▄▄▁▁▆▁▁▄▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▄▁▁▄▁▄▁▁▄▁▁▁▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
│      198 ms           Histogram: frequency by time          281 ms <
│    
│     Memory estimate: 76.93 MiB, allocs estimate: 719922.
│   GreenNode =
│    BenchmarkTools.Trial: 17 samples with 1 evaluation per sample.
│     Range (min … max):  251.179 ms … 391.227 ms  ┊ GC (min … max):  2.55% … 29.83%
│     Time  (median):     302.519 ms               ┊ GC (median):    18.71%
│     Time  (mean ± σ):   306.213 ms ±  43.110 ms  ┊ GC (mean ± σ):  18.46% ± 10.58%
│    
│      █           ▃                 ▃         ▃                      
│      █▁▇▁▁▁▁▁▁▁▁▇█▁▁▁▁▁▁▁▁▇▇▁▁▁▁▇▁▁█▁▁▁▁▁▁▁▁▁█▁▁▇▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▇ ▁
│      251 ms           Histogram: frequency by time          391 ms <
│    
│     Memory estimate: 153.94 MiB, allocs estimate: 2854628.
│   SyntaxNode =
│    BenchmarkTools.Trial: 8 samples with 1 evaluation per sample.
│     Range (min … max):  507.277 ms … 808.146 ms  ┊ GC (min … max): 14.49% … 43.18%
│     Time  (median):     565.327 ms               ┊ GC (median):    24.67%
│     Time  (mean ± σ):   631.830 ms ± 127.738 ms  ┊ GC (mean ± σ):  30.21% ± 12.49%
│    
│      ██       █ ██                                       █ █     █  
│      ██▁▁▁▁▁▁▁█▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁█ ▁
│      507 ms           Histogram: frequency by time          808 ms <
│    
│     Memory estimate: 327.13 MiB, allocs estimate: 6227301.
│   Expr =
│    BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
│     Range (min … max):  454.268 ms … 571.073 ms  ┊ GC (min … max):  2.00% … 17.37%
│     Time  (median):     505.436 ms               ┊ GC (median):     8.89%
│     Time  (mean ± σ):   513.944 ms ±  41.950 ms  ┊ GC (mean ± σ):  12.32% ±  6.47%
│    
│      █         █   ██   █             █            █       ██    █  
│      █▁▁▁▁▁▁▁▁▁█▁▁▁██▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁██▁▁▁▁█ ▁
│      454 ms           Histogram: frequency by time          571 ms <
│    
│     Memory estimate: 191.81 MiB, allocs estimate: 3791475.
│   flisp =
│    BenchmarkTools.Trial: 2 samples with 1 evaluation per sample.
│     Range (min … max):  3.219 s …   3.333 s  ┊ GC (min … max): 0.00% … 2.48%
│     Time  (median):     3.276 s              ┊ GC (median):    1.26%
│     Time  (mean ± σ):   3.276 s ± 81.178 ms  ┊ GC (mean ± σ):  1.26% ± 1.75%
│    
│      █                                                       █  
│      █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
│      3.22 s         Histogram: frequency by time        3.33 s <
│    
└     Memory estimate: 36.78 MiB, allocs estimate: 1066930.

keno@demeter3:~/JuliaSyntax.jl$ git reset --hard origin/kf/refactorparsestream
HEAD is now at 0f09e28 Fix perf and add compat for :position access
keno@demeter3:~/JuliaSyntax.jl$ julia +1.11 --project=. test/benchmark.jl 
Precompiling JuliaSyntax...
  1 dependency successfully precompiled in 10 seconds
┌ Info: Benchmarks
│   ParseStream =
│    BenchmarkTools.Trial: 22 samples with 1 evaluation per sample.
│     Range (min … max):  207.116 ms … 296.863 ms  ┊ GC (min … max): 0.21% … 27.84%
│     Time  (median):     222.518 ms               ┊ GC (median):    1.35%
│     Time  (mean ± σ):   233.621 ms ±  25.820 ms  ┊ GC (mean ± σ):  7.10% ±  9.18%
│    
│      █ ▁  █ ██▁█  ▁▁▁   ▁         ▁ ▁    ▁    ▁                ▁ ▁  
│      █▁█▁▁█▁████▁▁███▁▁▁█▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█ ▁
│      207 ms           Histogram: frequency by time          297 ms <
│    
│     Memory estimate: 64.29 MiB, allocs estimate: 719899.
│   GreenNode =
│    BenchmarkTools.Trial: 15 samples with 1 evaluation per sample.
│     Range (min … max):  280.020 ms … 450.819 ms  ┊ GC (min … max):  3.68% … 34.52%
│     Time  (median):     336.011 ms               ┊ GC (median):    15.89%
│     Time  (mean ± σ):   351.585 ms ±  59.735 ms  ┊ GC (mean ± σ):  20.77% ± 12.63%
│    
│      ▁▁    █▁  ▁      ▁ ▁  ▁       ▁              █    ▁      ▁  ▁  
│      ██▁▁▁▁██▁▁█▁▁▁▁▁▁█▁█▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁█▁▁█ ▁
│      280 ms           Histogram: frequency by time          451 ms <
│    
│     Memory estimate: 163.32 MiB, allocs estimate: 2924442.
│   SyntaxNode =
│    BenchmarkTools.Trial: 10 samples with 1 evaluation per sample.
│     Range (min … max):  410.218 ms … 694.608 ms  ┊ GC (min … max): 11.38% … 45.49%
│     Time  (median):     521.888 ms               ┊ GC (median):    25.25%
│     Time  (mean ± σ):   525.763 ms ±  97.519 ms  ┊ GC (mean ± σ):  28.24% ± 11.57%
│    
│      █ █ █    █            █  █     █            ██              █  
│      █▁█▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
│      410 ms           Histogram: frequency by time          695 ms <
│    
│     Memory estimate: 238.26 MiB, allocs estimate: 4097225.
│   Expr =
│    BenchmarkTools.Trial: 11 samples with 1 evaluation per sample.
│     Range (min … max):  416.451 ms … 614.564 ms  ┊ GC (min … max):  7.63% … 27.62%
│     Time  (median):     470.478 ms               ┊ GC (median):    12.61%
│     Time  (mean ± σ):   486.044 ms ±  65.065 ms  ┊ GC (mean ± σ):  14.31% ±  9.88%
│    
│      ██ █  █ █       █        █    █         ██                  █  
│      ██▁█▁▁█▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
│      416 ms           Histogram: frequency by time          615 ms <
│    
│     Memory estimate: 175.39 MiB, allocs estimate: 3265119.
│   flisp =
│    BenchmarkTools.Trial: 2 samples with 1 evaluation per sample.
│     Range (min … max):  3.224 s …    3.377 s  ┊ GC (min … max): 0.00% … 2.45%
│     Time  (median):     3.301 s               ┊ GC (median):    1.25%
│     Time  (mean ± σ):   3.301 s ± 107.482 ms  ┊ GC (mean ± σ):  1.25% ± 1.73%
│    
│      █                                                        █  
│      █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
│      3.22 s         Histogram: frequency by time         3.38 s <
│    
└     Memory estimate: 36.78 MiB, allocs estimate: 1066930.

@Keno Keno requested review from KristofferC, c42f and mlechu June 9, 2025 00:22
@Keno Keno force-pushed the kf/refactorparsestream branch from 11cb8fd to 88cd5c1 Compare June 9, 2025 00:22

codecov bot commented Jun 9, 2025

Codecov Report

Attention: Patch coverage is 90.31532% with 43 lines in your changes missing coverage. Please review.

Project coverage is 94.99%. Comparing base (58082d9) to head (dab8423).
Report is 2 commits behind head on main.

Files with missing lines   Patch %   Missing lines
src/parse_stream.jl        89.40%    16 ⚠️
src/tree_cursors.jl        75.00%    14 ⚠️
src/green_node.jl          47.82%    12 ⚠️
src/expr.jl                99.44%     1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #560      +/-   ##
==========================================
- Coverage   96.16%   94.99%   -1.17%     
==========================================
  Files          13       14       +1     
  Lines        4115     4236     +121     
==========================================
+ Hits         3957     4024      +67     
- Misses        158      212      +54     


@Keno Keno requested a review from Copilot June 9, 2025 00:36
@Keno Keno force-pushed the kf/refactorparsestream branch from 88cd5c1 to dab8423 Compare June 9, 2025 02:13
@Keno
Member Author

Keno commented Jun 9, 2025

Fortunately, the benchmarks show that this is in fact marginally faster for Expr construction, largely because we get to avoid the additional memory allocation traffic from having the tree structure explicitly represented

Something appears to have regressed during final cleanup - probably a type instability somewhere. Will take a look.

@Keno
Member Author

Keno commented Jun 9, 2025

Julia has efficient prepend

Looks like I need to apologize to copilot here. While the O(n^2) assertion is false of course, pushfirst! does indeed cause more allocations than push! (seen as extra allocations in the GreenNode constructor above). Further, it appears that push!/reverse! is the fastest way to build these arrays. Although what's concerning to me is that it appears to benchmark even faster than push! without reverse!, so I don't really know what's going on.
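
The push!/reverse! vs pushfirst! comparison can be sketched generically (this is not the PR's benchmark harness, and allocation behavior varies by Julia version, so no numbers are claimed here):

```julia
# Build [n, n-1, ..., 1] two ways: growing from the front vs. appending
# and reversing in place once at the end.
function build_pushfirst(n)
    v = Int[]
    for i in 1:n
        pushfirst!(v, i)   # may shift/copy as the front fills up
    end
    v
end

function build_push_reverse(n)
    v = Int[]
    for i in 1:n
        push!(v, i)        # amortized O(1) append
    end
    reverse!(v)            # single in-place pass
end

# Compare with e.g. `@allocated build_pushfirst(10^5)` vs
# `@allocated build_push_reverse(10^5)` on your Julia version.
```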

@Keno
Member Author

Keno commented Jun 9, 2025

I think that particular mystery can be solved another day. The best thing to do is just to avoid using GreenNode at all and iterating directly. This PR is done from my perspective.

@Keno
Member Author

Keno commented Jun 9, 2025

@c42f not sure if you're around currently, but if you are, your input would be appreciated.

"(call-i (call-i a::Identifier *::Identifier b::Identifier) +::Identifier c::Identifier)"

@test sprint(highlight, t[1][3]) == "a*b + c\n# ╙"
@test sprint(highlight, t.source, t.raw, 1, 3) == "a*b + c\n# ╙"
Member

why is this test removed?

Member Author

Reaches into the internals of SyntaxNode which changed. The line prior shows that the SyntaxNode itself can still be highlighted with the same result.

Keno added a commit that referenced this pull request Jun 9, 2025
This builds on top of #560 and replaces the use of `SyntaxNode`
in hooks.jl by the new lower-level cursor APIs. This avoids allocating
two completely separate representations of the syntax tree.

As a result, the end-to-end parse time for error-containing code is between 1.5x
(if the error is the first token) and 2x (if the error is the last
token) faster than current master. However, the main motivation here
is just to reduce coupling between the Expr-producing and SyntaxNode
producing parts of the code.
@Keno
Member Author

Keno commented Jun 9, 2025

I'm pretty happy with this. I think my biggest concern is that the reverse-children API is a little awkward, though it's not terrible.

I think we could either

  1. change the output to pre-order by reserving a node on mark and filling it in on emit
  2. reverse the output in place after parse

I think 1. would be best, but I don't know if there are ever any places where we re-use a mark, which wouldn't work in this scheme.

@Keno
Member Author

Keno commented Jun 9, 2025

I think 1. would be best, but I don't know if there are ever any places where we re-use a mark, which wouldn't work in this scheme.

To answer my own question, collecting a few places:

  • # Parse invalid binary operators
    #
    # Having this is unnecessary, but it improves error messages and the
    # error-containing parse tree.
    #
    # a--b ==> (call-i a (error) b)
    function parse_invalid_ops(ps::ParseState)
        mark = position(ps)
        parse_expr(ps)
        while (t = peek_token(ps); kind(t) in KSet"ErrorInvalidOperator Error**")
            bump_trivia(ps)
            bump_dotsplit(ps)
            parse_expr(ps)
            emit(ps, mark, is_dotted(t) ? K"dotcall" : K"call", INFIX_FLAG)
        end
    end
  • JuliaSyntax.jl/src/parser.jl

    Lines 999 to 1029 in eceaa39

    # Parse left to right, combining any of `chain_ops` into one call
    #
    # flisp: parse-with-chains
    function parse_with_chains(ps::ParseState, down, is_op, chain_ops)
        mark = position(ps)
        down(ps)
        while (t = peek_token(ps); is_op(kind(t)))
            if ps.space_sensitive && preceding_whitespace(t) &&
                    is_both_unary_and_binary(t) &&
                    !preceding_whitespace(peek_token(ps, 2))
                # The following is two elements of a hcat
                # [x +y] ==> (hcat x (call-pre + y))
                # [x+y +z] ==> (hcat (call-i x + y) (call-pre + z))
                # Conversely the following are infix calls
                # [x +₁y] ==> (vect (call-i x +₁ y))
                # [x+y+z] ==> (vect (call-i x + y z))
                # [x+y + z] ==> (vect (call-i x + y z))
                break
            end
            bump_dotsplit(ps, remap_kind=K"Identifier")
            down(ps)
            if kind(t) in chain_ops && !is_decorated(t)
                # a + b + c ==> (call-i a + b c)
                # a + b .+ c ==> (dotcall-i (call-i a + b) + c)
                parse_chain(ps, down, kind(t))
            end
            # a +₁ b +₁ c ==> (call-i (call-i a +₁ b) +₁ c)
            # a .+ b .+ c ==> (dotcall-i (dotcall-i a + b) + c)
            emit(ps, mark, is_dotted(t) ? K"dotcall" : K"call", INFIX_FLAG)
        end
    end
  • JuliaSyntax.jl/src/parser.jl

    Lines 1091 to 1117 in eceaa39

    # flisp: parse-where-chain
    function parse_where_chain(ps0::ParseState, mark)
        ps = ParseState(ps0, where_enabled=false)
        while peek(ps) == K"where"
            bump(ps, TRIVIA_FLAG) # where
            bump_trivia(ps)
            k = peek(ps)
            if k == K"{"
                # x where \n {T} ==> (where x (braces T))
                # x where {T,S} ==> (where x (braces T S))
                # Also various nonsensical forms permitted
                # x where {T S} ==> (where x (bracescat (row T S)))
                # x where {y for y in ys} ==> (where x (braces (generator y (iteration (in y ys)))))
                m = position(ps)
                bump(ps, TRIVIA_FLAG)
                ckind, cflags = parse_cat(ps, K"}", ps.end_symbol)
                emit_braces(ps, m, ckind, cflags)
                emit(ps, mark, K"where")
            else
                # x where T ==> (where x T)
                # x where \n T ==> (where x T)
                # x where T<:S ==> (where x (<: T S))
                parse_comparison(ps)
                emit(ps, mark, K"where")
            end
        end
    end

Keno added a commit that referenced this pull request Jun 9, 2025
As I mentioned in #560, and as contemplated in #536, I'd like to try re-using
JuliaParser infrastructure to replace parsers I've written for some other languages.
This takes the first step to do so by moving various files into directories depending
on whether they are language-dependent or not. Right now there is still some coupling
and of course, there are no actual abstractions between these pieces. The idea would
be to introduce those over time. For now, if we put in this refactoring, the way to
use this would be to copy the appropriate pieces (at least `core/`) into your
downstream parser and then rewrite it to those APIs. I'm planning to do that with
a parser or two to see if I hit any big API issues and see what it would take
to actually make the re-use happen.
@mlechu mlechu left a comment

I like the new data structures!

reverse the output in place after parse

The current ordering (post-order DFS, children in source order) makes more sense in my head. Reversing the list (pre-order DFS, children reversed) wouldn't change the need to iterate through children in reverse, though we would be moving forward through the list.

end

# Get Julia value of leaf node as it would be represented in `Expr` form
function _expr_leaf_val(node::SyntaxNode)
Member

JuliaLowering assumes this is here (it's unused there too, so it could probably be deleted).

"""
GreenTreeCursor
Represents a cursors into a ParseStream output buffer that makes it easy to
Member

Suggested change
Represents a cursors into a ParseStream output buffer that makes it easy to
Represents a cursor into a ParseStream output buffer that makes it easy to

Though this comment may not be necessary

An AST node with a similar layout to `Expr`. Typically constructed from source
text by calling one of the parser API functions such as [`parseall`](@ref)
"""
const SyntaxNode = TreeNode{SyntaxData}
Member

Field types in the docstring need updating (github doesn't let me comment on those lines though)

Comment on lines 59 to 60
raw::RawGreenNode
byte_end::UInt32
Member

Would it make sense to replace this with a tree cursor? REPL, JuliaLowering, maybe others assume the raw children are accessible from node.raw, though I'm not certain that information wouldn't be redundant given that we will probably always have the SyntaxNode on hand in cases like these.

Member Author

That depends a bit on whether these are traversal trees or modification trees. We're a bit inconsistent with it at the moment. If they're modification trees, then we can't necessarily assume that we still have the flat buffer of raw nodes to index into. My recommendation would be to remove .raw accesses and define appropriate accessors on the syntax node for everything that is required. That way we can adjust the data structure to its actual needs later.

@mlechu mlechu Jun 10, 2025

depends a bit on whether these are traversal trees or modification trees

If I understand correctly, you're saying that the SyntaxNode and raw green tree buffer would need to stay in sync, and there would be no good or sane way to modify them—isn't this currently the case, too?

define appropriate accessors on the syntax node for everything that is required

Looking through a SyntaxNode's children will only give us the RawGreenNodes that have associated SyntaxNodes, so we would lose some information.

Member Author

So there's two possible designs. One is that SyntaxNode is just a better traversal cursor that provides O(1) access to children and parents (like RedTreeCursor - but storing the indices - possibly I misnamed the red tree cursor, since it traditionally has those algorithmics in addition to computing absolute byte positions). The other is that it's a full concrete syntax tree. The difference is what happens when you want to modify the tree. If it's a traversal tree, you modify the underlying buffer and throw away all cursors because they got invalidated (and you need to update raw green nodes going up the stack). On the other hand, if it's a modification tree, then the way you update is that you change the val inside the tree and then you could print source by iterating over the leaves and printing their contents. Right now, SyntaxNode is a bit of a hybrid, because it's mutable and explicitly stores val, but then it also has position, which you'd generally only do for a traversal tree. My suggestion would be that we separate these explicitly. Anything that only needs traversal gets the immutable traversal data structure (i.e. RedTreeCursor + parent/child indices). And then we have a separate data structure for clients like formatters that actually want to change the code.

Member

That sounds like a reasonable design. I'll still argue that we get more (and more backwards-compatible) information for no additional bytes per node with a vector reference and index instead of the RawGreenNode, but I'm also biased by the things I've worked on so far (where everything wants SyntaxNodes and nothing wants to modify them). I think you would have a better idea of usage in the ecosystem.

JuliaFormatter is not a consideration until it stops being stuck on JuliaSyntax 0.4 :) (this is supposed to be on my to-do list for JETLS...)

but then it also has position, which you'd generally only do for a traversal tree

Is byte_end different?

@Keno Keno Jun 11, 2025

vector reference and index instead of the RawGreenNode

Well, but you need to keep the vector, which you don't at the moment.

Is byte_end different?

No, that's basically just a rename (it changed from the start to the end location to be consistent with the underlying structure, but is otherwise semantically identical).

Member

Well, but you need to keep the vector, which you don't at the moment

You're right, I was assuming the parse stream was already being held onto somewhere, which isn't necessarily true.

@Keno
Member Author

Keno commented Jun 10, 2025

Reversing the list (pre-order DFS, children reversed)

I didn't mean a literal reversal, but rather a tree-ordered reversal that changes the traversal order. It's a relatively easy thing to do, since it's possible to directly compute the position that a node would have in a pre-order traversal. That said, it is probably the case that most people will be creating appropriate traversal trees anyway, so it may not matter as much.

@Keno
Member Author

Keno commented Jun 11, 2025

I think we can put .raw back to GreenNode for now. It's quite space wasteful, but I didn't realize quite how many downstream libraries were using it. Then we can keep iterating on a new traversal API afterwards.

@mlechu
Member

mlechu commented Jun 11, 2025

OK with me. I think it should be doable to use the non-pointery green trees in .raw (are they isomorphic to GreenNode?) if you don't plan to remove the field, but it's up to you whether you want to deal with that change now.

More packages than I thought are relying on `raw` being a GreenNode.
As discussed on the PR, we'd probably like to do more work on
traversal trees anyway, so this keeps things consistent for
downstream while we do that; once we have something better,
downstream can switch.
@Keno
Member Author

Keno commented Jun 11, 2025

Alright, restored raw to be GreenNode. @mlechu can you check that this works as intended for your JuliaLowering use case? If so, I'd like to get this merged, so I can do some of the extra follow on work.

Co-authored-by: Em Chu <61633163+mlechu@users.noreply.github.com>
@mlechu
Member

mlechu commented Jun 11, 2025

I think I'll need to adapt `to_expr` to work again for JuliaLowering's `Base.Expr(ex::SyntaxTree)`, but 699/709 tests pass, so I wouldn't consider it blocking. REPL completions also appear to work.

@mlechu
Member

mlechu commented Jun 11, 2025

@c42f Feel free to chime in or put things back whenever you get back; I'm not certain where some of the more generic code should live

@Keno
Member Author

Keno commented Jun 12, 2025

Alright, I'm gonna call that good enough. Juggling all my branches is starting to be annoying. @c42f Still very interested in your thoughts - so whenever you're back, please do chime in if you have thoughts.

@Keno Keno merged commit 36450b0 into main Jun 12, 2025
36 checks passed
@Keno Keno deleted the kf/refactorparsestream branch June 12, 2025 00:32
Keno added a commit that referenced this pull request Jun 12, 2025
This builds on top of #560 and replaces the use of `SyntaxNode`
in hooks.jl with the new lower-level cursor APIs. This avoids allocating
two completely separate representations of the syntax tree.

As a result, the end-to-end parse time for error-containing code is between 1.5x
(if the error is the first token) and 2x (if the error is the last
token) faster than current master. However, the main motivation here
is just to reduce coupling between the Expr-producing and SyntaxNode
producing parts of the code.
Keno added a commit that referenced this pull request Jun 12, 2025
As I mentioned in #560, and as contemplated in #536, I'd like to try re-using
JuliaParser infrastructure to replace parsers I've written for some other languages.
This takes the first step to do so by moving various files into directories depending
on whether they are language-dependent or not. Right now there is still some coupling
and of course, there are no actual abstractions between these pieces. The idea would
be to introduce those over time. For now, if we put in this refactoring, the way to
use this would be to copy the appropriate pieces (at least `core/`) into your
downstream parser and then rewrite it to those APIs. I'm planning to do that with
a parser or two to see if I hit any big API issues and see what it would take
to actually make the re-use happen.

- core: Core functionality for parsing
- julia: Core functionality for parsing *julia*
- integration: Integration code to use as the parser for base
- porcelain: Other syntax tree types for external users of the package

The `integration` and `porcelain` components should not depend on each
other. Otherwise it's layered as expected. This is just the reorganization.
Additional work is required to actually separate the abstractions.
Keno added a commit that referenced this pull request Jun 13, 2025
A I mentioned in #560, and as contemplated in #536, I'd like to try re-using
JuliaParser infrastructure to replace parsers I've written for some other languages.
This takes the first step to do so by moving various files into directories depending
on whether they are language-dependent or not. Right now there is still some coupling
and of course, there are no actual abstractions between these pieces. The idea would
be to intrduce those over time. For now, if we put in this refactoring, the way to
use this would be to copy the appropriate pieces (at least `core/`) into your
downstream parser and then rewrite it to those APIs. I'm planning to do that with
a parser or two to see if I hit any big API issues and see what it would take
to actually make the re-use happen.

- core: Core functionality for parsing
- julia: Core functionality for parsing *julia*
- integration: Integration code to use as the parser for base
- porcelain: Other syntax tree types for external users of the package

The `integration` and `porcelain` components should not depend on each
other. Otherwise it's layered as expected. This is just the reorganization.
Additional work is required to actually spearate the abstractions.
mlechu added a commit to mlechu/JuliaLowering.jl that referenced this pull request Jun 25, 2025
Changes are needed to work with the latest JuliaSyntax (particularly
JuliaLang/JuliaSyntax.jl#560), but that isn't in Base yet, and more
changes are likely to come, so I'm holding off on updating
JuliaLowering at the moment.
@c42f
Member

c42f commented Jul 21, 2025

Still very interested in your thoughts - so whenever you're back, please do chime in if you have thoughts.

Heya, I'm back.

It looks like this PR addresses two of the vague TODOs which were in my head for a while!

  1. An immutable tree data structure living in a single array.
  2. A cursor interface which might eventually replace some uses of the inefficient pointer-y trees we have right now

So that's great :)

It's a nice surprise that the tree structure can be more manifest inside `ParseStream` itself. The previous `build_tree()` code was quite a nightmare to get right and this clearly simplifies things a lot.

I think the benchmarks show this data layout is generally a bit slower (except for Expr construction)? Not a huge issue in practice I think.

My main reservation is the reverse-order iteration of children which feels very awkward even for a low-level API. I don't see how we can avoid this at the lowest level - ParseStream is inherently in reverse order without a post-processing step.

Ok, so we want a post-processing step? We do already have more than one post-processing step in the system; primarily that is `build_tree()`, but we also have `validate_tokens()`. So I guess the right solution here is to have an efficient immutable flat buffer representation of a tree which is the output of `build_tree()`. Naturally that would be an eventual replacement for GreenNode, and the cursors would iterate children forward in such a representation. If possible, we also want an O(1) implementation of `numchildren()` in that representation, I think, as it's still common for the number of children to affect the semantics of the ith child, at least in the current Julia AST.
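The post-processing step described here could be as small as one pass over the post-order buffer (a hedged Python sketch with the buffer reduced to descendant counts; not JuliaSyntax code): precompute each node's children in forward order, after which `numchildren()` and ith-child lookup are O(1).

```python
def build_child_table(spans):
    """spans[i] = number of descendants of node i in a post-order DFS buffer.
    Returns, for each node, the indices of its direct children in forward
    (source) order. Total work is O(n) since each node is some node's child
    at most once."""
    children = [[] for _ in spans]
    for i in range(len(spans)):
        # Walk backwards through i's descendant range [i - spans[i], i - 1],
        # hopping over each child's subtree; this visits children last-first.
        j = i - 1
        while j > i - 1 - spans[i]:
            children[i].append(j)
            j -= spans[j] + 1
        children[i].reverse()  # restore forward order
    return children
```

With the four-node example from earlier in the thread (spans `[0, 0, 1, 3]`), the root at index 3 gets children `[0, 2]`, in forward order, and node 2 gets `[1]`.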

@Keno
Member Author

Keno commented Jul 21, 2025

I think the benchmarks show this data layout is generally a bit slower (except for Expr construction)? Not a huge issue in practice I think.

The combination of this branch and JuliaLang/julia#58674 is faster than the version before my changes. I considered that good enough, although there were some perf opportunities left on the table. I don't think the representation is fundamentally slower for any reason.

If possible we also want O(1) implementation of numchildren() I think, in that representation, as it's still common for the number of children to affect the semantics of the ith child, at least in the current Julia AST.

My immediate plan was to add a tree cursor that caches the O(1) indices on construction and see if that's good enough. I think one of the primary requirements of the parsed representation is compactness and the O(1) information is redundant. I also think that if we push the O(1) behavior into the cursor API layer for now, that will let us change the immutable datastructure under the hood later based on actual measurement results for different clients (and perhaps different clients may actually want different tradeoffs).
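A cursor along these lines might look roughly like this (a hedged Python sketch with invented names, not the planned JuliaSyntax API): the compact post-order buffer stays as-is, and the cursor pays a small one-time cost at construction to cache its children's positions, making `numchildren()` and `child(i)` O(1) thereafter.

```python
class TreeCursor:
    """Illustrative cursor over a post-order span buffer, where spans[i] is
    the number of descendants of node i. Child positions are cached at
    construction; the buffer itself stays compact and unmodified."""

    def __init__(self, spans, index=None):
        self.spans = spans
        self.index = len(spans) - 1 if index is None else index  # default: root
        # Cache direct children: walk backwards over the descendant range,
        # hopping over each child's subtree (visits children last-first).
        self._kids = []
        j = self.index - 1
        while j > self.index - 1 - spans[self.index]:
            self._kids.append(j)
            j -= spans[j] + 1
        self._kids.reverse()  # forward (source) order

    def numchildren(self):
        return len(self._kids)  # O(1) after construction

    def child(self, i):
        return TreeCursor(self.spans, self._kids[i])
```

This matches the point about pushing O(1) behavior into the cursor layer: the caching strategy can change without touching the immutable buffer underneath.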

@c42f
Member

c42f commented Jul 21, 2025

Sure, having a cursor which caches whatever setup for forward traversal and an O(1) version of numchildren, etc, seems reasonable.

I don't think there's much functional difference between that and a hypothetical "new version of GreenNode".

There is a conceptual difference: calling these things cursors suggests to the user that they should be temporary. If compactness is key (unknown, but plausible), that's the right signal to send to users.
