Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: libsyntax2.0 #2256

Closed
wants to merge 4 commits into from
Closed

RFC: libsyntax2.0 #2256

wants to merge 4 commits into from

Conversation

matklad
Copy link
Member

@matklad matklad commented Dec 23, 2017

Hi!

This RFC proposes to change AST and parser used by rustc, so as to create a solid base on top of which a great IDE support can be developed. The RFC is largely informed by my experience developing IntelliJ Rust and by my experiments with IDE-ready syntax tree in Rust in fall. I am not so knowledgeable about the internals of rustc and especially about macro expansion machinery, so input from the compiler team would be very valuable!

@rust-lang/compiler @rust-lang/dev-tools @nrc @jseyfried

Rendered

@mark-i-m
Copy link
Member

👍 x1000000

The fact that rust has neither a stable official correct parser nor an official correct grammar is very annoying. I think this is a major step towards making tooling easier to contribute to and more complete/stable.

@matklad
Copy link
Member Author

matklad commented Dec 23, 2017

Note that that the RFC explicitly does not propose to create an official grammar. There's https://github.com/nox/rust-rfcs/blob/master/text/1331-grammar-is-canonical.md for that.

However, I do hope to produce a comprehensive, progressive test-suite for rust parsers.

and other tools would require massive refactorings.

- Proposed syntax tree requires to keep the original source code
available, which might increase memory usage of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already do that, and Span refers to the source - we even load source of other crates on-demand (ignoring the contents if their hash has changed).
In general, I'd recommend discussing the existing implementation details of rustc with @rust-lang/compiler.

@eddyb
Copy link
Member

eddyb commented Dec 23, 2017

While this model doesn't use ADTs like rustc does, libsyntax's AST has more or less 3 main uses left in the compiler: macro expansion, name resolution (the two of which are intertwined) and lowering to HIR (which the rest of the compiler works with).

So as long as those 3 tasks can be performed without much difficulty, the actual representation of the AST doesn't matter much to the compiler itself.

My preference would be something auto-generated from an official grammar, such that everything matches the names of rules and whatnot from there. I am biased towards schemes which allow reuse of parse results, e.g. for when a macro or derive uses an input expression / type multiple times.

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, firstly, I was under the impression that syn was the end goal of "stable libsyntax"

That said, syn doesn't handle comments. But still, having the library live in the crates ecosystem sounds like the best way to deal with this. For one, we don't have to tiptoe around having a typed API because it's fine to do major version updates.

For another, this way it doesn't impact the compiler; I'm concerned that this will lead to performance and readability issues since we're losing a lot of the typed AST if merged into the compiler.

As an "officially maintained crate" this makes sense, but less so as a replacement of libsyntax IMO.

}
}

pub const ERROR: NodeKind = NodeKind(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we should do this with extensible enums instead. Those don't exist in rust, but can be added as an internal only feature.

Copy link
Member Author

@matklad matklad Dec 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, one of the open implementation questions is how to store error payload. There are several workable approaches, but none of them screams being the correct one.

However, I do think that there should be only one error node kind in the ast, whose purpose is to contain tokens which are skipped by parse during error recovery.

@Manishearth
Copy link
Member

cc @mystor

@mystor
Copy link

mystor commented Dec 23, 2017

syn has quite different goals from a theoretical stable "libsyntax". syn aims to be correct enough for use by procedural macros, but is willing to sacrifice a lot of functionality to reduce build times.

For example, syn puts no effort into producing good error messages, because it is assumed that all input to it is well-formed. syn error messages usually read "unable to parse a Item" or something along the lines of that, because it does not track what went wrong.

syn is also being rewritten on top of the proc-macro API. This API is good for procedural macros, but isn't necessarily the sort of API which you'd want to build a general rust syntax parser on top of. For example, the span information exposed by proc-macro is intentionally limited to add flexibility in the implementation of rustc, but we would want to allow access to lower-level details in a theoretical libsyntax rewrite.

It's likely that at some point the ecosystem will either grow another rust parser, or syn will grow enough feature flags that it can start to pick up these sorts of use cases. I'm not sure which solution is preferable.

In general I think I support the idea of making a feature-complete alternative to libsyntax outside of the compiler, and I think it fills a different niche than syn. However, I doubt that the timeline for merging these efforts back into rustc will be short.

@mystor
Copy link

mystor commented Dec 23, 2017

Also, cc @dtolnay if he hasn't been pinged already, as he wrote most of syn.

@scottmcm scottmcm added the T-compiler Relevant to the compiler team, which will review and decide on the RFC. label Dec 24, 2017
@matklad
Copy link
Member Author

matklad commented Dec 24, 2017

My preference would be something auto-generated from an official grammar,

I think we could and should generated typed AST layer, but the parser itself is better be hand-written.

Both IntelliJ and fall generate the parsing code as well, which is very useful because maintaining the parser is relatively low-effort. But the generated parsers have suboptimal error reporting, performance and correctness (because Rust already has some funny aspects of the grammar which are not natural to express in a declarative way).

While hand-writing a parser takes more effort, I think it's a good investment long-term. And given that we already have a hand-written parser, it may actually save the time :)

One problem with the generated AST and hand-written parser is that there's no guarantee that they match, but this is not a big problem in practice: such bugs are easy to notice, fix and test for.

@matklad
Copy link
Member Author

matklad commented Dec 24, 2017

By the way, I am still looking for a good term describing such non-Abstract Syntax Trees. So if anyone knows one, please let me know!

@matklad
Copy link
Member Author

matklad commented Dec 24, 2017

I am biased towards schemes which allow reuse of parse results, e.g. for when a macro or derive uses an input expression / type multiple times.

Reusing subtrees is also important for incremental reparsing, so I am very interested in allowing this as well. To do this, we should store lengths instead of offsets in syntax tree nodes, and store (lazily-calculated) offsets a the file level.

@matklad
Copy link
Member Author

matklad commented Dec 24, 2017

I'm concerned that this will lead to performance and readability issues since we're losing a lot of the typed AST if merged into the compiler.

@Manishearth please take a closer look at the Typed Tree section. I claim that it's possible to regain full type safety, by layering newtype wrappers on top of raw syntax tree.

It's easier to see this in action, so here's some example code from fall. This is the code for "add impl" action, which turns this:

struct Foo<X, Y: Clone> {}

into this

struct Foo<X, Y: Clone> {}

impl<X, Y: Clone> Foo<X, Y> {

}

Note how precisely the function in question characterizes the set of applicable syntactical constructs: T: NameOwner<'f> + TypeParametersOwner<'f>. These Owner traits are generated from the grammar and, for example, structs and enums implement them. You can even combine typed and untyped access in a single visitor :) In general, my experience from fall is that this two tiered implementation in Rust is a never ending stream of goodness.

It's true though that there are some performance considerations: there's no identifier interning built in, and accessing a child of particular type is a liner scan of all children. However, this tree should be converted to HIR early on, and it's always possible to create a side table which maps syntax tree nodes to arbitrary cached data, because the node is an integer index.

@matklad
Copy link
Member Author

matklad commented Dec 24, 2017

@Manishearth I'd like to reiterate that "stable libsyntax" is an explicit non-goal of this RFC. The intended clients of the library (if everything goes well, of which there's no certainty) are rustc, rls, rustfmt and clippy.

I also agree with @mystor that syn has slightly different constraints and that syn, and not libsyntax2.0, would be a better syn.

@matklad
Copy link
Member Author

matklad commented Dec 24, 2017

While hand-writing a parser takes more effort, I think it's a good investment long-term.

To expand on this a bit, I think the ideal would be to have two parsers in the compiler.

  1. An LR parser, generated directly from the grammar by something like LALRPOP, which does zero error reporting and error recovery, but is bloody correct and bloody fast.

  2. Something like proposed libysntax2.0, which does sophisticated error recovery and can parse anything that resembles rust into some sort of the tree.

The command line compiler would then use LR, and only LR parser for compilation. However, if LR parser fails to parse some file, compiler would reparse it with the libsyntax2.0 to give a nice error report. The IDE stuff would use libsyntax2.0.

@Centril
Copy link
Contributor

Centril commented Dec 24, 2017

@matklad That's sounds like a jolly good idea! An added benefit is that you then may define an Arbitrary impl for the AST / Token tree itself and use property based testing to check that forall ast. parser_1(show(ast)) == parser_2(show(ast)). You've then essentially tested that the sophisticated parser conforms to the spec. The downside can be a duplication of effort to some small extent but may be well worth it?

@Manishearth
Copy link
Member

@Manishearth please take a closer look at the Typed Tree section. I claim that it's possible to regain full type safety, by layering newtype wrappers on top of raw syntax tree.

Oh, I saw that, I'm saying we don't need that weird intermediate layer if we have extensible enums (or if we keep it out of tree so that it can be versioned independently)

I'd like to reiterate that "stable libsyntax" is an explicit non-goal of this RFC. The intended clients of the library (if everything goes well, of which there's no certainty) are rustc, rls, rustfmt and clippy.

Well, if we're not trying to stabilize it, then why the major changes? Your main point against current libsyntax is that it's not a pure function -- but this can fixed in a focused RFC!

I think this RFC as currently stated needs a lot more work clarifying what axes we're evaluating things against, and why the new proposal is better under those axes.

I also feel like the parser that IDEs and rustfmt need is different from the parser that the compiler needs (IDEs and rustfmt want parsers that are more lax with dealing with errors, and preserve comments); which only further makes me feel like having a separate officially-maintained library of which rustc is not a client would help. You yourself put forth two parsers -- only one of these is necessary in the compiler -- why not maintain this outside of tree?


TLDR, before moving forward here I'd like to see:

  • better motivation as to why current libsyntax needs change
  • better understanding of the axes along which we are evaluating libsyntax
  • better clarification as to why the proposal is better along those axes
  • more reasoning as to why this should be made part of rustc and not officially maintained out of tree (which makes it easier to version and also doesn't saddle rustc with additional complexity)

@matklad
Copy link
Member Author

matklad commented Dec 24, 2017

@Centril this is a little bit off topic, but the property based testing should proceed in slightly different manner :)

First, note that the proposed tree structure does not really have a pretty-print/show operation, because it simply does not exist without underlying source text. Second, if you generate AST you fail to check that both parsers fail to parse invalid code.

So the better property would be forall text. parser_1(text).is_err() and parser_2(text).is_err() or (parser_1(text).unwrap() == parser_2(text).unwrap()).

And to generate arbitrary text, you can take existing rust code for valid inputs, and for invalid inputs you can some valid code and cut&paste fragments of it over itself. This is how fall checks that incremental and full reparsers are equivalent (1, 2).

@matklad
Copy link
Member Author

matklad commented Dec 24, 2017

Yeah, totally agree that the motivation section needs more work!

My main mistake is that I overemphasized "pure function" part, while my main point is actually that it's difficult to build great tooling on top of current lossy libsyntax.

However, I don't want to dive too deep into discussion why libsyntax is worse for tooling than the proposed solution simply because libsyntax actually exists and libsyntax2.0 is a little more than a pipe dream at the moment. And essentially what this RFC proposes is "let's agree that building a prototype of libsytnax2.0 using the proposed tree structure is a good idea". No harm will be done to the compiler as a part of this RFC :) I am posting this as an RFC and not as a discussion on IRC or internals because it's a rather ambitions project anyway, and I would like to gather community consensus.

@Manishearth
Copy link
Member

And essentially what this RFC proposes is "let's agree that building a prototype of libsytnax2.0 using the proposed tree structure is a good idea"

Right, but it's not clear as to why the proposed structure is better :)

(it would be worth making it very explicit that this RFC does not try to replace libsyntax, just pave a path for something which may eventually replace it but through a different RFC, since otherwise there will be two separate proposals being discussed in tandem)

@matklad
Copy link
Member Author

matklad commented Dec 24, 2017

The one question I want to be answered before building a prototype is whether it is at all feasible to use proposed implementation in the compiler, or will macros through a wrench into the works? So I am eagerly waiting for @nrc or @jseyfried to clarify the following points

  1. Currently, parsing, name resolution and macro expansion are intertwined. Will it be possible to separate parsing into a separate step, so that resolve and expansion call into parser and use sytnax tree, but the parser itself knows nothing about macro expansion?

  2. Is it possible to parse a rust source file in isolation, without knowing it's location on the file system, the set of previously parsed files, etc. Looks like include_str! and include_bytes! and include! are not a problem?

  3. Is it possible to base macro expansion on top of the proposed tree structure? Currently, macro expander operates with token trees, which store hygiene information within themselvs (If I understand the source code correctly). Will it be possilbe to use the text instead of token trees, and store hygiene on the side?

@matklad
Copy link
Member Author

matklad commented Dec 24, 2017

So, to actual discussion of why the proposed structure is better, and what metrics are used to measure this "better" :)

The crucial point here is that, to make writing good tooling possible, the syntax tree must be lossless, it must explicitly account for comment nodes and whitespace. It also must be capable of representing partially-padsed code. For example, for

fn main() {
    foo(xs, 1
}

the parser must produce a function call node for foo(xs, 1 and parameter nodes for xs and 1.

However, I don't think I can back this claim up with something else then "it's obvious that it is supposed to work this way": all IDE stuff I've written was using lossless trees. Maybe I am wrong and it is actually quite possible to create a perfect IDE support on top of conventional AST.

But, if we assume "losslessness" as an axiom, then representing syntax as a generic tree which points to ranges in the original string seems to be the minimal and natural approach?

The trick with typed wrappers seems more weird of course, but, in my personal experience, its a great way to represent ASTs.

@Manishearth, I think I don't fully understand your point about extensible enums, I though it was about making error specifically an enum like enum SyntaxError { MissingSemicolon, ... #[doc(ignore)] __not_exhaustive}? Or is it about some way of representing single inheritance OOP? Could you give an example?

The problem with AST is that AST nodes don't naturally form a hierarchy, and you need all of structs, enums and traits to work with them in a type-safe manner. For example, you need a concrete struct for nodes like "if expression", "structure definition", you'll need an enum for "any expression" and a trait for "stuff with generic parameters, which also may have a where clause". The proposed implementation allows to generate an arbitrary collection of types for representing AST nodes, while using a plain Copy index as the internal representation, which seems quite nifty to me!

So here are the axis which differentiate between libsyntax and proposed libsyntax2

  1. losslessness of the tree
  2. Isolation and reusability of API (even if we move libsyntax to crates.io, we'll probably need to move stirng interner with it as well, because you want interned identifiers in real AST)
  3. Conceptual elegance (hugely subjective, of course!)
  4. Similar representation is successfully used in compiler+IDE use case.

@matklad
Copy link
Member Author

matklad commented Dec 24, 2017

more reasoning as to why this should be made part of rustc and not officially maintained out of tree

I am not entirely sure if IDE and compiler should use separate parsers or a single one (some discussion on internals 1). Originally I was of the opinion the their requirements differ to much, and macros make this whole idea impossible, but now I think that it is in fact feasible to use the single parser.

The obvious benefit is of course code reuse.

The less obvious benefit is that it's not entirely clear where compiler ends and IDE starts, and the syntax tree seems to be one natural boundary. It's useful to share the actual tree data structure between the compiler and the IDE, because the IDE can reparse file incrementally, and compiler has to do a full reparse (or it has to learn about text editing). It's also interesting to think about how compiler/IDE reports errors. For sure the error must be detected by the compiler. But IDE must be able to suggest a quick fix for it, and quick fixes are all about text editing, file formatting and interacting with the user. So it would be nice if the compiler could collect some context during checks, and then pass this context information to some layer which either prints it to stdout as an error message, or suggest a quick fix in IDE. It would be nice if this context information could use a language of syntax tree!

Macing macro expansion to work with IDE tree also allows some nifty features like live debugging macro expansions, for example :)

@Manishearth
Copy link
Member

However, I don't think I can back this claim up with something else then "it's obvious that it is supposed to work this way": all IDE stuff I've written was using lossless trees.

Nah, this is still useful data 😄

And I see why lossless trees work well here!

My preferred way of representing this would be to use a complete AST as much as possible, and have nodes like ItemKind::PartialItem, PartialExpr which are more like lex trees with some parsed info when things are partial. But I can see why that's problematic.

Still, worth putting this other proposal up there, I don't know if it's actually better -- you're far more experienced here than I 😄

I think I don't fully understand your point about extensible enums, I though it was about making error specifically an enum like enum SyntaxError { MissingSemicolon, ... #[doc(ignore)] __not_exhaustive}? Or is it about some way of representing single inheritance OOP? Could you give an example?

It's not really relevant anymore, but I was under the impression you were doing the untyped tree so that you wouldn't have a stability problem -- and this is better tackled by having a typed tree where you are never allowed to have an exhaustive match on an enum, you must have a wildcard branch.

Stability was not the actual motivation here, as you clarified, so this is a moot point 😄

The less obvious benefit is that it's not entirely clear where compiler ends and IDE starts, and the syntax tree seems to be one natural boundary.

I disagree, but this is pretty subjective anyway 😄 . But yeah, this isn't something we need to discuss in this RFC.


Reading through the RFC again with the motivations in mind I think it's much clearer why we should do this. I like the design! Still feel like the "full AST where possible" design might be better but I'm not sure.

Might leave more specific comments later.

whitespaces and desugar some syntactic constructs in terms of the
simpler ones.

In contrast, for IDEs it is crucial to have a lossless view of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best to include explicit examples, "lossless view" wasn't clear to me till you explained it in the comments.

So something like:

IDEs need to be able to handle partial code like and still help you with autocompletion while continuing to work with the surrounding, non-partial, code. For this to work we need to be able to represent this losslessly, ....



```
FILE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering: How do we figure out the best way to interpret a partial tree? Should we rely on indentation as a hint? Are there well-known solutions here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the array of tricks I am aware of:

  1. Code is typically typed left-to-right, so it's possible to have "commit points" in grammar/parser. Here's an example from fall:
pub rule fn_def {
  'const'? 'unsafe'? linkage?
  'fn' <commit> ident
  type_parameters?
  value_parameters
  ret_type?
  where_clause?
  {block_expr | ';'}
}

This <commit> means that as soon parser sees fn keyword, it produces a function node, even if there are parsing errors after this point. Commit effectively makes the trailing part optional. In other words, one way to deal with partial parses is to treat certain prefixes of parses as parses. These commit points nicely mesh up with ll-ness of the parser: you commit just after the necessary lookahead.

  1. Code usually has block structure, so it makes sense, when parsing a block expression, first parse it approximately as a token tree, and then try to parse the internals of the block. That way, parse errors stay isolated to blocks. Of course, you can do this with all kinds of lists and what not, as long as you can invent a robust covering grammar. This trick also makes incremental parsers more efficient (as changes are isolated to one block unless it's a change in the block structure itself) and allows one to lazily parse stuff. There are certain variations of this trick: for example, you can lex string literals as just any stuff between ", and then lex the contents of the literal with a second lexer, which properly deals with escape sequences.

  2. When parsing repeated constructs, like item*, a useful strategy of error recovery is, after each (partially) parsed item skip tokens until a token from FIRST(item) appears. (example from fall)

field.


## Typed Tree
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this is kinda reminiscent of Go's AST, however Go's AST is designed that way more because Go doesn't have ADTs. (I've always found it annoying to work with because of that, but this isn't an inherent problem in this approach)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in this representation, you could use enums and pattern match, for example, against all kinds of expressions, but you won't be able to destruct structs.

@scottmcm
Copy link
Member

Might be interesting to look at how Roslyn solved a bunch of these same challenges for C# parsing and ASTs. There are probably better resources, but here's one I could find quickly: https://blogs.msdn.microsoft.com/ericlippert/2012/06/08/persistence-facades-and-roslyns-red-green-trees/

@matklad
Copy link
Member Author

matklad commented Dec 25, 2017

Might be interesting to look at how Roslyn solved a bunch of these same challenges for C# parsing and ASTs.

A most reasonable question! I've investigated a little how trees are represented in Dart SDK and Roslyn.

So for both Dart and C#, an object oriented approach is used, where there's a hierarchy of classes corresponding to syntactical constructs, and each class stores its children as typed fields. However, it is possible to view AST as untyped because all nodes share a supertype, are linked via parent links, and provide a getChildren method, which reconstructs a list of children from fields.

But they use different strategies to represent whitespace and comments.

In Roslyin, each token class has two fields for leading and trailing trivia.

The Dart looks weird (or maybe I am reading the wrong thing?) Seems like they build an explicit linked-list of non-whitespace non-comment tokens, and additionally attach comments (but not whitespace?) to the following token: https://github.com/dart-lang/sdk/blob/0c48a1163577a1157ea16c00b2fe914c1759357b/pkg/front_end/lib/src/scanner/token.dart#L477

This all makes me less sure that the representation I propose is good, though I still think it's better then alternatives. One of its worse parts is that accessing a node is linear in the number of children, and not constant. However, this can be fixed in a couple of ways: first, you can store a tree in such a way that all children of a node are stored in a continuous slice, which makes linear time pretty fast. Second, because a node is essentially a file-local index, you can store side tables like Vec<StructDefData>, Vec<TraitDefData> which make AST lookup constant. And what's interesting about the last trick is that you are very flexible in what data you store in such sidetables. Joke that it looks like SoA, ECS, and that we should build compiler like a game engine so that we can rewrite a browser like a game engine.

@arielb1
Copy link
Contributor

arielb1 commented Dec 25, 2017

Joke that it looks like SoA, ECS, and that we should build compiler like a game engine so that we can rewrite a browser like a game engine.

The compiler itself is already an ECS, this is just pushing stuff to the parser too.

* clarify goals and motivation: this is about IDE stuff, and not about
  stable access to AST

* Elaborate specifics of IDEs

* Retroactively justify the proposed syntax tree structure by listing
  design constraints which it satisfies
@nikomatsakis
Copy link
Contributor

@matklad ok so I read this last night. I'm definitely 👍 on the general idea of having a shared parsing library. I think the details here matter a lot -- for example, what representation to use, whether and what to auto-generate, etc -- but I don't necessarily think that an RFC is the right place to hash them out. I'm personally optimistic that we can craft a single library that is usable for IDEs, proc macros, and the compiler, but it'll definitely take some iteration and tinkering to get the balance right. (I don't consider these use cases as particularly divergent, though proc macros add the fun of wanting to be more extensible.)

I was thinking that it might be profitable to discuss these matters "live", at the upcoming Rust All Hands gathering, presuming that many of the stakeholders will be there?

One thing I would like to note:

I've been wanting for some time to add a mode to LALRPOP where it generates values of some pre-defined type, much like the trees you define here. The idea would be that you just write a grammar with no actions and we'll build up a tree; we could then layer tree transformers on top of that (a bit like what ANTLR does, iirc). I'd love for this "tree representation" we are discussing here to be an independent standard that we could use for that -- this would in turn allow us to have both hand-written and LALRPOP-generated parsers that are compatible (I'm not sure how hand writing buys us when compared against LALRPOP's existing error recovery mechanisms, but it's really hard to tell).

I think -- strategically -- it's a mistake to start out with too big of a goal. We should probably start by iterating on actual code and putting it to use. But another way, I feel like replacing libsyntax would be the "final step", not the first one. But it's good to have that goal in mind as we plan our steps, to make sure we're not accidentally building up things with critical flaws.

@mark-i-m
Copy link
Member

I'm not sure how hand writing buys us when compared against LALRPOP's existing error recovery mechanisms, but it's really hard to tell

My understanding from the discussion above was that we gained a lot in terms of custom error messages. For example, would it be possible to do something like rust-lang/rust#48858 with an auto-generated parser? Also, is Rust's grammar even known to be LALR?

[drawbacks]: #drawbacks

- No harm will be done as long as the new libsyntax exists as an
experiemt on crates.io. However, actually using it in the compiler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/experiemt/experiment/


* It is minimal: it stores small amount of data and has no
dependencies. For instance, it does not need compiler's string
interner or literal data representation.
Copy link
Contributor

@pickfire pickfire Apr 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean internal?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, a "string interner" is a data structure that combines all copies of a string into one so they can be shared, for lower memory usage and fast comparison.

new tree.

* A prototype implementation of the macro expansion on top of the new
sytnax tree.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/sytnax/syntax/

@mark-i-m
Copy link
Member

Out of curiosity, what were the results of the All Hands meeting on this?

@matklad
Copy link
Member Author

matklad commented Apr 27, 2018

@mark-i-m argh, I've should have written it here ages ago, thanks for the ping!

The conclusion from all-hands discussion was that a lot here depends on the actual implementation details, and that it makes sense to experiment with parse tree approach.

For experimenting, we've decided that it would be more interesting to add a parse-tree mode to LALRPOP (which differs from "let's write another rust parser by hand" approach I've proposed in RFC).

The current work on LALRPOP is tracked in this issue: lalrpop/lalrpop#354

@matklad
Copy link
Member Author

matklad commented Jul 28, 2018

I've just studied the implementation of swift's libsyntax, and it has some nice ideas. It is also surprisingly easy to understand.

They use a three-layered representation:

  • First layer is RawSyntax: immutable persistent (i.e, it stores lengths, not ranges) homogeneous tree (green node).
  • Second layer is SyntaxData: immutable lazy materializable homogeneous tree with ranges (red node).
  • Third layer is Syntax and it's generated subclasses. This is a typed layer in the style of this RFC.

Now, what makes their representation really interesting is that they store child nodes in an array, and not in a linked list (as proposed in the RFC). This allows for O(1) child access operation:

class ClassDeclSyntax final : public DeclSyntax {
public:
  enum Cursor : uint32_t {
    Attributes,
    Modifiers,
    ClassKeyword,
    Identifier,
    GenericParameterClause,
    InheritanceClause,
    GenericWhereClause,
    Members,
  };
};


llvm::Optional<GenericParameterClauseSyntax> ClassDeclSyntax::getGenericParameterClause() {
  auto ChildData = Data->getChild(Cursor::GenericParameterClause);
  if (!ChildData)
    return llvm::None;
  return GenericParameterClauseSyntax {Root, ChildData.get()};
}

What I find sub optimal about swift representation is that whitespace is attached directly to tokens, which make it harder to implement "declaration owns all preceding comments" logic. This can be fixed using the following representation:

struct Trivia {
    kind: SyntaxKind,
    text: InternedString,
}

type Trivias = SmallVec<Arc<Trivia>>;

struct GreenNode {
    kind: SyntaxKind,
    len: TextUnit,
    leadingTrivia: Trivias,
    children: [(Arc<GreenNode>, Trivias)], //DST
}

Another bit is that Arc-based GreenNodes representation is optimized for a rare use-case of modification. At the same time, I think a majority of syntax nodes are immutable once created. A nice thing about layered representation though is that we can have different implementations of layers. In particular, RedNodes might be backed either by GreenNodes, or by some more compact representation which stores all nodes of a single file in a single array.

EDIT: a short video overview of libsyntax: https://www.youtube.com/watch?v=5ivuYGxW_3M

@matklad
Copy link
Member Author

matklad commented Jul 30, 2018

An interesting observation about swift's tree:

Each node holds an Arc to the root of the tree and a raw pointer to a particular tree node. This is convenient, because you get owned syntax tree nodes with value semantics and parent pointers. However that means that read-only traversal of the tree generates atomic refcount increments/decrements, which I expect might generate contention if a single file processed concurrently. That is, Root in return GenericParameterClauseSyntax {Root, ChildData.get()}; is actually a non-free atomic operation (I might be reading C++ wrong here).

However in Rust, we can make the strong pointer to root generic, and use either Arc<Root> or &'a Root for it. Moreover, it is possible to implement

fn as_borrowed<'a>(owned: &'a SyntaxNode<Arc<Root>>) -> SyntaxNode<&'a Root>

That is, it is possible at runtime to switch from an owned Arced version to a Copy (!) borrowed version for local processing and I think (with a bit of unsafe magic) back as well.

@lnicola
Copy link
Member

lnicola commented Jul 31, 2018

@matklad Are you aware of the Roslyn team's work on this? It might be worth looking into. There's a very high level description here and maybe here.

It does look similar to what Swift is doing.

@matklad
Copy link
Member Author

matklad commented Jul 31, 2018

@lnicola yes, I am aware of Rolsyn's approach. The Swift libsyntax is indeed a realization of this red/green. Their implementation is much easier to read, and also does not rely on GC.

@maxbrunsfeld
Copy link

I've just been doing some work on tree-sitter-rust, the incremental rust parser that Atom will soon ship with (and hopefully Xray will ship with at some point).

I had one interesting realization related to macros: for syntax highlighting purposes, and probably other purposes as well, it's desirable to, if possible, parse the contents of token trees as expressions and items, as opposed to leaving them in the unstructured form that rustc -Z ast-json-noexpand represents them in.

For example, before I added this feature, code like this would not syntax highlight very nicely, because we wouldn't know that c.d is a field access and that T and U are types.

assert_eq!(a::b::<T, U>(), c.d);

Of course, not all token trees have a regular structure like this, so we need to 'fall back' to parsing them in an unstructured way. With tree-sitter, I handle this via GLR's ambiguity resolution mechanism. With a hand-written parser, you could probably use a multi-pass approach.

I also might be overthinking this; I'm curious if you have thoughts on this issue.

@matklad
Copy link
Member Author

matklad commented Aug 5, 2018

I had one interesting realization related to macros: for syntax highlighting purposes, and probably other purposes as well, it's desirable to, if possible, parse the contents of token trees as expressions and items, as opposed to leaving them in the unstructured form that rustc -Z ast-json-noexpand represents them in.

That is true. Extend selection is next to useless if what looks to a human like an expression is represented as a token tree in the syntax tree.

I also agree that non-deterministic parsing of macro invocation body as an expression or an item is a good approximation, especially with GLR approaches, where you can cheaply try all variants of the parsing.

However the proper solution here is indeed a two phase parsing. You need to know the difference between macro_rules! foo { ($arg:expr) => { ... }} and macro_rules! foo { ($arg:item) => { ... }} to parse macro invocation correctly. So, you first need parse the file in isolation, treating each macro invocation as a token stream. Then you need to resolve macro calls, interpret macro definitions and then inject true syntax trees into macro calls. And don't forget that two-phase parsing is also useful for things like embedding CSS/JS into HTML!

So, for practical purposes, I would probably have used the following progression:

  1. treat macro invocations as token trees
  2. hard-code println family of macros ([e]print[ln], write[ln], format, panic, unreachable, logging macros)
  3. Add support for two-phase parsing
  4. Using two-phase and GLR, implement heuristic approach to parsing (I'd do it on top of two-phase and not directly in the grammar just to exercise two-phase approach a bit)
  5. Extend heuristics by, for example, trying to guess the appropriate macro definition based on the index of macros in the project (surprisingly, such "resolve" wouldn't be too far away from truth for 2015-style macros).

@maxbrunsfeld
Copy link

So, you first need parse the file in isolation, treating each macro invocation as a token stream. Then you need to resolve macro calls, interpret macro definitions and then inject true syntax trees into macro calls. And don't forget that two-phase parsing is also useful for things like embedding CSS/JS into HTML!

Yeah, multi-phase parsing is definitely useful in general; we currently use it for things like JS in HTML, templating languages like EJS, etc. The hard parts of what you propose seems to be resolving macro calls correctly and interpreting macro definitions.

You need to know the difference between macro_rules! foo { ($arg:expr) => { ... }} and macro_rules! foo { ($arg:item) => { ... }} to parse macro invocation correctly.

But I don't think you could do this based on some finite heuristic, because macro patterns can have arbitrary nesting and complexity. For example, in this macro from serde, the outer token tree isn't an expression, but there are many expressions nested within it, before and after the => token.

declare_tests! {
    // ...

    test_result {
        Ok::<i32, i32>(0) => &[
            Token::Enum { name: "Result" },
            Token::Str("Ok"),
            Token::I32(0),
        ],
   }

   // ...
}

With my current approximate approach based on GLR, I'm able to determine on-the-fly that these inner token trees are expressions, but the outer one is not. How would a heuristic-based system deal with macros like this?

@matklad
Copy link
Member Author

matklad commented Aug 7, 2018

How would a heuristic-based system deal with macros like this?

I mean something like "heuristically resolve macro call, and then interpret the macro definition". But this is probably a lot of effort for little gain in comparison to what GLR already gives you.

@matklad
Copy link
Member Author

matklad commented Aug 7, 2018

Status update:

I've implemented (approximately) Swift style syntax tree in libsyntax2, in both owned and borrowed variants: https://github.com/matklad/libsyntax2/blob/2fb854ccdae6f1f12b60441e5c3b283bdc81fb0a/src/yellow/syntax.rs

I've also hacked quite a bit on the parser itself, so that it now parses a majority of non-weird Rust constructs. Here's, for example, libsyntax2-based extend selection: https://www.youtube.com/watch?v=21NbnLhj-S4

@matklad
Copy link
Member Author

matklad commented Jan 7, 2019

Status update:

@bstrie
Copy link
Contributor

bstrie commented Jan 7, 2020

@matklad I'm happy to see that rust-analyzer is still progressing and I'm eager for updates, though I'm also wondering if the RFC itself here is still relevant?

@Centril
Copy link
Contributor

Centril commented Jan 7, 2020

I've substantially refactored libsyntax itself since then; it is now split into libsyntax (soon librustc_ast), librustc_parse, and librustc_expand. Moreover, if I might say so, I think the code quality has substantially improved and the organization has improved. That said, the fundamental architecture has not changed.

@matklad
Copy link
Member Author

matklad commented Jan 8, 2020

Yeah, I guess it's time for the yearly update.

The main thing is that we (primarily @edwin0cheng :) ) explored the macro expansion. There's no true name resolution hygiene yet, but there's support for goto definition from/into/through macros which needs roughly the same infrastructure.

Given that the RFC is phrased in terms of "experiment, then do a separate rfc-ish process for upstreaming", I think this actually means that the RFC is fully implemented :D Should we just merge it retroactively?

I do plan to start discussion on upstreaming soon ("next", to be precise). The plan I had in mind is to focus on sharing the parser, without sharing the syntax tree itself yet. Possible steps are:

  • refactor rustc parser further to be more like a pure function (#64197 is a representative issue here)
  • refactor the input of the parser to be more like proc-macro token streams (this is the most fundamentally annoying parts, due to the fact that compiler's internal token streams embedd ast fragments, which is observable from the outside)
  • remove the coupling between the parser and ast, to make it possible for a parser to produce different types of AST. This is probably the most annoying change technically. An interesting prior art here is that Swfit is not doing this (report)
  • add rust-analyzer style error resilience to rustc parser
  • share the parser between rustc and rust-analyzer

After that we might want to converge on the single data structure for a syntax tree, but that probably should wait for the next yearly update :-)

The immediate blocker is that I need to document the current state of parsing/syntax trees in rust-analyzer, so that t-compiler can get a solid understanding of what I am actually proposing, technically :-)

@Mark-Simulacrum
Copy link
Member

We discussed this in a T-compiler meeting yesterday and felt that probably this doesn't need to be an RFC in today's world, and that most of the pieces in play here are already seeing action through both rust-analyzer/librarification and the newly formed parser library group (Zulip stream).

I'm going to go ahead and close this RFC as such -- we expect that we'll probably be seeing major change proposals from the parser library group, and once we're in a better position to look at next steps we can re-examine an RFC like this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-syntax Syntax related proposals & ideas T-compiler Relevant to the compiler team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.