-
-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: codegen AST related codes #4134
Comments
@overlookmotel @rzvxa @Dunqing I think we can start prototyping once we decide on the generation method. I vote for all Rust, but I'm not sure about its implications or downsides. |
👍for all Rust, Now that we know our requirements and are done with prototyping we could leverage syn for our code generation. It is harder to iterate on but we won't need to parse and generate rust code ourselves, We can treat it like a global macro if you are willing to use nightly for codegen, We quite literally can use If we want to break our ast definition into separate files or don't use nightly at all; We can manually parse the string using syn and vice versa. Nice to have: we might want to generate the AST itself from our definition file, So it is easy to do AST-wide refactors in the future. |
Good catch, we should move all ast implementations to another file, so we can parse ast nodes easily. Or maybe we should scan the file using syn, and extract everything marked as On nightly: hard no, sorry. |
I'm more in favor of having the definition file(s) for ast containing only the struct/enum and its attribute/derives, With this we are not only simplifying the build process(since any item in the global scope of definition file is an AST type) but also provide a way to add to the codegen, For example, if you remember for |
I wanted to follow how React Compiler does this
but @overlookmotel prefers writing it in Rust because it feels more natural. I don't mind either. Also good catch on wanting to add common attributes to the nodes. |
I think in the end both the rust syntax and these JSON files are the same thing. Having it as rust syntax and parsing it to get our serialized data is more steps for sure but I have to agree with @overlookmotel, It is much easier to understand and reason about our schema this way, Unless we want to share these schemas with other projects I think having |
Rust or JS?I vote Rust too. I only wrote DesignI propose we do this in 2 parts:
The schema would be generated by:
Schema types could be something like https://github.com/overlookmotel/layout_inspect/blob/master/layout_inspect/src/defs.rs This schema can then be output as both:
Why?2 reasons: Firstly, AST transfer needs a schema just like this - so we can kill 2 birds with 1 stone. Secondly, the schema can also be used in proc macros. Fixing #4297 requires replacing the dodgy (see "Option 3" in #4297) We don't need to implement rkyv serialization initially. We can just use JSON to serialize the schema to start with. rkyv is an optimization to avoid having Generating AST typesPersonally I am anti generating the AST types themselves. I think we should write the AST types in Rust (as at present) and then generate everything around them. I laid out my reasoning for this in #4297 (at length, sorry probably way too much length). When we want to alter the AST types themselves, we can use proc macros. This does have a downside in terms of compile time, but the rkyv-serialized schema thing is designed to reduce that impact to a minimal level. These are just my thoughts... feel free to disagree! |
What if we have definitions (schema) written as rust struct/enums? Then if we even need to output a JSON schema we can create it from the syntax data instead of json information. Look at this and tell me it isn't possible to write all of this in rust https://github.com/facebook/react/blob/main/compiler/crates/react_estree_codegen/src/ecmascript.json. What are the differences between these 2 pieces of string other than deserialization method? "StringLiteral": {
"fields": {
"value": {
"type": "String",
"hermes_convert_with": "convert_string_value"
}
}
}, struct StringLiteral {
#[hermes_convert_with("convert_string_value")]
value: String,
} My point is that we can have our cake and eat it too, We can write definitions in the rust itself, yet still use this information and build upon it to generate the actual AST types. |
Example usecase: #[magic_attr]
struct StringLiteral<'a> {
#[hermes_convert_with("convert_string_value")]
pub value: Atom<'a>,
} ast/ast.rs #[hermes_stuff]
struct StringLiteral<'a> {
#[hermes_convert_with("convert_string_value")]
pub value: Atom<'a>,
magical_field: MagicType<'a>,
}
impl<'a, 'm> BlackMagic<'a, 'm>for StringLiteral<'a> {
fn cast(&'m self) -> MagicMisile<'a, 'm> {
self.magical_field.cast(MagicKind::Black).into()
}
} |
AST definition file should have its own prelude so we are confined to the use of known types in our ast definitions(happens in the proc macro/codegen). This way we can disallow the use statements or path qualifiers. With this imposed limitation on our ast definition we no longer have to resolve and link types, We just use a hashmap since they are all known beforehand. Ast uses other AST types, Box, Vec, Atom, str, and a handful of other types but there is no random type there(as it should be). |
Agreed. No difference. Except... then we have 2
One of our goals is for the codebase to be accessible to new contributors. The AST is a central pillar of Oxc, probably the first place you go when you start exploring the codebase. So I think it's particularly important that the AST "just makes sense". Entering a world of confusion on your very first step could be a massive turn-off. To summarize: as I understand it, the major pluses and minuses are:
So, my view is:
But... I've said my piece now. It's a preference rather than a strongly held view - I can see both sides of the argument. If you both feel strongly that you prefer to codegen AST, please feel free to overrule me. I won't be upset! Only thing I ask is please do try to go through my thinking in #4297, and try to follow my reasoning. If you get my arguments, but don't agree with them, no problem. But please do consider them.
🤣🤣🤣 Love it! |
It is ironic that it is my exact point for going in the opposite direction😆 I think since these are the actual things existing on the type, jumping to the generated definition is self-documenting because of no magic code, basically, what you see is what you get. so I can see
Then we manually export these generated codes in our |
I think for you to make the call on this @Boshen! |
Haha, can we start somewhere small? I propose: Gather the AST nodes somehow (we don't need the attributes yet), and layout the backbone of codegen for
once this is in place, we can experiment with different ast definition strategies and pick the one that fits my requirements. If parsing The pipeline is Definition -> Model in Rust -> an easy mechanism for writing out boilerplates -> code. The first milestone will just be the backbone of the [Model in Rust -> an easy mechanism for writing out boilerplates -> code] part. |
I'm going to go for a walk and mull this over. Give me an hour and I'll come back with some comments. |
To back my point, This is a common pattern in our current AST definition: inherit_variants! {
/// Expression
///
/// Inherits variants from [`MemberExpression`]. See [`ast` module docs] for explanation of inheritance.
///
/// [`ast` module docs]: `super`
#[visited_node]
#[repr(C, u8)]
#[derive(Debug, Hash)]
#[cfg_attr(feature = "serialize", derive(Serialize, Tsify))]
#[cfg_attr(feature = "serialize", serde(untagged))]
pub enum Expression<'a> {
BooleanLiteral(Box<'a, BooleanLiteral>) = 0,
NullLiteral(Box<'a, NullLiteral>) = 1,
NumericLiteral(Box<'a, NumericLiteral<'a>>) = 2,
BigintLiteral(Box<'a, BigIntLiteral<'a>>) = 3,
RegExpLiteral(Box<'a, RegExpLiteral<'a>>) = 4,
StringLiteral(Box<'a, StringLiteral<'a>>) = 5,
TemplateLiteral(Box<'a, TemplateLiteral<'a>>) = 6,
Identifier(Box<'a, IdentifierReference<'a>>) = 7,
MetaProperty(Box<'a, MetaProperty<'a>>) = 8,
Super(Box<'a, Super>) = 9,
ArrayExpression(Box<'a, ArrayExpression<'a>>) = 10,
ArrowFunctionExpression(Box<'a, ArrowFunctionExpression<'a>>) = 11,
AssignmentExpression(Box<'a, AssignmentExpression<'a>>) = 12,
AwaitExpression(Box<'a, AwaitExpression<'a>>) = 13,
BinaryExpression(Box<'a, BinaryExpression<'a>>) = 14,
CallExpression(Box<'a, CallExpression<'a>>) = 15,
ChainExpression(Box<'a, ChainExpression<'a>>) = 16,
ClassExpression(Box<'a, Class<'a>>) = 17,
ConditionalExpression(Box<'a, ConditionalExpression<'a>>) = 18,
FunctionExpression(Box<'a, Function<'a>>) = 19,
ImportExpression(Box<'a, ImportExpression<'a>>) = 20,
LogicalExpression(Box<'a, LogicalExpression<'a>>) = 21,
NewExpression(Box<'a, NewExpression<'a>>) = 22,
ObjectExpression(Box<'a, ObjectExpression<'a>>) = 23,
ParenthesizedExpression(Box<'a, ParenthesizedExpression<'a>>) = 24,
SequenceExpression(Box<'a, SequenceExpression<'a>>) = 25,
TaggedTemplateExpression(Box<'a, TaggedTemplateExpression<'a>>) = 26,
ThisExpression(Box<'a, ThisExpression>) = 27,
UnaryExpression(Box<'a, UnaryExpression<'a>>) = 28,
UpdateExpression(Box<'a, UpdateExpression<'a>>) = 29,
YieldExpression(Box<'a, YieldExpression<'a>>) = 30,
PrivateInExpression(Box<'a, PrivateInExpression<'a>>) = 31,
JSXElement(Box<'a, JSXElement<'a>>) = 32,
JSXFragment(Box<'a, JSXFragment<'a>>) = 33,
TSAsExpression(Box<'a, TSAsExpression<'a>>) = 34,
TSSatisfiesExpression(Box<'a, TSSatisfiesExpression<'a>>) = 35,
TSTypeAssertion(Box<'a, TSTypeAssertion<'a>>) = 36,
TSNonNullExpression(Box<'a, TSNonNullExpression<'a>>) = 37,
TSInstantiationExpression(Box<'a, TSInstantiationExpression<'a>>) = 38,
// `MemberExpression` variants added here by `inherit_variants!` macro
@inherit MemberExpression
}
} It is only this way because it isn't feasible to copy all of these manually and keep track of them when one changes. So with code-gening the AST we still have this syntax for our definitions to keep it consistent, /// Expression
///
/// Inherits variants from [`MemberExpression`]. See [`ast` module docs] for explanation of inheritance.
///
/// [`ast` module docs]: `super`
#[visited_node]
#[repr(C, u8)]
#[derive(Debug, Hash)]
#[cfg_attr(feature = "serialize", derive(Serialize, Tsify))]
#[cfg_attr(feature = "serialize", serde(untagged))]
pub enum Expression<'a> {
BooleanLiteral(Box<'a, BooleanLiteral>) = 0,
NullLiteral(Box<'a, NullLiteral>) = 1,
NumericLiteral(Box<'a, NumericLiteral<'a>>) = 2,
BigintLiteral(Box<'a, BigIntLiteral<'a>>) = 3,
RegExpLiteral(Box<'a, RegExpLiteral<'a>>) = 4,
StringLiteral(Box<'a, StringLiteral<'a>>) = 5,
TemplateLiteral(Box<'a, TemplateLiteral<'a>>) = 6,
Identifier(Box<'a, IdentifierReference<'a>>) = 7,
MetaProperty(Box<'a, MetaProperty<'a>>) = 8,
Super(Box<'a, Super>) = 9,
ArrayExpression(Box<'a, ArrayExpression<'a>>) = 10,
ArrowFunctionExpression(Box<'a, ArrowFunctionExpression<'a>>) = 11,
AssignmentExpression(Box<'a, AssignmentExpression<'a>>) = 12,
AwaitExpression(Box<'a, AwaitExpression<'a>>) = 13,
BinaryExpression(Box<'a, BinaryExpression<'a>>) = 14,
CallExpression(Box<'a, CallExpression<'a>>) = 15,
ChainExpression(Box<'a, ChainExpression<'a>>) = 16,
ClassExpression(Box<'a, Class<'a>>) = 17,
ConditionalExpression(Box<'a, ConditionalExpression<'a>>) = 18,
FunctionExpression(Box<'a, Function<'a>>) = 19,
ImportExpression(Box<'a, ImportExpression<'a>>) = 20,
LogicalExpression(Box<'a, LogicalExpression<'a>>) = 21,
NewExpression(Box<'a, NewExpression<'a>>) = 22,
ObjectExpression(Box<'a, ObjectExpression<'a>>) = 23,
ParenthesizedExpression(Box<'a, ParenthesizedExpression<'a>>) = 24,
SequenceExpression(Box<'a, SequenceExpression<'a>>) = 25,
TaggedTemplateExpression(Box<'a, TaggedTemplateExpression<'a>>) = 26,
ThisExpression(Box<'a, ThisExpression>) = 27,
UnaryExpression(Box<'a, UnaryExpression<'a>>) = 28,
UpdateExpression(Box<'a, UpdateExpression<'a>>) = 29,
YieldExpression(Box<'a, YieldExpression<'a>>) = 30,
PrivateInExpression(Box<'a, PrivateInExpression<'a>>) = 31,
JSXElement(Box<'a, JSXElement<'a>>) = 32,
JSXFragment(Box<'a, JSXFragment<'a>>) = 33,
TSAsExpression(Box<'a, TSAsExpression<'a>>) = 34,
TSSatisfiesExpression(Box<'a, TSSatisfiesExpression<'a>>) = 35,
TSTypeAssertion(Box<'a, TSTypeAssertion<'a>>) = 36,
TSNonNullExpression(Box<'a, TSNonNullExpression<'a>>) = 37,
TSInstantiationExpression(Box<'a, TSInstantiationExpression<'a>>) = 38,
@inherit MemberExpression
} But we give the end user this well-documented code to read: /// Expression
///
/// Inherits variants from [`MemberExpression`]. See [`ast` module docs] for explanation of inheritance.
///
/// [`ast` module docs]: `super`
#[visited_node]
#[repr(C, u8)]
#[derive(Debug, Hash)]
#[cfg_attr(feature = "serialize", derive(Serialize, Tsify))]
#[cfg_attr(feature = "serialize", serde(untagged))]
pub enum Expression<'a> {
BooleanLiteral(Box<'a, BooleanLiteral>) = 0,
NullLiteral(Box<'a, NullLiteral>) = 1,
NumericLiteral(Box<'a, NumericLiteral<'a>>) = 2,
BigintLiteral(Box<'a, BigIntLiteral<'a>>) = 3,
RegExpLiteral(Box<'a, RegExpLiteral<'a>>) = 4,
StringLiteral(Box<'a, StringLiteral<'a>>) = 5,
TemplateLiteral(Box<'a, TemplateLiteral<'a>>) = 6,
Identifier(Box<'a, IdentifierReference<'a>>) = 7,
MetaProperty(Box<'a, MetaProperty<'a>>) = 8,
Super(Box<'a, Super>) = 9,
ArrayExpression(Box<'a, ArrayExpression<'a>>) = 10,
ArrowFunctionExpression(Box<'a, ArrowFunctionExpression<'a>>) = 11,
AssignmentExpression(Box<'a, AssignmentExpression<'a>>) = 12,
AwaitExpression(Box<'a, AwaitExpression<'a>>) = 13,
BinaryExpression(Box<'a, BinaryExpression<'a>>) = 14,
CallExpression(Box<'a, CallExpression<'a>>) = 15,
ChainExpression(Box<'a, ChainExpression<'a>>) = 16,
ClassExpression(Box<'a, Class<'a>>) = 17,
ConditionalExpression(Box<'a, ConditionalExpression<'a>>) = 18,
FunctionExpression(Box<'a, Function<'a>>) = 19,
ImportExpression(Box<'a, ImportExpression<'a>>) = 20,
LogicalExpression(Box<'a, LogicalExpression<'a>>) = 21,
NewExpression(Box<'a, NewExpression<'a>>) = 22,
ObjectExpression(Box<'a, ObjectExpression<'a>>) = 23,
ParenthesizedExpression(Box<'a, ParenthesizedExpression<'a>>) = 24,
SequenceExpression(Box<'a, SequenceExpression<'a>>) = 25,
TaggedTemplateExpression(Box<'a, TaggedTemplateExpression<'a>>) = 26,
ThisExpression(Box<'a, ThisExpression>) = 27,
UnaryExpression(Box<'a, UnaryExpression<'a>>) = 28,
UpdateExpression(Box<'a, UpdateExpression<'a>>) = 29,
YieldExpression(Box<'a, YieldExpression<'a>>) = 30,
PrivateInExpression(Box<'a, PrivateInExpression<'a>>) = 31,
JSXElement(Box<'a, JSXElement<'a>>) = 32,
JSXFragment(Box<'a, JSXFragment<'a>>) = 33,
TSAsExpression(Box<'a, TSAsExpression<'a>>) = 34,
TSSatisfiesExpression(Box<'a, TSSatisfiesExpression<'a>>) = 35,
TSTypeAssertion(Box<'a, TSTypeAssertion<'a>>) = 36,
TSNonNullExpression(Box<'a, TSNonNullExpression<'a>>) = 37,
TSInstantiationExpression(Box<'a, TSInstantiationExpression<'a>>) = 38,
// begin inherit from `MemberExpression` (if you know what inherit means in this context good for you, Otherwise you still get all the information without needing to jump 2 or more times first to the inherit_variants and then to the MemberExpression to understand it.)
/// `MemberExpression[?Yield, ?Await] [ Expression[+In, ?Yield, ?Await] ]`
ComputedMemberExpression(Box<'a, ComputedMemberExpression<'a>>) = 48,
/// `MemberExpression[?Yield, ?Await] . IdentifierName`
StaticMemberExpression(Box<'a, StaticMemberExpression<'a>>) = 49,
/// `MemberExpression[?Yield, ?Await] . PrivateIdentifier`
PrivateFieldExpression(Box<'a, PrivateFieldExpression<'a>>) = 50,
// end inherit from `MemberExpression`
} We would also generate the relevant match macros and those are also there to read(without nested declarative macro calls which can confuse newcomers) and maybe even omit the discriminates(in the definition file) since now we can calculate them at comptime (but having them explicitly defined is better in my opinion). I can't help but find a generated file more understandable than how We can call our definition file If we don't want to have magical #[inheritance]
enum E {
A(A),
#[inherit]
B(B),
} |
Not going to go into the detail of your comment, Rez, because this thread is getting too long. But I agree More generally, I've had a think, and have a few comments: To gen or not to gen?I have loosened my opposition to generating the AST types. I think what I was really opposed to is defining the schema as JSON (like Hermes/React Compiler does) which is verbose, ugly, and hard to read. Defining the schema as Rust types is much better. If we do the following, I think it would answer all my criticisms:
(not claiming all these ideas as my own, just summarizing in one place) How to beginAs Boshen says, we don't need to jump to doing this straight away. We can initially write our codegen for e.g. If we're agreed that in the end our "schema" will be written as Rust types, making the change later to "schema files" should not involve a huge change to the codegen. SchemaAST transfer needs a schema of all the AST types as JSON. So could we please build our codegen foundation along these lines:
If we do it like that, we can kill 2 birds with one stone, rather than having to build a 2nd parser/codegen for AST transfer later. (I think this is the plan anyway, but just repeating to make sure)
|
I am glad that you've backed off a little bit on your opposition, I believe all of your points are addressable(And I have a vague idea of how to achieve the rust-analyzer compatibility with preluding types and have a mock macro for inheritance to prevent syntax errors). Since you've written the Traverse codegen, Would you like to also give this one a go? I can start working on it next week so it is up to you to decide. Just a note: Even if we don't want to have rust schema we might want to consider starting with extracting type definitions from impls in our ast files, Having separate files like |
Gleam uses Cap'n proto for its codegen. |
Apparently we can't transfer an issue from a private repo to a public repo. I'll chat with @rzvxa on discord and set the milestones once he finishes all cfg related code. I'll create the issues in oxc when we have a consensus on what to build - the final version is in this issue's description. |
Question: what should be done with code surrounded by |
We could also codegen for all AST types:
NB: |
Oh dear! It would be ideal for all of the conversation in this issue to be made public, so people can see not only what decisions were reached, but also the reasons for reaching them. Personally, I think this is quite important. When I find code in Oxc which I don't understand the rationale for, I often look at git blame to find out in what PR that code was introduced, and then I read the surrounding issues. This helps me to understand why things have been designed the way they have. As a workaround, can we temporarily make backlog repo public, transfer the issue to oxc repo, then quickly make backlog repo private again?? |
The codegen will need to parse and understand the We can use that understanding to codegen all the impls which |
Sorry @rzvxa I didn't answer your question:
I'd be happy to, but I only have a couple of days before I go away, and am keen to finish off correcting spans and scopes in the transformer before I do. So if you're willing to work on this, by all means go ahead. |
I think this one should be done after Otherwise, it is the easiest to generate and has already been added to the ast_codegen. |
Oh ok, pick the easiest one to have it codegened so we can get it reviewed, merged, and replaced 😅 |
I've already did, Didn't mark it for review so you guys can give a green light to the underlying architecture first. Edit: We are basically done with both
|
These visit functions are outdated, do we have a better way to reimplement them here? It is now difficult to manually update oxc/crates/oxc_semantic/src/builder.rs Lines 454 to 1650 in 2fd55b0
|
Outdated as in field visitation order is wrong? |
I've moved this issue to main oxc repo as we're currently working on this. |
or missing visits to some fields |
Consensus was manually patch them, we eat our own 💩 |
It can be fine to have a slightly different visitation order for some implementations of our visits if it is necessary for them. The good thing about the generated code is that we have the correct walk order by default but it shouldn't prevent anyone from doing custom walks. It is fine as long as we acknowledge the difference and know it has a reason to be that way. |
I'm mostly scared of scope events since they can drift apart from the source of truth and we use the source of truth version everywhere but the semantics. Maybe we should generate the scope events as their own functions? something like |
or implement scope visit as a trait for our types so we can write |
We need a |
Could we pass the AST node ( |
I was thinking about this when I saw the @Dunqing's comment. But the type erasure will prevent us from doing anything useful to the If we don't want to pass in T we can pass |
Maybe that's better. I think just |
I used
This one I just forgot, You are absolutely right. |
pass |
It seems we all agree on this so I'll go ahead and implement it |
He done it! #4168 |
RFC is done, remains issues can be tracked in #3819 |
Requirements
#[visited_node]
Workflow
#[visited_node]
struct ASTModel { name: String, attributes: Vec<Attributes> }
Code to generate
ASTKind
(Milestone 1)GetSpan
(Milestone 2)Hash
,PartialEq
implementations (Milestone 3)ASTBuilder
(Milestone 4)Visit
(Milestone 5)VisitMut
(Milestone 5)Traverse
, remove current js code (Milestone 6)typescript definition for napi and wasmmoving the scope to AST transfer.Every milestone should be a stack of PRs, since they will involve changing other files as well.
Things we need to discuss:
traverse
currently uses JavaScript, I think it's fine, but does it scale to supporting these many code patterns?update: The consensus is stay with our current approach and parse
#[visited_node]
as the source of truth.Topic not part of this RFC:
The text was updated successfully, but these errors were encountered: