Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ast_node_id field to all ast nodes #2818

Closed
rzvxa opened this issue Mar 26, 2024 · 18 comments
Closed

Add ast_node_id field to all ast nodes #2818

rzvxa opened this issue Mar 26, 2024 · 18 comments

Comments

@rzvxa
Copy link
Contributor

rzvxa commented Mar 26, 2024

We should have a field similar to IdentifierReference's reference_id that would keep the AstNodeId of each AST node.

Can be written manually, using a declarative ast_node macro for definition or a procedural derive macro.
I guess we can also remove AstNode as it can be changed into SOA in the AstNodes structure.
If we remove the AstNode struct we can add a trait with the same name which would implement a function called ast_node_id that would give us the information for getting things that AstNode would've provided before. With this trait, we can use any ast node with the help of generics since with the ast_node_id alone we can infer the node's kind, parents, scope, control flow, and anything that gets added down the road.

@rzvxa
Copy link
Contributor Author

rzvxa commented Mar 26, 2024

cc @Boshen

@rzvxa
Copy link
Contributor Author

rzvxa commented Mar 26, 2024

Let's say we are in the process of visiting an expression, Is there any way that can get that expression's AstNodeId? If there is not we should certainly add one.

For doing such a thing 2 things come to my mind.

  1. Adding a reverse hash between the inner node(the AST item itself, expression in the example above) and the AstNodeId to retrieve it after semantic analysis.
  2. Go through with adding the ast_node_id field to all those AST structs to allow access to the AstNodeId without
    any extra lookup table.

Without any one of these, we have to basically intercept the visit at a much higher level than necessary to gain knowledge of the surrounding nodes.

I believe that having ast_node_id on individual structs can reduce the complexity when working with ASTs since it would eliminate the need for an additional wrapper structure(AstNode) around the node itself. Instead of being limited to AstNode fields we can fetch all sorts of data using a node's ID from various sources.
It would also work very well with SOA designs. Although it is a bit imperative.

@Boshen
Copy link
Member

Boshen commented Mar 26, 2024

It's a 👍 for adding a node id because rustc has them as well https://doc.rust-lang.org/beta/nightly-rustc/rustc_ast/node_id/struct.NodeId.html, so the arguments are not theoretical but practical.

Since this requires a lot of effort, I think we need a few real world scenarios that can motivate this change, to decide whether we want to do this right now, or later when it's absolutely necessary.

@rzvxa
Copy link
Contributor Author

rzvxa commented Mar 26, 2024

Well, let me try my best to explain a real-world situation in which this change can help make things easier.

Let's say we want to transform a given node based on its parent, Right now this is achieved by monitoring all nodes and keeping a stack of them. I'm going to take the ArrowFunctions transformer as an example but after studying the babel transformers I find it a common thing.

impl<'a> VisitMut<'a> for ArrowFunctions<'a> {
    fn enter_node(&mut self, kind: AstType) {
        self.nodes.push(kind);
    }

    fn leave_node(&mut self, _kind: AstType) {
        self.nodes.pop();
    }

    fn visit_jsx_identifier(&mut self, ident: &mut JSXIdentifier<'a>) {
        let parent_kind = self.nodes.last().unwrap();
        let parent_parent_kind = self.nodes[self.nodes.len() - 2];
        // ...
        walk_jsx_identifier_mut(self, ident);
    }
}

Well if we go for this change it can be simplified into something more akin to this:

impl<'a> VisitMut<'a> for ArrowFunctions<'a> {
    fn visit_jsx_identifier(&mut self, ident: &mut JSXIdentifier<'a>) {
        let parent = self.ctx.semantics().nodes().parent(ident.ast_node_id());
        let parent_kind = parent.kind();
        let parent_parent_kind = self.ctx.semantics().nodes().parent_kind(parent.ast_node_id());
        // ...
        walk_jsx_identifier_mut(self, ident);
    }
}

Keep in mind that the example above is only using its parent's kind, If we want the actual parent node in VisitMut we have to do crazy things like having a reverse lookup table, searching through the nodes, or trying to achieve the same thing with only scope information. While having the ast_node_id would make it a simple array indexing.

@rzvxa
Copy link
Contributor Author

rzvxa commented Mar 26, 2024

Even in Visit which has the node reference passed through enter/leave_node, the need to keep a stack of parents while there is already such representation in semantics is counterproductive. It may also have minimal impact on performance and memory footprint while scaling but I don't find this part problematic. Not to mention the loss of extra information because of lack of access to the ast_node_id, basically if we want to add any complex analysis in which its result is needed further down the road would make for a good mind-boggler.

I guess it makes sense that Rust does it; since it needs complex analysis. Roc and Elm get away with this because of their language design.

If you are worried about having extra fields in the AST, We may be able to make it optional via features if we make the AstNodeId non-clonable I guess it would result in the cell needing a mutable reference to get the node_id without swapping it and it would limit its scope to the VisitMut trait which would result in it not being necessary for basic semantic analysis and tree traversal. I'm not really into having this field as a feature but it shouldn't be a problem implementing either.

@rzvxa
Copy link
Contributor Author

rzvxa commented Mar 26, 2024

In some cases we can intercept the parent using a visit on a higher node, for example instead of detecting a function parameter and then getting its parent function node, We can detect functions and search for a viable function parameter. I strongly believe that these cases should be written/ported from Babel in a manner that keeps the data flow local, We should go for using semantics information if it is absolutely necessary however it doesn't mean that we won't need it to implement some of these cases.


Edit

In other situations, it may result in cleaner code which could be favored over data locality.

@rzvxa
Copy link
Contributor Author

rzvxa commented Mar 26, 2024

Well if we go for this change it can be simplified into something more akin to this:

impl<'a> VisitMut<'a> for ArrowFunctions<'a> {
    fn visit_jsx_identifier(&mut self, ident: &mut JSXIdentifier<'a>) {
        let parent = self.ctx.semantics().nodes().parent(ident.ast_node_id());
        let parent_kind = parent.kind();
        let parent_parent_kind = self.ctx.semantics().nodes().parent_kind(parent.ast_node_id());
        // ...
        walk_jsx_identifier_mut(self, ident);
    }
}

Just want to point out an interesting approach, If we implement the ast_node_id method via the AstNode trait we can have methods and functions like parent take advantage of generics which makes a nicer interface.

impl<'a> VisitMut<'a> for ArrowFunctions<'a> {
    fn visit_jsx_identifier(&mut self, ident: &mut JSXIdentifier<'a>) {
        let parent = self.ctx.semantics().nodes().parent(ident);
        let parent_kind = parent.kind();
        let parent_parent_kind = self.ctx.semantics().nodes().parent_kind(parent);
        // ...
        walk_jsx_identifier_mut(self, ident);
    }
}

The AstNode trait can also be implemented for AstKind to expose the underlying id which would make them interchangeable in the context of semantics and syntax tree.

@rzvxa
Copy link
Contributor Author

rzvxa commented Mar 26, 2024

We can detect functions and search for a viable function parameter.

Just to make it clear, This approach is necessary if we want to replace the node instead of mutating it in place. So I believe that it makes sense to default to this when appropriate. At least for the minifier and transformers! Linters have more leeway for having declarative implementations.

@Boshen
Copy link
Member

Boshen commented Mar 29, 2024

How do we deal with enum types? The current approach has enum types in the parent pointing tree, it's going to be a impractical to migrate it (I think).

@rzvxa
Copy link
Contributor Author

rzvxa commented Mar 29, 2024

The current approach has enum types in the parent pointing tree

Can you give an example of these enums? Several types are enum but I can't find why it would conflict with them.
Let's say we want to change the Declaration enum to use this approach.

pub enum Declaration<'a> {
    VariableDeclaration(Box<'a, VariableDeclaration<'a>>),
    FunctionDeclaration(Box<'a, Function<'a>>),
    ClassDeclaration(Box<'a, Class<'a>>),
    UsingDeclaration(Box<'a, UsingDeclaration<'a>>),

    TSTypeAliasDeclaration(Box<'a, TSTypeAliasDeclaration<'a>>),
    TSInterfaceDeclaration(Box<'a, TSInterfaceDeclaration<'a>>),
    TSEnumDeclaration(Box<'a, TSEnumDeclaration<'a>>),
    TSModuleDeclaration(Box<'a, TSModuleDeclaration<'a>>),
    TSImportEqualsDeclaration(Box<'a, TSImportEqualsDeclaration<'a>>),
}

As you can see all of these enum variants would contain a node at the end, So we can use the underlying node's ID, Here's how we can implement this trait for enums:

trait AstNode {
    fn ast_node_id(&self) -> AstNodeId;
}

impl<'a> AstNode for Declaration<'a> {
    fn ast_node_id(&self) -> AstNodeId {
        match self {
                VariableDeclaration(node) => node.ast_node_id(),
                FunctionDeclaration(node) => node.ast_node_id(),
                ClassDeclaration(node) => node.ast_node_id(),
                UsingDeclaration(node) => node.ast_node_id(),
                // ....
        }
    }
}

@rzvxa
Copy link
Contributor Author

rzvxa commented Mar 29, 2024

@Boshen @milesj @Dunqing If I can get a little more feedback here and a 2 for 3 👍I'll start working on a prototype PR so we can find edge cases if any.

@rzvxa
Copy link
Contributor Author

rzvxa commented Mar 29, 2024

@Boshen

/// Array Expression Element
#[derive(Debug, Hash)]
#[cfg_attr(feature = "serialize", derive(Serialize))]
#[cfg_attr(feature = "serialize", serde(untagged))]
pub enum ArrayExpressionElement<'a> {
    SpreadElement(Box<'a, SpreadElement<'a>>),
    Expression(Expression<'a>),
    /// Array hole for sparse arrays
    /// <https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Trailing_commas#arrays>
    // Serialize as `null`. `serialize_elision` in serialize.rs.
    #[cfg_attr(
        feature = "serialize",
        serde(serialize_with = "ArrayExpressionElement::serialize_elision")
    )]
    Elision(Span),
}

The Elision variant here is the only example I found which needs refactoring; Since Span isn't an AstNode and shouldn't be we have to add an ElisionElement similar to SpreadElement and RestElement.


Edit:

We can also rewrite it as such:

    Elision(Span, AstNodeId),

@Boshen
Copy link
Member

Boshen commented Mar 29, 2024

The Elision variant here is the only example I found which needs refactoring;

Feel free to create another issue for this change.


You may have missed the thread on discord https://discord.com/channels/1079625926024900739/1222938985224081528

I'll create an umbrella issue for progress report and a milestone to officially kick off new transformer implementation.

@rzvxa
Copy link
Contributor Author

rzvxa commented Mar 29, 2024

Feel free to create another issue for this change.

I think it is a part of this issue and just something we have to do in order to get this working. It won't have any concerning effect on the other parts, We almost always ignore elision elements. It is just there to fill a space in the array.

You may have missed the thread on discord https://discord.com/channels/1079625926024900739/1222938985224081528

Yes, I saw this after commenting here, I don't have a habit of checking my discord servers(I still have notifications for directs).


I guess It's decided then, I'll start by creating a PR for this change. As soon as it gets to a working state I'll mention you guys for feedback.

@Boshen Boshen added this to the Transformer Milestone 1 milestone Mar 29, 2024
@Boshen
Copy link
Member

Boshen commented Mar 29, 2024

FYI I created the AST builder back in the day envisioning that we are going to add node ids ... I think this is the day. Good luck!

@rzvxa
Copy link
Contributor Author

rzvxa commented May 11, 2024

@overlookmotel @Boshen I still think there is a use for this ID, If we have the node ID along with our nodes we can traverse through things like scopes with no issue(both in linters and transformers). It makes rules like #2637 simpler(since we can rely on scope data without having to do a lot of iterations).

Let me know what you think, Since we are doing a lot of changes on AST maybe it is a good time to also sneak this in.

@overlookmotel
Copy link
Contributor

I do not have a clear idea at present of best way to implement scope binding lookups in transformers. You may be right that an AST node ID is the best way. On the other hand, we may be able to find a more efficient way using the kind of "pointer tricks" that Traverse is using.

So... er... don't know!

But I've made a comment on #2861 to not delete the branch in case we want to come back to this. Could you also please make sure you keep a local copy of the branch too, just in case?

@rzvxa
Copy link
Contributor Author

rzvxa commented May 11, 2024

I'll fork the branch; Graphite might cause me to delete the local branch by accident.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants