refactor(ast): rework enum of structures #2876
Conversation
rzvxa commented on Mar 31, 2024 (edited)
- ForStatementInit
- MemberExpression
- PropertyKey
- Argument
- AssignmentTarget
- SimpleAssignmentTarget
- ArrayExpressionElement
- Expression
- ModuleDeclaration
- JSXElementName
- JSXAttributeItem
- JSXMemberExpressionObject
- TSTypeName
- Statement (not needed, but we might also refactor it for good measure)
CodSpeed Performance Report: Merging #2876 will degrade performance by 3.62%.
Personally, it looks good to me. It's the same as how
We may need to consider #2854.
That's a good point; I didn't think of
I'll give it a read. Since @overlookmotel is also currently working regularly on the AST for better napi performance, we might want to wait for his feedback. We may have to take things like #2457 and #2847 into consideration.
Cross-post from Discord: Can we somehow validate whether node IDs are going to work before going down this rabbit hole? Can we somehow reduce the surface area for validation: find a small real example where we need the ID and do all sorts of operations? rustc doesn't have any concrete AST nodes that are enums (https://doc.rust-lang.org/beta/nightly-rustc/rustc_ast/ast/index.html#enums). I think this is the right direction, but I don't want us getting burned out with these large changes and then finding out it's not going to work.
I'll continue this conversation over here for historical reasons. Yes, it is possible; I'll do it for a single enum type first to make sure it works.
@rzvxa Thanks for looping me in. I'm not aware of the context, so I'm having a hard time understanding the motivation for this change. Would you be able to point me in the direction of the Discord conversation so I can read up?

FYI, the problem with enums I've been having in #2457 is: I need to be able to determine what niche value is used for

You can make this deterministic by making enums

However, when you have an enum within an enum, e.g.

I think #2847 is likely a dead end, so I'll have to find another way, and maybe we just have to accept that while it could break in a new Rust release, in practice it probably won't. But the upshot is: if, in the process of the changes you're planning to make, it'd be possible/easy to remove all nested enums, that would solve a problem for AST transfer.
And... just to throw one more thing into the mix: if we're planning to make changes to enums, we could also consider trying to reduce the type sizes at the same time. This is a fairly common pattern in Oxc's AST:

oxc/crates/oxc_ast/src/ast/js.rs, lines 466 to 469 in 619fc61
I imagine this would improve performance somewhat, but the question is whether it can be made ergonomic: you then can't use

This may not be worth doing at this stage (or maybe at any stage), but I just thought I'd mention it in case it's useful to have in mind.
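The size-reduction pattern being discussed can be sketched with invented types (`BigNode` is hypothetical, and the exact figures assume a 64-bit target):

```rust
use std::mem::size_of;

// Hypothetical stand-in for a large AST node; not a real oxc type.
struct BigNode {
    _data: [u64; 8], // 64 bytes inline
}

// Storing the node inline makes the enum as large as its biggest variant
// (plus the discriminant and padding).
enum Inline {
    Big(BigNode),
    Small,
}

// Boxing the large payload keeps the enum at pointer size: the
// discriminant fits in the Box's null niche.
enum Boxed {
    Big(Box<BigNode>),
    Small,
}

fn main() {
    assert!(size_of::<Inline>() > size_of::<Boxed>());
    assert_eq!(size_of::<Boxed>(), size_of::<usize>());
    println!("inline: {} bytes, boxed: {} bytes", size_of::<Inline>(), size_of::<Boxed>());
}
```

The trade-off is exactly the ergonomic one mentioned above: every access to the boxed field now goes through an extra pointer indirection.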
Gladly. This PR aims to resolve some of the roadblocks in the progress of #2818.
Does this also get squashed, or does it stay intact?

```rust
Enum1::VariantA(StructA { foo: Enum2::VariantB(something) })
```

Because with this change we eliminate most, if not all, of the nested enums into something more akin to the pseudocode mentioned above.
What is it going to look like? Do we essentially have to implement our own tagged union? Or is it about making a custom Box that would handle the inner value with raw pointers? I'm having a hard time imagining the end result.
Thanks. I'm on holiday at the moment, but will read up as soon as I get a chance.
It appears it does get squashed (Rust playground).

In general the squashing is good, as it minimizes type sizes. But, for my purposes, this:

```rust
enum Foo {
    A(Box<A>),
    B(Box<B>),
    C(Box<C>),
    D(Box<D>),
}
```

is preferable to:

```rust
enum Foo {
    A(Box<A>),
    B(Box<B>),
    Bar(Bar),
}

enum Bar {
    C(Box<C>),
    D(Box<D>),
}
```

i.e. do the squashing explicitly, rather than leaving it to the compiler. But like I said, I'm kind of resigned at this point to a more hacky/fragile solution to this problem than I'd ideally like. We shouldn't bend everything around the needs of AST transfer. I just mention it because if the new design was moving in that direction anyway, it'd be great to solve this problem at the same time. But if it's not, don't worry about it.
AST types don't have
Wow! Then thanks for answering on your holiday.
Well, it's literally the solution to another issue (having concrete AST nodes so we can add their IDs to them). So if we come up with no alternative solution, it kills two birds with one stone. At least in theory; we have to see how squashing behaves in more advanced situations.
I also like the idea of making our types more compact, since it may help with cache misses in our native Rust world, so its effect might not be limited to the transfer. However, implementing

Maybe we can find another way to pack some of our types without going full-on tagged unions? I mean, on a case-by-case basis there might be some room for improvement.
Maybe... but I suspect the compiler might be smart enough to strip that out again when inlining

By the way, my point about reducing type sizes is not related to AST transfer; it was a separate thought. Anyway, perhaps I've led this too far off topic.

Concerning this PR: there is one downside I can see. If you later add

There's a hacky way to get around that: Playground. But ooof! Not nice.

I'm a bit unclear why the more obvious solution of storing node IDs inside the variants' types themselves isn't workable. If the problem is needing to access the ID cheaply, you could make the structs
Nice! It was one of my concerns; however, doing it with
That's exactly what I was hoping to do. Sadly, both oxc and Babel (which we are porting a bunch of things from) make assumptions about enum types. Let's take `ForStatementInit`:

```rust
pub enum ForStatementInit<'a> {
    VariableDeclaration(Box<'a, VariableDeclaration<'a>>),
    Expression(Expression<'a>),
    UsingDeclaration(Box<'a, UsingDeclaration<'a>>),
}
```

When visiting our nodes we might visit the init expression as an

Babel gets away with this since they are using the

We still have transparent enums, for example

I like the way you think about problems; please let me know if you have any other approach to this. They say when all you have is a hammer, you see everything as nails, and in the world of compilers written in Rust I'm most familiar with rustc, so that's why I usually gravitate toward what is done there. Which might not be the most optimized way out there, hence the compile times for Rust projects (although to be fair, Rust does a lot of work compared to JS/TS compilers).
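To make the constraint concrete, here is a rough sketch, with invented, heavily simplified types rather than oxc's real ones, of the kind of conversion helper a flattened enum would force on visitors:

```rust
// Simplified stand-ins for the real AST types; names and shapes are
// invented for illustration, not oxc's actual definitions.
enum Expression {
    NumericLiteral(f64),
    Identifier(String),
}

// A hypothetical *flattened* ForStatementInit that inlines the
// Expression variants instead of nesting `Expression` as one variant.
enum ForStatementInit {
    VariableDeclaration(String),
    NumericLiteral(f64),
    Identifier(String),
}

impl ForStatementInit {
    // A visitor that wants to treat the init as an expression now needs
    // a conversion like this, duplicated for every flattened variant.
    // Note it must *construct* an Expression rather than borrow one,
    // which is part of why flattening these enums is unattractive.
    fn as_expression(&self) -> Option<Expression> {
        match self {
            Self::NumericLiteral(n) => Some(Expression::NumericLiteral(*n)),
            Self::Identifier(name) => Some(Expression::Identifier(name.clone())),
            Self::VariableDeclaration(_) => None,
        }
    }
}

fn main() {
    let init = ForStatementInit::Identifier("i".to_string());
    assert!(matches!(init.as_expression(), Some(Expression::Identifier(_))));
    let decl = ForStatementInit::VariableDeclaration("x".to_string());
    assert!(decl.as_expression().is_none());
}
```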
Thanks for explaining. Now I understand the problem. But... don't you have a similar problem in other places? e.g.:

oxc/crates/oxc_ast/src/ast/js.rs, lines 1262 to 1266 in e76d70f
If you are visiting a

Or:

oxc/crates/oxc_ast/src/ast/js.rs, lines 888 to 894 in e76d70f
If you are visiting an

Babel's

On the problem of type sizes increasing, this is better than my previous hacky suggestion:

```rust
// 16 bytes - Node ID fits after the discriminant
enum EnumWithId {
    A { ast_node_id: u32, node: Box<NodeA> },
    B { ast_node_id: u32, node: Box<NodeB> },
    C { ast_node_id: u32, node: Box<NodeC> },
}
```

But... if you nest the enums the way they are in the current AST, you don't get the "squashed" discriminant optimization, and the type increases to 24 bytes. You have to flatten into a single enum to keep the size down.
You are absolutely right. This approach is only there to let us infer information about neighbors, and to reference a particular node in our semantic, scope, and other SoA data structures. For replacing nodes in the tree, I was going to refactor VisitMut so it would always return a result. The result can be set to the unit type in the implementor, but if a visitor wants, it can return the replacement nodes when visiting a single node. This way, if we return a SmallVec, we can substitute one node with one or many without keeping track of its index. At least that's the plan; I have to see how it works out.
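That plan might look roughly like this minimal sketch. The `Statement` type, the trait shape, and `Vec` (standing in for `SmallVec`) are all assumptions for illustration, not oxc's actual API:

```rust
// Hedged sketch of the "visitor returns replacements" idea.
#[derive(Clone, Debug, PartialEq)]
struct Statement(u32);

trait VisitMut {
    // Default behavior: keep the statement unchanged.
    fn visit_statement(&mut self, stmt: Statement) -> Vec<Statement> {
        vec![stmt]
    }
}

// Rebuilding a block body lets one statement become zero, one, or many
// statements, without the visitor tracking its index in the parent.
fn visit_block<V: VisitMut>(visitor: &mut V, body: Vec<Statement>) -> Vec<Statement> {
    body.into_iter()
        .flat_map(|stmt| visitor.visit_statement(stmt))
        .collect()
}

// Example visitor: duplicates every statement it visits.
struct Duplicator;
impl VisitMut for Duplicator {
    fn visit_statement(&mut self, stmt: Statement) -> Vec<Statement> {
        vec![stmt.clone(), stmt]
    }
}

fn main() {
    let body = vec![Statement(1), Statement(2)];
    let out = visit_block(&mut Duplicator, body);
    assert_eq!(out.len(), 4);
    assert_eq!(out[0], out[1]);
}
```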
It is a known issue with my proposal. For this, we have to visit the
I'll take this into consideration. Some nested enums are impossible to flatten; for example, we need some enum types that point to an expression, and bringing in everything from Expression is an unreasonable solution for not much difference in the memory footprint. It can be used to make a few types compact, though. To be honest, I won't go too hard on packing our structures right now, since I prefer to bring this to a working state and then worry about that optimization. I would love to do it at some point, and I'll need your help to do so, since you have clearly thought about this for a while.
Actually, I realized I am absolutely wrong! If you have a reference to the statement, you can deduce its index in the parent:

```rust
fn statement_index(stmt: &Statement, parent: &BlockStatement) -> usize {
    (stmt as *const _ as usize - parent.body.as_ptr() as usize) / std::mem::size_of::<Statement>()
}
```

Ditto deducing whether an expression is the left or right operand of a binary expression:

```rust
fn expr_is_left(expr: &Expression, parent: &BinaryExpression) -> bool {
    expr as *const _ as usize
        == parent as *const _ as usize + std::mem::offset_of!(BinaryExpression, left)
}
```
OK. That's fair. But in my opinion, this pattern is preferable to enum-within-struct, as it saves 8 bytes on most enums with little penalty in terms of usability:

```rust
// 16 bytes
enum EnumWithId {
    A { ast_node_id: u32, node: Box<NodeA> },
    B { ast_node_id: u32, node: Box<NodeB> },
    C { ast_node_id: u32, node: Box<NodeC> },
}

impl EnumWithId {
    fn ast_node_id(&self) -> u32 {
        // Compiler will compress this to `(self as *const _ as *const u32).add(1).read()`
        // i.e. single instruction, no branch
        match self {
            Self::A { ast_node_id, .. } => *ast_node_id,
            Self::B { ast_node_id, .. } => *ast_node_id,
            Self::C { ast_node_id, .. } => *ast_node_id,
        }
    }
}
```

https://godbolt.org/z/r9nEq8KPP

In my view it is worthwhile avoiding an extra 8 bytes on every

What do you think?
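The size claim is easy to verify in isolation. `NodeA`/`NodeB` below are stand-ins for real AST structs, and the 16-byte figure assumes a 64-bit target:

```rust
use std::mem::size_of;

// On 64-bit, the u32 id packs into the bytes after the discriminant,
// keeping the enum at 16 bytes (tag + id + pointer).
struct NodeA(u64);
struct NodeB(u64);

enum EnumWithId {
    A { ast_node_id: u32, node: Box<NodeA> },
    B { ast_node_id: u32, node: Box<NodeB> },
}

impl EnumWithId {
    fn ast_node_id(&self) -> u32 {
        match self {
            Self::A { ast_node_id, .. } => *ast_node_id,
            Self::B { ast_node_id, .. } => *ast_node_id,
        }
    }
}

fn main() {
    assert_eq!(size_of::<EnumWithId>(), 16);
    let node = EnumWithId::A { ast_node_id: 7, node: Box::new(NodeA(0)) };
    assert_eq!(node.ast_node_id(), 7);
}
```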
I can't wrap my head around this one. How does it work?
Well, I think this is approachable. I have to let @Boshen make the final call here, but I'm with you on this one; I think it is a clever solution and the ugliness is mild. It doesn't even need a macro to hide the implementation, since it is understandable and reasonable to work with. We still might want a macro like ast_match to hide the struct pattern so people can use it similarly to normal enums. And are we sure about the predictability of this solution when it comes to how the Rust compiler packs the types? What happens to nested enums with this?
@overlookmotel Oh boy, you are a knowledgeable person; I love talking with you. You shine a new light on every problem. Thanks for the suggestion on finding the leaf position in the parent; it is remarkable!
It's predictable enough. From first principles:

```rust
enum EnumWithId {
    A { ast_node_id: u32, node: Box<NodeA> },
    B { ast_node_id: u32, node: Box<NodeB> },
    C { ast_node_id: u32, node: Box<NodeC> },
}
```
So, to maintain the alignment constraints, the compiler has limited options. Firstly,

Then, of the remaining 8 bytes,

The compiler is free to put the discriminant anywhere in the

But the point is: whichever of the 4 possible representations above the compiler chooses, the

That's all theoretical. In practice the compiler always chooses this repr (
But even if the compiler chose to do something different in future, all the possible options display the predictability we want.

Important proviso: all this holds true as long as the "payload" has alignment 8 for all variants. This is true for almost all AST types, because they pretty much all contain at least one
The compiler is not clever enough to combine the discriminants of the inner and outer enums, so it costs 8 bytes (for the inner enum's discriminant). But at worst, that cost just negates the saving from the optimization above. So in the most common cases it saves you bytes, and for nested enums it doesn't save anything, but it doesn't cost anything either.
That's really kind of you to say! But, just to be honest: yes, I do know this specific area very well, because implementing AST transfer is all about understanding Rust's type layouts. But in other areas I still have a lot to learn. And I'm sure you can teach me!
Actually, a correction: all the above assumes a 64-bit architecture where pointers are aligned on 8. But on WASM32, pointers are aligned on 4, I think, so theoretically the compiler could put
Sorry, I just realized something: as far as I'm aware, it's not necessary for it to be as big as

If it is technically possible for some bonkers JS input to be
First off, thanks for the thorough explanation. As for architecture, I think it is sane to optimize for 64-bit, since nowadays it is tough to find a workstation that isn't running a 64-bit processor. WASM isn't the right environment for processing really huge JavaScript files to begin with.
It isn't a deal breaker. The project is heading toward having everything feature-gated, so it isn't so far-fetched to add some
Right now I've shifted my focus toward figuring out tree mutation with @milesj; I'll revisit this issue in the next week or so. I'll let you know when I have a PR containing any changes related to this subject.