The code generators should be AST to AST transformations #6

Open
Araq opened this issue Apr 30, 2020 · 6 comments


@Araq
Member

Araq commented Apr 30, 2020

Currently the JS / C / C++ code generators produce the code directly as strings/ropes, so the result cannot be optimized further. This design is a messy legacy and prevents certain bugs from being fixed easily. A much more elegant design is to give the target language an internal (AST-like) representation that we convert the Nim AST to. There would be a simple "IR to text" step automating trivialities such as separating arguments with commas. The IR needs an escape hatch like cEmit so that the .emit statement can continue to work. Also, some parts of the code generation, such as the code generation for Nim's type information, might not map easily to a structured IR.

It's an open question whether the C IR should be based on Nim's PNode structure, but currently I am leaning heavily toward a new dedicated tree structure that naturally supports goto statements, for example.
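To make the proposal above concrete, here is a minimal sketch of what a dedicated target IR with an "IR to text" step could look like. It is written in Python purely for illustration, and every name in it (`Call`, `Emit`, `Goto`, `render`, and so on) is hypothetical; none of them come from the Nim compiler. It shows the three points raised: the renderer handles comma separation, `Emit` is the raw-text escape hatch, and `Goto`/`Label` are first-class nodes.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical node types for a tiny C-like IR (illustration only).

@dataclass
class Ident:
    name: str

@dataclass
class IntLit:
    value: int

@dataclass
class Call:
    fn: str
    args: List["Expr"]

@dataclass
class Emit:
    # Escape hatch: raw target-language text passed through untouched.
    text: str

Expr = Union[Ident, IntLit, Call, Emit]

@dataclass
class ExprStmt:
    expr: Expr

@dataclass
class Label:
    name: str

@dataclass
class Goto:
    target: str

Stmt = Union[ExprStmt, Label, Goto, Emit]

def render_expr(e: Expr) -> str:
    if isinstance(e, Ident):
        return e.name
    if isinstance(e, IntLit):
        return str(e.value)
    if isinstance(e, Call):
        # The renderer, not the code generator, worries about commas.
        return f"{e.fn}({', '.join(render_expr(a) for a in e.args)})"
    if isinstance(e, Emit):
        return e.text
    raise TypeError(f"unknown expression node: {e!r}")

def render_stmt(s: Stmt) -> str:
    if isinstance(s, ExprStmt):
        return render_expr(s.expr) + ";"
    if isinstance(s, Label):
        return f"{s.name}:;"  # label followed by an empty statement
    if isinstance(s, Goto):
        return f"goto {s.target};"
    if isinstance(s, Emit):
        return s.text
    raise TypeError(f"unknown statement node: {s!r}")

def render(stmts: List[Stmt]) -> str:
    return "\n".join(render_stmt(s) for s in stmts)
```

Because the program is a tree until the last step, an optimization pass could rewrite `Call` or `Goto` nodes before `render` runs, which is not possible once the output is already a string.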

@disruptek

I think it really needs to be a new IR because it has domain-specific problems^Wopportunities and the wins from separating the code seem so much greater than the reuse of admittedly well-trod PNode stuff.

The idea of making any change to PNode code (in order to support a new backend IR) sends a shiver down my spine. As a smaller and segregated entity, it can start off tight and evolve a little here and there without disrupting anything else.

@timotheecour
Member

timotheecour commented May 11, 2020

using an IR will eliminate a whole class of errors:

  • most codegen errors (where C compilation fails)
  • pre-codegen errors (where cgen.nim bails with some unsupported node kind)

it also enables so many use cases (eg a real REPL, jit during VM, wasm targeting...) that I think it's obviously the way forward.

  • We need an IR that can be translated to llvmIR + to js; it'd work like this:
nimcode => lex/semantic => IRNode (nim's IR)
IRNode => llvmIR => .o, .so, executable # for c,objc,cpp, emscripten, wasm backends
IRNode => js # for js backend

note that wasm can be generated from llvmIR (eg https://medium.com/@richardanaya/write-web-assembly-with-llvm-fbee788b2817) so there's no need for nim to worry about wasm, in theory. c, cpp, objc are irrelevant after the llvm IR stage, but they matter before IRNode is generated.

IRNode can be modeled after llvmIR, no need to re-invent wheel here; more precisely it should map to a subset of llvmIR and it should be easy to convert IRNode => llvmIR.

  • ideally this would 100% subsume https://github.com/arnetheduck/nlvm but nlvm could be a good resource to tap into /cc @arnetheduck

  • emit+asm is tricky but it's critical to keep that feature working (but ok to make adjustment to emit syntax if needed). One possibility is to hand off to a separate invocation of backend to process emit segments, eg:
    clang -S -emit-llvm section_extracted_from_emits.c and merge with rest of modules

  • to make transition practical, we need a way to specify whether to use old codegen or AST-AST codegen on a per module basis, so that real projects can opt-out of IRNode codegen for specific unsupported modules; these can happily co-exist, be compiled separately and linked into final binary

> It's an open question if the C IR should be based on Nim's PNode structure, but currently I am leaning heavily to new dedicated tree structure that naturally supports goto statements, for example.

IMO that's a no-brainer: it should be an unrelated IRNode that PNode transforms to. It gives maximum flexibility in evolving PNode and IRNode independently. I'm not worried about the cost of the PNode => IRNode conversion; it will be negligible.

> some parts of the code generation might not map easily to a structured IR such as the code generation for Nim's type information.

I'm not seeing a problem. We just generate code that generates static data containing serialized PNimType. It can be done cleanly.

@arnetheduck

Generally, transforming to an intermediate IR is a balance between information loss and simplicity - ie there are things that the language structurally enforces that will go missing in an IR form, depending on how it's chosen.

Often, optimizers then work backwards to recreate the missing information - a trivial example is liveness of variables - in Nim, a locally scoped variable in a block goes out of scope at block end but in the LLVM IR it lives for the duration of the function (mostly) and additional annotations are needed to trace the liveness back to use it in optimizations.
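As a rough illustration of the liveness point, the sketch below lowers a block-structured program to a flat instruction list and records explicit lifetime markers while the scope information is still available. It is a Python toy with hypothetical names (`Decl`, `Use`, `Block`, `flatten`), not anything from Nim or nlvm; the markers play the role that annotations such as LLVM's `llvm.lifetime.start` / `llvm.lifetime.end` intrinsics play in real IR.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

# Hypothetical structured IR: variables are scoped to the enclosing Block.

@dataclass
class Decl:
    name: str

@dataclass
class Use:
    name: str

@dataclass
class Block:
    body: List["Node"] = field(default_factory=list)

Node = Union[Decl, Use, Block]

def flatten(block: Block, out: Optional[List[Tuple[str, str]]] = None) -> List[Tuple[str, str]]:
    """Lower nested blocks to a flat list, emitting lifetime markers
    so that the scoping information is not lost in the flat form."""
    if out is None:
        out = []
    declared: List[str] = []
    for child in block.body:
        if isinstance(child, Decl):
            declared.append(child.name)
            out.append(("lifetime_start", child.name))
        elif isinstance(child, Use):
            out.append(("use", child.name))
        elif isinstance(child, Block):
            flatten(child, out)
        else:
            raise TypeError(f"unknown node: {child!r}")
    # Variables die at block end, innermost declaration first.
    for name in reversed(declared):
        out.append(("lifetime_end", name))
    return out
```

Without the `lifetime_end` entries, a later pass looking only at the flat list would have to rediscover that the inner variable is dead before the outer one is used again.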

I would generally not base an IR on LLVM - it's too low-level for representing the language in a way that is useful to many IR-to-IR transformations that could otherwise be made - for example, it would be nice to reason about ownership, lifetimes, callbacks, closures etc in the IR - inlining a closure that does not escape is a typical thing I'd do in such a transformation - RVO another. It is also machine-specific (you can't port IR code between different machines) and uses a lot of pointer manipulations that would be confusing for backends. If you want to work with LLVM IR, just write your transformation in C++ and contribute it to the LLVM compiler - the work will be much more broadly applicable and useful.

I would also not base it on PNode - as @disruptek points out, it's better for many reasons if the two are separate, also because one can reason more clearly about what's allowed and what isn't when the IR is smaller and more dedicated. PNode is too focused on the precise textual representation of the language and on what macros are supposed to be working with - it has a lot of distractions that make it clunky and inconvenient in other contexts.

Finally, it's not impossible that each backend should have its own IR - ie representing C as a pure C IR would allow the C backend to make better C code as well by exploiting C-specific quirks - one can imagine for example that casting, temporaries etc could easily be done better if the IR understood C - but these kinds of transformations would not benefit the JS and LLVM backends at all.

Every IR will need a generic way to transport additional information to the layer below it and back. For example, to reason about alignment and object size, one needs to know what the C compiler thinks about type sizes etc, which depends on the compiler used and the flags passed to it. Likewise, features like header and emit are backend-specific but still need to "survive" through the transformations - source information is another such "metadata"-like structure. LLVM for example indeed has custom metadata support so that frontends and optimization passes can communicate.
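One possible shape for such a metadata channel, sketched in Python with hypothetical names (`Node`, `meta`, `fold_add`): every node carries an open-ended dictionary, and each transformation is responsible for copying it onto whatever node it produces. The keys used here (`line`, `header`) are only examples of the kinds of information mentioned above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

@dataclass
class Node:
    kind: str                      # e.g. "int", "add"
    args: Tuple[Any, ...] = ()     # child nodes or literal payload
    # Free-form side channel: source location, header/emit info, sizes...
    meta: Dict[str, Any] = field(default_factory=dict)

def fold_add(n: Node) -> Node:
    """Constant-fold an add of integer literals, keeping the metadata
    of the original node on the folded result."""
    if n.kind == "add" and n.args and all(
        isinstance(a, Node) and a.kind == "int" for a in n.args
    ):
        total = sum(a.args[0] for a in n.args)
        return Node("int", (total,), dict(n.meta))
    return n
```

The design choice illustrated is only that the pass never has to understand the metadata in order to preserve it; what the keys mean is left to the frontend and the backend that agree on them.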

@arnetheduck

If anything, https://rust-lang.github.io/rfcs/1211-mir.html would be a better source of inspiration for an IR - it's still powerful enough to do transformation but a lot simpler than the full language, making backend implementation a breeze (including C and nlvm/llvm) - it's designed to allow reasoning about the application in a way that's similar to what drnim tries to achieve - doing it this way would make it easier to feed back information to the user as well - not-nil analysis, static overflow checking etc etc. The LLVM IR is too low level for this also.

@arnetheduck

> wasm

LLVM IR is register-based - WASM is stackbased - this discrepancy causes issues generating one from the other - again, a higher level IR would likely be easier if you want separate backends for these.
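To illustrate the difference, here is a small Python sketch that lowers the same expression tree two ways: to a stack-style instruction sequence (using WebAssembly's `i32.const` / `i32.add` / `i32.mul` mnemonics) and to register-style three-address code (using LLVM-like `%tN = add i32 a, b` syntax). The tree types and function names are hypothetical and the output is text for reading, not a complete module for either target.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str            # "add" or "mul"
    left: "Expr"
    right: "Expr"

Expr = Union[Num, BinOp]

def to_stack(e: Expr) -> List[str]:
    """Post-order walk: operands are pushed, the operator pops them."""
    if isinstance(e, Num):
        return [f"i32.const {e.value}"]
    return to_stack(e.left) + to_stack(e.right) + [f"i32.{e.op}"]

def to_registers(e: Expr) -> Tuple[List[str], str]:
    """Same walk, but every intermediate result gets a named temporary."""
    code: List[str] = []
    counter = 0

    def go(n: Expr) -> str:
        nonlocal counter
        if isinstance(n, Num):
            return str(n.value)
        left = go(n.left)
        right = go(n.right)
        counter += 1
        reg = f"%t{counter}"
        code.append(f"{reg} = {n.op} i32 {left}, {right}")
        return reg

    result = go(e)
    return code, result
```

Starting from the tree, both forms fall out of one traversal. Going from the register form to the stack form means first recovering the expression structure from the named temporaries, which is the kind of friction a higher-level IR would avoid.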

@stisa

stisa commented May 23, 2020

Just wanted to say that I'd love to see nim move towards ast-to-ast backends. I played with the idea using wasm as a target, and a simplified AST, kind of like what MIR is for Rust, would surely help.
The generators can then, if needed, further fit the IR into a target-specific representation before the final binary/text generation step, to make use of target-specific optimizations or, eg, to convert to wasm's stack-based format.
