Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: A proposal for a better compiler IR #357

Open
bobbinth opened this issue Aug 22, 2024 Discussed in #342 · 0 comments
Open

RFC: A proposal for a better compiler IR #357

bobbinth opened this issue Aug 22, 2024 Discussed in #342 · 0 comments

Comments

@bobbinth
Copy link
Contributor

Discussed in #342

Originally posted by bitwalker July 21, 2023
While doing the refactoring work early this summer on the compiler, I made some compromises to keep large structural changes relatively contained, i.e. I didn't want downstream crates to change much, if at all possible. I wasn't super successful in that regard; while the codegen crates were largely unaffected, the IR crate was completely reworked, and knowing what I know now, I would have come at things a different way.

The main compromise I made was re-using the AST as the IR on which the compiler does the bulk of its work. What is currently in our ir crate is not actually the compiler IR, but rather the algebraic graph (AIR), i.e. the representation we want for codegen. However, most of the heavy lifting is done outside of the algebraic graph, on the AST.

There are a few issues with re-using the AST for our IR:

  • The inlining phase expands all abstractions, and due to the tree structure of the AST, is forced to duplicate nodes in many places. It is also forced to traverse that tree recursively in multiple places in order to apply certain optimizations, and the tree can get very deep, largely due to comprehensions. This performs badly in terms of both space and time, and could cause stack overflows depending on the depth of the AST.
  • Some optimizations such as global value numbering (for performing common subexpression elimination (CSE)) are impractical to apply, and would be awkward to write, since the AST is not dataflow-oriented. As re-using computations is important for performance, GVN+CSE is an optimization we would very much like to perform.

I'd like to propose that we plan to do the following:

  • Make the algebraic graph the IR of the codegen backend, keeping the AIR moniker
  • Introduce a new graph-structured IR for the middle-end which is dataflow-oriented, called MIR
  • Move all transformations/optimizations from the AST to MIR, along with more complex validations
  • Lower directly to MIR from the AST as soon as parsing is complete, with only minimal validations done at this stage

This would significantly reduce the complexity of the code currently in the middle tier (e.g. inlining), and would provide a much cleaner interface for writing additional optimizations/transformations. Much of the complexity in that code today exists because of the awkward way in which we have to traverse and modify the AST to perform analysis and effect desired changes. This would also have the effect of making the compiler snappier, and scale better (e.g. it fixes the potential for stack overflows during traversal).

We can do this in a few individual steps parallel to other AirScript development (with rough estimates on how long each would take):

  1. Move the algebraic graph (AIR) code out of the ir crate to a shared codegen crate which all of the codegen backend crates reference. (est. 1-2h)
  2. Implement MIR in the ir crate, but not yet used anywhere (est. 3d)
  3. Implement the lowering from AST to MIR, at this point the compiler still goes from AST to AIR directly (est. 2d)
  4. For each pass currently applied to the AST which performs some kind of optimization or transformation, implement the equivalent pass for MIR, along with corresponding tests. Each of these can be done in a separate PR (est. 1w)
  5. Implement the lowering from MIR to AIR (est. 1-2d)
  6. Update the compiler pipeline from AST->AIR to AST->MIR->AIR (est. 1h)
  7. Remove all of the now-unused AST code that handles optimization/transformation/etc. (est. 1h)

Since the level of abstraction in MIR would be the same as the AST, just structured differently, translating AST->MIR and MIR->AIR is trivial. Most of the development time required will be in handling the mechanical work of rewriting passes and tests in terms of the new structure, rather than implementing anything novel.

This proposal isn't necessary to implement right now, but I think we should start planning on it, and perhaps get steps 1-3 done so that we have a foundation on which to implement and test the remaining steps.

If anyone has any questions, I'm happy to answer them!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant