
feat(masm): refactor assembler into a more compiler-like architecture #1277

Merged
merged 69 commits into from
Apr 29, 2024

Conversation

Contributor

@bitwalker bitwalker commented Mar 8, 2024

Important

This is a large PR, with many changes interlinked. I have broken up the changes into many smaller pieces that introduce them piece-by-piece for review, but those pieces may refer to things that come in later commits, or be incomplete in and of themselves. This is intentional, and you should use the commits to understand the structure of the changes being made, and then attack the review however you prefer from there.

Please read the commit messages, as they also further introduce the work done and what I'm trying to convey in each particular chunk.

This PR is a major refactor of the assembler crate, with various bits of new functionality in the miden-core and miden-assembly crates. There are some changes made in other crates, but these are largely tests, where the interaction with the assembler has either changed in some way, or the output being tested has changed. The following section lists the most significant changes to be aware of - keep in mind that, as I laid out in my note at the top, the changes are described in more detail in the commit messages, so read those for more info, but I'm happy to answer any questions you may have as well.

Summary

Here's a high-level overview of what is contained in this PR:

  • A completely new MASM frontend (i.e. parser/lexer/diagnostics), which makes use of a formal LALR(1) grammar, and corresponding generated parser. The grammar and parser generator used is lalrpop, and the lexer generator is logos - both are chosen because of my prior experience in compiler projects with them. They are solid, well-built crates, with a decent community around them, and both support no-std builds.
  • Rewritten source-location tracking, which shares its representation with the Miden compiler; this will make later work to propagate source location information from the compiler frontend through Miden Assembly trivial. Most importantly, it enables precise diagnostics with source code snippets. See the tests for examples of what these diagnostics look like in practice.
  • A new way of expressing errors and diagnostics in the assembler and core crates (and other crates as well, when we want to start using them there). This makes use of the thiserror and miette crates (forked versions currently, for no-std support, until error_in_core is stabilized). These enable very ergonomic error types (via thiserror and #[derive(Error)]), and the ability to layer diagnostics on those same types (via miette and #[derive(Diagnostic)]). The diagnostics infrastructure is quite configurable, and I suspect we may want to take advantage of its JSON reporting backend for running the VM in the browser, so that we can extract error information in JavaScript and render pretty errors in the browser-based editor.
  • A source code formatter for Miden Assembly and MAST (in its textual form). This is based on the algorithm described by Philip Wadler, commonly called Prettier. You interact with the pretty printer by implementing the PrettyPrint trait, which is implemented in miden-core under core/src/prettier, by describing how you want to lay out the components of the syntax to be rendered. The prettier algorithm then takes that description and renders it using a language-agnostic algorithm. I have implemented pretty printers for both MASM and MAST, but feel free to modify their implementations as you see fit. One of the interesting things you can do with it is describe alternate layouts, which the algorithm will choose between depending on the available width of the output window (which defaults to 80 columns). You can see this in effect with span in the case of MAST, which will alternatively render single- vs multi-line depending on the size of the span.
  • Syntax tree visitors for the MASM abstract syntax tree, which allows you to concisely express analyses and rewrites over the tree without having to implement the traversals by hand each time. This is used in a few places in the refactored code.
  • A new semantic analyzer, which performs a variety of first-pass validation checks after parsing MASM source code.
  • A largely rewritten set of AST nodes. Many of the previous ones are still there in some form (especially Instruction, which hasn't changed much), others are expanded with new functionality or metadata (e.g. InvocationTarget), and others are entirely new, or rewritten so as to be essentially new. Virtually all of these nodes now implement a Spanned trait which returns the source span for that node, or a default one if there is no source code.
  • A significant rewrite of the Assembler internals. Some parts of this are virtually untouched (particularly the translation to MAST, with the main exception being how procedure calls are handled). The way in which the assembler is instantiated and used is slightly different, but much more powerful now. These changes are all driven off the new "module graph", which is used for global inter-procedural analysis during assembly.
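To make the Spanned trait mentioned above concrete, here is a minimal stdlib-only sketch of the idea; the SourceSpan layout and the Ident node are hypothetical stand-ins for illustration, not the actual miden-core types:

```rust
use core::ops::Range;

// Hypothetical compact source span: byte offsets into a source file.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct SourceSpan {
    start: u32,
    end: u32,
}

impl SourceSpan {
    pub fn new(range: Range<u32>) -> Self {
        Self { start: range.start, end: range.end }
    }
    pub fn len(&self) -> usize {
        (self.end - self.start) as usize
    }
}

// Every AST node implements this, returning its span (or a default
// span when there is no source code backing the node).
pub trait Spanned {
    fn span(&self) -> SourceSpan;
}

// An example AST node carrying its span alongside its payload.
pub struct Ident {
    pub name: String,
    pub span: SourceSpan,
}

impl Spanned for Ident {
    fn span(&self) -> SourceSpan {
        self.span
    }
}

fn main() {
    let id = Ident {
        name: "example_function".into(),
        span: SourceSpan::new(10..26),
    };
    // The span lets diagnostics point back at the exact source range.
    assert_eq!(id.span().len(), 16);
}
```

A default span (via `Default`) covers the "no source code" case described above.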

Changes to MASM Syntax

I'd like to devote a section just to this topic, as these changes are both a motivation for some of the refactoring, and a result of reimplementing compilation in a more principled fashion:

  • You may now express arbitrarily-long identifiers in MASM, e.g. the 255 character limit is removed.
  • You may now specify identifiers that were previously illegal by quoting them in double-quotes, e.g. export."_RNvCskwGfYPst2Cb_3foo16example_function" (an identifier generated from Rust's mangling scheme). The default "bare" identifiers still have largely the same constraints as before (start with an alphabetic character, contain only alphanumerics and underscores, etc.).
  • You may now order procedures in a module any way you like, i.e. you do not need to define a procedure before it is referenced. This allows you to organize modules as you see fit to group functionality. The inter-procedural analysis done during compilation will ensure that no cycles in the call graph are introduced.
  • Similarly, you may introduce imports, constants, and other top-level items in any order, i.e. it is now perfectly acceptable to place constant definitions close to their usages. This means you can also reference definitions out of order, as long as when the names are resolved there are no cycles (which is also detected and raised as a diagnostic).
  • Kernel modules are expected to export their syscalls, and to declare their private internal helper functions with proc. I actually can't recall whether that was the case before, but it certainly is now.
  • All instructions which accept immediates now also accept constant identifiers as well as literals. The majority of instructions already supported constants, but it is now universal. The only exception to this rule (and it could easily be removed) is the exp.uXX instruction, which currently only accepts a literal.
  • Imports are now purely syntactic sugar - they have no bearing on the compiled code at all, as the imports are resolved to their actual definitions during compilation (no matter how many re-exports have to be followed, though again, cycles are detected and not allowed). We could easily add support for importing constants as well, but I have not implemented that here.

I think that largely covers the changes to the surface syntax, at least that I can recall off the top of my head here.

Changes to MAST/Assembler

One of the major things that falls out of the Assembler refactoring is that we now always visit procedures in reverse topological order, i.e. callees before callers. As a result, we always know the MAST root for every procedure being called in the program. Instead of inlining the body of every procedure at every call site, the assembler now emits PROXY blocks for procedures called via exec (and CALL for call, but that's not new). This has the effect of making the emitted MAST drastically smaller. I added support in the processor for executing proxy blocks (it was an error previously) - if that is not the correct behavior, we may want to revisit this as part of the larger MAST refactoring. The only downside is that the textual MAST now has proxy.HASH where there used to be inlined code, but even that is something we could easily address in the formatter if it turns out to be a problem. I have also added a test helper that makes tests expecting certain textual MAST output containing calls much more readable, and less fragile to changes in the standard library, etc.

The other significant change to the way the assembler works is that the procedure cache has been rewritten to be tightly integrated with the module graph. This enables some nice optimizations, and in particular, allows us to use plain integer identifiers rather than having to hash procedure ids. The downside is that the procedure cache cannot be shared across assembler instances - whether that's a problem in practice I think remains to be seen. I do think that we'll want to further refactor the assembler in the future to enable a greater degree of sharing of the module graph and procedure cache, but there is a tradeoff in performance either way, so I erred on the side of single-threaded use cases for now.

Diagnostics

This PR introduces two dependencies to aid in creating error types and diagnostics, thiserror and miette respectively. These use forks I made from the latest versions of both crates, so as to add support for #![no_std] environments until the error_in_core feature stabilizes. Let's take a brief look at how these are used:

Defining an error

Let's use the parser as an example. The following is a subset of its error definition, demonstrating how thiserror can be used to define the type while simultaneously implementing From for any errors it encapsulates, as well as the Display and Error traits:

#[derive(Debug, Clone, thiserror::Error)]
pub enum ParsingError {
    #[error("invalid token")]
    InvalidToken {
        span: SourceSpan,
    },
    #[error("unrecognized token: expected {}", expected.as_slice().join(", or "))]
    UnrecognizedToken {
        span: SourceSpan,
        token: String,
        expected: Vec<String>,
    },
    // ...
}

That's it! This makes defining error types for each task convenient, and provides a very natural way to display the data associated with an error as part of its Display implementation.
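For readers unfamiliar with the derive, here is a stdlib-only sketch of roughly what #[derive(thiserror::Error)] expands to for the enum above, hand-written so it runs without the crate (the SourceSpan type here is a stand-in, and the expansion details are simplified):

```rust
use std::fmt;

// Stand-in for the real SourceSpan type.
#[derive(Debug, Clone, Copy)]
pub struct SourceSpan {
    pub start: u32,
    pub end: u32,
}

#[derive(Debug, Clone)]
pub enum ParsingError {
    InvalidToken { span: SourceSpan },
    UnrecognizedToken { span: SourceSpan, token: String, expected: Vec<String> },
}

// Roughly what the derive generates from the #[error(...)] attributes:
// a Display impl that formats the data associated with each variant.
impl fmt::Display for ParsingError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::InvalidToken { .. } => write!(f, "invalid token"),
            Self::UnrecognizedToken { expected, .. } => {
                write!(f, "unrecognized token: expected {}", expected.join(", or "))
            }
        }
    }
}

// The derive also implements std::error::Error (and From for any
// #[from] fields, which this subset doesn't use).
impl std::error::Error for ParsingError {}

fn main() {
    let err = ParsingError::UnrecognizedToken {
        span: SourceSpan { start: 37, end: 39 },
        token: "->".into(),
        expected: vec!["begin".into(), "const".into()],
    };
    assert_eq!(err.to_string(), "unrecognized token: expected begin, or const");
}
```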

Now let's extend our ParsingError type for use in our diagnostics system:

#[derive(Debug, Clone, thiserror::Error, Diagnostic)]
pub enum ParsingError {
    #[error("invalid token")]
    #[diagnostic()]
    InvalidToken {
        #[label("occurs here")]
        span: SourceSpan,
    },
    #[error("unrecognized token")]
    #[diagnostic(help("expected {}", expected.as_slice().join(", or ")))]
    UnrecognizedToken {
        #[label("lexed a {token} here")]
        span: SourceSpan,
        token: String,
        expected: Vec<String>,
    },
    // ...
}

As you can see, this feels like a natural extension of thiserror, with support for decorating things that are convertible to SourceSpan as "labels" in the diagnostic output. Here's what an instance of the UnrecognizedToken error above looks like when rendered with source code:

  x unrecognized token
         ,-[test1737:2:37]
       1 |
       2 |         use.dummy::math::u64->bigint->invalidname
         :                                     ^|
         :                                      `-- lexed a -> here
       3 |
         `----
        help: expected "begin", or "const", or "export", or "proc", or "use",
      or end of file, or doc comment

I actually lifted the example above from our test suite (the module_alias test). These pretty errors are enabled by associating source spans with specific errors, and then pairing them with a reference to the source code from which those spans were derived; this can be done in a few different ways.

MASM Serialization

While this PR does not remove serialization of MASM, it does change it, as the AST has changed, and some of the restrictions have been loosened. I think we should plan on removing serialization entirely once the MAST refactoring is done, which should now be trivial with the changes contained in this PR having already set the stage for it.

TODO

There are a few things that still need to be done, and they are small, but I want to call them out here:

  • Need to rebase on the most recent changes in next, I'm ~12 commits behind, and there are certainly conflicts there, but I wanted to tackle that once I got this open first.
  • There is a single failing test that I need to diagnose (in case one of you knows why it alone is failing!), which can be run with cargo test -p miden-stdlib -- crypto::sha256::sha256_hash_memory. It is failing with FailedAssertion { clk: <some number of cycles>, err_code: 0, err_msg: None }, but no additional info. I'm assuming it is due to my changes, but find it odd that all other tests pass and this one does not.
  • I would like to confirm that we actually want to support things like begin end or if.true end in the surface Miden Assembly syntax. I have chosen not to allow this in the parser, to keep the grammar simpler, and because it honestly doesn't make sense that anyone would write an empty block by hand, nor is it clear what the purpose of doing so would be. This obviously has nothing to do with supporting empty blocks in MAST, only in MASM source code.
  • On a related note, I've also relaxed the behavior with regard to multiple imports of the same module with different aliases (as long as they are both used). This used to be an error; as of these changes, it no longer is. We can certainly make it an error easily enough, but I feel there is a valid reason to allow this, and that we would only want to raise a warning/error if an import is unused (e.g. while refactoring a large program, you might move stuff using the old name to a new module that is aliased, while simultaneously importing the new module with its "true" name, and have all new code use that import, thus both imports are valid and useful, despite being overlapping).

TL;DR

This is an enormous PR, but it was more efficient to group these changes together like this than to make them incrementally (which would have taken many weeks). That said, please take as much time as you need; I don't want this to be a nightmare to review. For that reason I spent an inordinate amount of time breaking this up into smaller commits that can be reviewed piecemeal, but that stack in such a way that you can pretty much start at the oldest commit, work forward, and have it go fairly painlessly.

I'm also happy to walk through the changes or any questions you have on a call, in Discord, etc., if you feel it will be easier to have me on hand to answer questions. I'll literally pair up for the whole review if you want - I opened this monstrosity, so it's only fair that I go out of my way to make it easy on the rest of you. Just let me know!


I ran the benchmarks, and here's what I'm seeing compared to next:

❯ cargo bench -p miden-vm -- --save-baseline assembler-refactor
warning: src/github.com/0xpolygonmiden/miden-vm/miden/Cargo.toml: unused manifest key: bench.0.debug
   Compiling miden-vm v0.8.0 (src/github.com/0xpolygonmiden/miden-vm/miden)
    Finished bench [optimized] target(s) in 2.84s
     Running benches/program_compilation.rs (target/release/deps/program_compilation-f49d9b2efbd5811d)
Gnuplot not found, using plotters backend
program_compilation/sha256
                        time:   [5.1606 ms 5.1667 ms 5.1736 ms]
                        change: [-96.568% -96.560% -96.553%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

     Running benches/program_execution.rs (target/release/deps/program_execution-2647b9ea5dc36e7d)
Gnuplot not found, using plotters backend
program_execution/sha256
                        time:   [15.236 ms 15.283 ms 15.334 ms]
                        change: [-3.0510% -2.4781% -1.8854%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

@bitwalker bitwalker added the assembly Related to Miden assembly label Mar 8, 2024
@bitwalker bitwalker self-assigned this Mar 8, 2024
@bitwalker bitwalker force-pushed the bitwalker/assembler-refactor branch 3 times, most recently from b3a3ff5 to 5c8cddf Compare March 8, 2024 05:40
@bobbinth bobbinth requested a review from plafer March 8, 2024 05:40
Contributor

@plafer plafer left a comment

This is an awesome change! We get all the benefits of using a parser generator - cleaner code, and more flexible MASM programs (e.g. ordering procedures however we like).

I didn't look in detail in how the semantic analyzer works. Unit/integration tests are green, so I'm happy on that front - I'll deep dive if ever I need to make a change.

Otherwise left a couple of nits. Great job!

assembly/src/assembler/context.rs Outdated Show resolved Hide resolved
assembly/src/assembler/id.rs Show resolved Hide resolved
assembly/src/assembler/mod.rs Show resolved Hide resolved
assembly/src/parser/error.rs Outdated Show resolved Hide resolved
assembly/src/parser/error.rs Outdated Show resolved Hide resolved
@bitwalker bitwalker force-pushed the bitwalker/assembler-refactor branch 2 times, most recently from dda2489 to e240e4e Compare March 9, 2024 05:12
@bitwalker bitwalker marked this pull request as ready for review March 9, 2024 07:55
Contributor

@bobbinth bobbinth left a comment

Looks great! Thank you! I've only started my review and looked through the simple changes so far (e.g., reviewed changes in the stdlib, processor, and verifier crates) - but wanted to leave a few comments already (mostly nits so far).

assembly/Cargo.toml Show resolved Hide resolved
assembly/Cargo.toml Outdated Show resolved Hide resolved
test-utils/Cargo.toml Outdated Show resolved Hide resolved
core/src/operations/decorators/advice.rs Outdated Show resolved Hide resolved
stdlib/docs/crypto/hashes/blake3.md Outdated Show resolved Hide resolved
assembly/src/assembler/id.rs Outdated Show resolved Hide resolved
@bitwalker bitwalker force-pushed the bitwalker/assembler-refactor branch from 878571e to c1cea0f Compare March 10, 2024 20:39
let mut expect_cycle = false;
if roots.is_empty() {
expect_cycle = true;
roots.extend(graph.nodes.keys().next());
Contributor

question: could we have more than one cycle? if yes, would it make sense to report all possible cycles?

Contributor Author

It's certainly possible to have multiple cycles simultaneously, depending on what has changed since the last compilation (if anything). However, detecting multiple cycles is not as simple as what is being done here, which relies on the simple property that failing to produce a topological sort means there is at least one cycle in the graph; we don't actually try to work out the actual cycles. To identify them, we'd need to run something like Kosaraju's strongly-connected-components algorithm, which identifies subgraphs that contain at least one cycle, but doesn't enumerate the cycles themselves. Johnson's algorithm for finding all elementary cycles of a graph is also an option, but is somewhat complex to implement.

TL;DR: we should probably hold off on implementing a fancier diagnostic here until we have a sense of how frequently this issue arises, and how often it involves multiple cycles vs. accidentally introducing recursion in a new procedure. But if we feel it is definitely worthwhile to do now, that's fine too.
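To illustrate the property described above - a failed topological sort proves at least one cycle exists, without identifying which - here is a minimal sketch using Kahn's algorithm (the integer node ids and edge lists are hypothetical, not the actual module-graph types; reversing the resulting order gives the callees-before-callers order the assembler wants):

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

// Edges are (caller, callee) pairs. Returns a topological order of the
// call graph, or Err(()) if the graph contains at least one cycle.
fn toposort(edges: &[(u32, u32)]) -> Result<Vec<u32>, ()> {
    let mut succs: BTreeMap<u32, Vec<u32>> = BTreeMap::new();
    let mut indegree: BTreeMap<u32, usize> = BTreeMap::new();
    let mut nodes = BTreeSet::new();
    for &(caller, callee) in edges {
        succs.entry(caller).or_default().push(callee);
        *indegree.entry(callee).or_insert(0) += 1;
        nodes.insert(caller);
        nodes.insert(callee);
    }
    // Start from the roots: nodes with no incoming edges.
    let mut queue: VecDeque<u32> =
        nodes.iter().copied().filter(|n| !indegree.contains_key(n)).collect();
    let mut order = Vec::new();
    while let Some(n) = queue.pop_front() {
        order.push(n);
        for &m in succs.get(&n).into_iter().flatten() {
            let d = indegree.get_mut(&m).unwrap();
            *d -= 1;
            if *d == 0 {
                queue.push_back(m);
            }
        }
    }
    // If some node was never emitted, at least one cycle exists -
    // but this tells us nothing about how many cycles, or where.
    if order.len() == nodes.len() { Ok(order) } else { Err(()) }
}

fn main() {
    // Acyclic call graph: 1 calls 2, 2 calls 3.
    assert!(toposort(&[(1, 2), (2, 3)]).is_ok());
    // 2 and 3 call each other: a cycle is detected, but not located.
    assert!(toposort(&[(1, 2), (2, 3), (3, 2)]).is_err());
}
```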

/// When true, this permits calls to refer to procedures which are not locally available,
/// as long as they are referenced by MAST root, and not by name. As long as the MAST for those
/// roots is present when the code is executed, this works fine. However, if the VM tries to
/// execute a program with such calls, and the MAST is not available, the program will trap.
Contributor

So we have a concept of trap in the VM? I think the execution just fails?

Contributor Author

I assumed we handled it somewhat sanely rather than panicking, hence my description of the behavior as a "trap", which refers to anything that causes execution to halt unexpectedly, but in a controlled manner (i.e. the VM doesn't just segfault or something wild like that). If that's not the case today, it probably should be, since I anticipate this scenario could arise frequently in practice.

Contributor

@hackaugusto hackaugusto Mar 11, 2024

I think we just return an error, i.e. the VM errors

assembly/src/ast/instruction/advice.rs Show resolved Hide resolved
assembly/src/ast/invocation_target.rs Show resolved Hide resolved
assembly/src/ast/module.rs Show resolved Hide resolved
assembly/src/ast/op.rs Show resolved Hide resolved
@@ -481,29 +463,29 @@ fn calculate_clz(span: &mut SpanBuilder) -> Result<Option<CodeBlock>, AssemblyEr
let ops_group_2 = [
Push(Felt::new(u32::MAX as u64 + 1)), // [2^32, pow2(32 - clz), n, clz, ...]

Dup1, Neg, Add, // [2^32 - pow2(32 - clz), pow2(32 - clz), n, clz, ...]
// `2^32 - pow2(32 - clz)` is equal to `clz` leading ones and `32 - clz`
Contributor

@bobbinth can we please add the pre-commit checks to CI? These changes are happening across all our repos. They clutter the git history, add extra work on rebase, and make PRs like this bigger than necessary, because not everyone is running cleanup in their tooling.

Running the pre-commit hooks will fix these issues, since the CI will enforce the author properly cleans their code. (I personally don't care for pre-commit hooks, it is just a tool for this, we can use any other tool that removes whitespace).

Contributor Author

Yeah, there were quite a few places - almost every module I touched, honestly - that had some kind of formatting discrepancy, mostly really small things like this example, but having some kind of CI check would be helpful. I don't think pre-commit hooks are necessary as long as the PR status checks fail due to this stuff (it's more of an impetus for those who don't set up their editors to do this for them, to do so and avoid the headache).

Contributor

I don't think pre-commit hooks are necessary as long as the PR status checks fail due to this stuff

By pre-commit hook I mean the tool. I added the configuration for it a while back. People don't have to install the tool if they don't want to (and we don't even have to use this tool to check the format in CI); it's just an easy way of getting this set up, since the config is already there.

Contributor

I think we should add this to the README - I didn't know about that tool, or that we had that configuration already set up. I'll set it up now.

Contributor Author

Ah gotcha, makes total sense

assembly/src/assembler/id.rs Outdated Show resolved Hide resolved
assembly/src/assembler/mod.rs Outdated Show resolved Hide resolved
assembly/README.md Outdated Show resolved Hide resolved
core/src/utils/mod.rs Outdated Show resolved Hide resolved
Comment on lines +27 to +28
thiserror = { git = "https://github.com/bitwalker/thiserror", branch = "no-std" }
miette = { git = "https://github.com/bitwalker/miette", branch = "no-std" }
Contributor

should we move these forks to the https://github.com/0xPolygonMiden org?

Contributor Author

I want to say yes? I guess the downside is that means all of the Polygon org rules are going to apply, so someone else will have to be involved to sign off on any updates we do to those crates, but they aren't likely to change much in the near term.

assembly/src/parser/error.rs Show resolved Hide resolved
assembly/src/parser/error.rs Show resolved Hide resolved
Comment on lines 41 to 55
"adv.insert_hdword" => Token::AdvInsertHdword,
"adv.insert_hperm" => Token::AdvInsertHperm,
"adv.insert_mem" => Token::AdvInsertMem,
"adv_loadw" => Token::AdvLoadw,
"adv_pipe" => Token::AdvPipe,
"adv_push" => Token::AdvPush,
"adv.push_ext2intt" => Token::AdvPushExt2intt,
"adv.push_mapval" => Token::AdvPushMapval,
"adv.push_mapvaln" => Token::AdvPushMapvaln,
"adv.push_mtnode" => Token::AdvPushMtnode,
"adv.push_sig" => Token::AdvPushSig,
"adv.push_smtpeek" => Token::AdvPushSmtpeek,
"adv.push_smtget" => Token::AdvPushSmtget,
"adv.push_smtset" => Token::AdvPushSmtset,
"adv.push_u64div" => Token::AdvPushU64Div,
Contributor

@bobbinth should we standardize the adv_loadw, adv_pipe, adv_push instructions to match the rest? (Or even better, drop . so that it is used to separate only arguments)

Contributor Author

I had the same thought, I was actually a bit surprised that those couple of instructions didn't follow the same convention, but didn't think it was appropriate to change that as part of this PR.

Contributor

The main reason why these are different is to differentiate between real instructions and decorators. So, anything which is adv.xxx is a decorator (i.e., it does not advance the clock cycle of the VM and does not affect the VM state outside of the advice provider). The adv_xxx are actual instructions which affect the state of the operand stack and/or memory.

Contributor

Is this distinction useful though? And if it is, is a single . the best way to express this?

Contributor

I think it is useful as something like adv_push has a very different semantic from adv.push_u64div - so, it is good to have syntactic differentiation. Whether using . is the best way to do that - I'm not sure. I picked it because conceptually in adv.[action] the action parametrizes what we should do to the advice provider (this is consistent with how we specify parameters for other instructions). But I'm open to other ways of doing this.

Contributor Author

I think my first impression upon seeing it is that I would assume they are both instructions, so we probably should differentiate decorators further somehow, such as prefixing them with @ or something, e.g. @adv.push_u64div visually stands out as something special compared to regular instructions, and also makes it easy to spot them when skimming the code.

If we want to distinguish them somehow, I would suggest a prefix symbol from the set @, !, % or $. I'd be inclined to use @ simply because it is used for attribute/decorator-ish things in other languages, but any of those symbols are free of any ambiguity issues in the current grammar.

Comment on lines +500 to +518
#[inline]
Assert: Instruction = {
"assert" <error_code:MaybeAssertCode> => error_code.map(Instruction::AssertWithError).unwrap_or(Instruction::Assert),
"assertz" <error_code:MaybeAssertCode> => error_code.map(Instruction::AssertzWithError).unwrap_or(Instruction::Assertz),
"assert_eq" <error_code:MaybeAssertCode> => error_code.map(Instruction::AssertEqWithError).unwrap_or(Instruction::AssertEq),
"assert_eqw" <error_code:MaybeAssertCode> => error_code.map(Instruction::AssertEqwWithError).unwrap_or(Instruction::AssertEqw),
"u32assert" <error_code:MaybeAssertCode> => error_code.map(Instruction::U32AssertWithError).unwrap_or(Instruction::U32Assert),
"u32assert2" <error_code:MaybeAssertCode> => error_code.map(Instruction::U32Assert2WithError).unwrap_or(Instruction::U32Assert2),
"u32assertw" <error_code:MaybeAssertCode> => error_code.map(Instruction::U32AssertWWithError).unwrap_or(Instruction::U32AssertW),
}

MaybeAssertCode: Option<Immediate<u32>> = {
"." "err" "=" <code:ImmValue<U32>> => Some(code),
=> None,
}
Contributor

I think we should forbid assert.err=0, because 0 is used as the default for a plain assert. An alternative would be to have the error in the VM be an Option<u32>, so that the default is None instead.

My rationale is that these two error spaces should not collide.

Contributor Author

I'd be fine with that as well, though I think what I really want is to associate something more concrete, like a source span, with an assertion error. Even with the full 32-bit range for error codes, it doesn't really help you find where the error was raised (and that assumes that each possible location an error can occur uses a unique error code), and also doesn't account for all of the macro-like instructions that expand to some set of instructions that contain asserts (and is thus not visible in the original source code, nor do we - currently anyway - assign those asserts distinct codes). It's obviously better than nothing, but when I was debugging one issue last week, I definitely felt like I was looking for a needle in a haystack.

Contributor:

Yeah, totally agree. Ideally we would have some stack tracing too. I think the issue with that is that MAST erases the procedure names? But yeah, I'm 100% with you, would love to have additional information for errors.

Contributor:

One of the main motivations for limiting error codes to 32 bits is to keep the size of programs relatively small which is important for putting programs on chain. Otherwise, with this new structure, nothing really prevents us from using strings or more elaborate identifiers. A possible strategy to adopt is to have more elaborate identifiers but then strip them out for on-chain programs during serialization. But that's more related to MAST serialization than this PR.

Contributor Author:

The SourceSpan type is 32 bits, but in order to be useful, we also need to have the source code (path or actual content), so that we can render an error with a source snippet (see the assembler tests for examples of what these look like if you haven't seen them yet). We don't actually need identifiers (file/line/col) at each site. The source code path (or content) could be provided with the MAST by mapping each MAST root corresponding to a procedure with the source code for that procedure when built with debug info, allowing us to render source code snippets by looking up the source for the current procedure (if available) and then rendering the span using the source span encoded in the instruction.

Of course, the problem occurs when the source code is not available; still, the source span value itself forms a useful and semi-unique error code (i.e. identical spans with different errors are possible, since spans are unique only within a file, but they probably aren't super common).

If we used 64 bits for errors, then 32 bits could be the user-provided code, and 32 bits could be the span. We could even omit the 32 bits used for custom codes when no code is used, and so save space in the MAST. The main thing we'd need to do though is ensure the span is not part of the hash, so as to avoid making two otherwise identical functions have unique hashes. It's not a huge issue if the span ends up in the hash, but it would preclude two functions defined separately from hashing to the same value (unless they just happened to occur in the same place in both files, but that's super unlikely).
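One hypothetical encoding of the 64-bit scheme floated above - high 32 bits for the user-provided code, low 32 bits for the span - purely to make the arithmetic concrete (not an actual Miden format):

```rust
// Hypothetical packing: user error code in the high 32 bits, source span
// value in the low 32 bits.
fn pack(user_code: u32, span: u32) -> u64 {
    ((user_code as u64) << 32) | span as u64
}

// Recover both halves from the packed value.
fn unpack(packed: u64) -> (u32, u32) {
    ((packed >> 32) as u32, packed as u32)
}

fn main() {
    let packed = pack(0xDEAD_BEEF, 0x0000_1234);
    assert_eq!(packed, 0xDEAD_BEEF_0000_1234);
    assert_eq!(unpack(packed), (0xDEAD_BEEF, 0x0000_1234));
}
```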

Resolved (outdated) review threads: assembly/src/parser/grammar.lalrpop (×2)
Contributor @bobbinth left a comment:

Still not a full review - but I'm making steady progress. I've reviewed changes in all crates except for the assembly crate and started to make headway into the assembly crate as well.

I left some comments inline - most of them are pretty minor.

Resolved (outdated) review thread: miden/Cargo.toml
Comment on lines 7 to 11
 pub use assembly::{
-    ast::{ModuleAst, ProgramAst},
-    Assembler, AssemblyError, ParsingError,
+    self,
+    ast::{Module, ModuleKind},
+    diagnostics, Assembler, AssemblyError,
 };
Contributor:

Why export both self and submodules from here? Should we just do:

pub use assembly;

Contributor Author:

That's what I would be inclined to do, but I guess I didn't want to remove some of the "major" types from the crate root, so as to avoid more downstream changes, but honestly it's not a big deal to fix those up, so if you'd prefer to just export the assembly module and drop the top-level re-export of things like Module and Assembler, that's fine by me!

Resolved (outdated) review threads: core/src/utils/mod.rs (×2), core/src/prettier/mod.rs (×2)
@@ -1,6 +1,6 @@

Initializes four memory addresses, provided for storing initial 4x4 blake3
state matrix (i.e. 16 elements, each 32-bit), for computing blake3 2-to-1 hash

Expected stack state:

[state_0_3_addr, state_4_7_addr, state_8_11_addr, state_12_15_addr]

Note, state_`i`_`j`_addr -> absolute address of {state[i], state[i+1], state[i+2], state[i+3]} in memory | j = i+3

Final stack state:

[...]

Initialized stack state is written back to provided memory addresses.

Functionally this routine is equivalent to https://github.com/itzmeanjan/blake3/blob/f07d32e/include/blake3.hpp#!L1709-L1713
Contributor:

I think there is still something off with how we process documentation. This comment belongs to an internal procedure and should not be printed out - but instead, it seems to be printed out as a module-level comment.

Contributor Author:

The only way that can happen (unless there is a bug, which I will check here) is if the doc comment for the local procedure starts on the first line of the module. The only way a doc comment gets associated with the module and not a procedure is if it starts on the first line of the module and there is no separate doc comment for the procedure (or if someone manually sets the docs for the module). I had to do it this way because it is otherwise ambiguous whether the first doc comment in a file is associated with a module or the first documentable item in the module.

Rust distinguishes doc comments for modules vs module-scoped items with different prefixes, i.e. //! vs ///. We may want to consider doing that if there are cases where the ambiguity causes a mixup like this.
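The attachment rule described in the comment above can be sketched as a toy predicate (an assumed simplification of the behavior for illustration, not the parser's actual logic):

```rust
// A doc comment is treated as module-level only if it starts on the first
// line of the module AND the first documentable item has no docs of its own;
// otherwise it attaches to that item.
fn doc_comment_attaches_to_module(starts_on_line: usize, first_item_has_own_docs: bool) -> bool {
    starts_on_line == 1 && !first_item_has_own_docs
}

fn main() {
    assert!(doc_comment_attaches_to_module(1, false));
    assert!(!doc_comment_attaches_to_module(3, false)); // attaches to the item
    assert!(!doc_comment_attaches_to_module(1, true)); // item keeps its own docs
}
```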

Contributor @Fumuran left a comment:

Looks awesome!

Resolved review threads: core/src/utils/mod.rs (outdated), assembly/src/parser/grammar.lalrpop
Contributor @bobbinth left a comment:

I'm still making my way through the parser logic, and I left a few more comments inline.

Also, I pushed a couple of commits which cleaned up comment formatting and added section separators in a few files.

Resolved (outdated) review thread: assembly/src/lib.rs
Comment on lines 33 to 40
pub use self::assembler::{ArtifactKind, Assembler, AssemblyContext};
pub use self::ast::{Module, ProcedureName};
pub use self::errors::AssemblyError;
pub use self::library::{
Library, LibraryError, LibraryNamespace, LibraryPath, MaslLibrary, PathError, Version,
};
pub use self::parser::{SourceLocation, SourceSpan, Span, Spanned};
Contributor:

Why do we need to use self:: here?

Also, do we need to export ArtifactKind publicly? If so, maybe we should come up with a more "self-describing" name?

Contributor Author:

Why do we need to use self:: here?

As mentioned in another related comment, this is actually what our conventions as enforced by rustfmt/rust-analyzer dictate (though the configuration is disabled for some reason in this repo). There are other reasons why self is used, which I mentioned in my other comment, but in general it's good practice to use self like this rather than leaving them unqualified - both as a less ambiguous syntax, and to aid in skimming import lists to determine local vs external imports at a glance.

Also, do we need to export ArtifactKind publicly? If so, maybe we should come up with a more "self-describing" name?

A couple of the lower-level public assembler APIs take this as an argument, so we need to export it. We could probably re-use ModuleKind though, however I wanted to distinguish the two because the artifacts produced by the assembler may be a different set than the types of modules that can be defined. I'm open to alternatives though. This kind of thing is typically called an "artifact" or "object" in compiler/assembler/linker contexts, and is inherently a bit general, but we might be able to be more specific in our case.

Resolved review threads: assembly/build.rs, assembly/src/lib.rs (outdated), assembly/src/parser/mod.rs (one outdated, one current), assembly/src/parser/error.rs (outdated), core/src/operations/decorators/mod.rs (outdated), core/src/prettier/print.rs (outdated)
@bitwalker force-pushed the bitwalker/assembler-refactor branch from bcfe0e9 to c639420 on April 29, 2024 05:39
@bitwalker (Contributor Author):

This is ready now and has been rebased on latest next, all review comments have been addressed, aside from the removal of the Assembler::*_in_context methods and the AssemblyContext, as that is a larger change that feels like it belongs in a separate PR once this is merged.

A number of improvements and other features/changes have been discussed here, but I believe everything we intended to tackle as part of this PR is more or less complete.

Labels: assembly (Related to Miden assembly)
5 participants