docs: initial package format documentation #129
base: main
Conversation

---

Looks great overall! Please check my notes.
> wasm-tools component wit extracted-package/component.wasm
>
> The contents of `component.wasm` here is not an actual Wasm module, but simply the binary representation

---

I'm a bit confused here. Could you please clarify what you mean by "binary representation of the WIT package"? As per https://github.com/WebAssembly/component-model/blob/main/design/mvp/WIT.md, a WIT package is a WIT file or a folder of WIT files. I'm guessing it is a Wasm component, but you're using the term "WIT component" further in the document, which is not familiar to me.
If it is a compiled Wasm component, I'm not sure we need it in the Miden package. The WIT package should be enough for purpose 1 above, and for 2 we don't need anything for the dependencies besides MAST roots and invocation method, which we want to encode in the WIT package itself.

---

First, it's important to keep in mind that the package format must be:
- Compact
- Able to be created and consumed by hand, and by tooling in various languages (the full set of which is unknown, and which we do not control).
- Provide metadata sufficient to generate language-specific bindings to the interface exposed by that package, without the source code being available.
The binary format for WIT is a Wasm component type definition. There are multiple reasons why this form is preferred to the textual format:
- It is a more stable format (WIT will continue to evolve rapidly for some time, but the binary format is essentially final at this point)
- It is fully validated and processed, e.g. imports have been resolved. This means we can process the contents of a package without needing to do any of that work, which requires additional tooling.
- It is compact
- It is a richer format than WIT (things can be expressed in the binary format that cannot be in the textual format, at least at this point in time)
- It can be used to generate bindings without additional processing (to do so, the WIT must be processed into a form equivalent to what is represented in the binary format)
- The textual WIT can be recovered from the binary format if desired for readability (albeit not necessarily the exact same text, but an accurate representation nonetheless)
Raw textual WIT, in comparison, has basically none of those properties. A WIT "package" isn't really a thing, or more precisely, it is the specification for how WIT should be laid out on disk in a canonical form for processing into the binary format (which is what is actually published to the registry). See also the `wit` CLI, which is a prototype tool for interacting with a Wasm component registry.
> we don't need anything for the dependencies besides MAST roots and invocation method which we want to encode in the WIT package itself.
That information can't be represented in WIT alone. We could maybe encode the MAST roots in the binary format, I'd need to confirm that, but I'm not aware of any means by which we could do so in the textual format of WIT.
For the foreseeable future, interfaces described in WIT will always be `call`ed. It won't be possible to express things like the standard library as a package. Package dependencies will always be true components, i.e. with shared-nothing semantics, and relying on the canonical ABI. For the time being, things like the standard library will have to be handled as native dependencies (e.g. Rust crates).
Relaxing this restriction will require us to introduce some new functionality to the VM (and/or modify existing functionality). One hack/workaround we've discussed is allowing one to express functions which are WIT-compatible, in WIT, but then use a custom bindings generator to emit bindings that operate under the assumption that the caller and callee are using the same shared memory (and which would be lowered as `exec` rather than `call`) - but at best that would have to be extra metadata in the package (probably with an associated feature flag), not something present in the WIT. The problem with that, however, is that it immediately breaks compatibility with existing tooling (for example, generating bindings from the WIT using a tool that is not aware of that custom logic would emit code for the caller that is incompatible with what is expected by the callee).
Until we can land on a solid longer term solution for this, and other related issues, we just simply can't support such packages. I would personally prefer to punt on it, until we revisit some details of how the VM manages contexts and call-like instructions, at which point we can provide ourselves with the primitives necessary to solve this properly, but it will be some time before we can do that. We don't have an urgent need to solve this problem right now anyway, so it isn't something I'm especially concerned about, but it will become more of an issue over time.
However, in addition to knowing MAST roots, we do need other dependency information. We may be able to encode basic information in the component definition (since it permits us a way to express component dependencies with both semantic versions and digests), but it is limited in what can be expressed, and cannot express dependencies on things which are not components, nor can it tell us how to fetch those dependencies. To use it as an example again, the `wit` tool I linked earlier also demonstrates the necessity for this, as it also has a manifest to provide information about what packages to fetch and how, because WIT on its own doesn't provide all of the pieces.
In the format I've described so far, I've actually omitted the metadata for "how" to fetch a dependency (aside from vendored vs not-vendored dependencies), because we don't have any package management infrastructure or tooling defined yet. However, we will (long-term) need a way to express those details - just like Cargo does. There is a strong chance we will base this on what the Wasm component registry spec ends up looking like, but it's not a given. Either way, without that information, there will be no means by which to fetch a dependency graph from a given package (which may be located on disk, downloaded from an HTTP server, cloned from source control, or fetched from a proper registry). For that reason, some degree of dependency metadata is important to represent in the package manifest.
That said, if there is information being duplicated, and we can avoid doing so while keeping frequently-accessed metadata easily accessible, I would like to make sure we make any changes necessary to accomplish that goal.

---

Thank you for such a detailed reply!
> First, it's important to keep in mind that the package format must be:
>
> - Compact
> - Able to be created and consumed by hand, and by tooling in various languages (the full set of which is unknown, and which we do not control).
> - Provide metadata sufficient to generate language-specific bindings to the interface exposed by that package, without the source code being available.
>
> The binary format for WIT is a Wasm component type definition. There are multiple reasons why this form is preferred to the textual format:
>
> - It is a more stable format (WIT will continue to evolve rapidly for some time, but the binary format is essentially final at this point)
> - It is fully validated and processed, e.g. imports have been resolved. This means we can process the contents of a package without needing to do any of that work, which requires additional tooling.
> - It is compact
> - It is a richer format than WIT (things can be expressed in the binary format that cannot be in the textual format, at least at this point in time)
> - It can be used to generate bindings without additional processing (to do so, the WIT must be processed into a form equivalent to what is represented in the binary format)
> - The textual WIT can be recovered from the binary format if desired for readability (albeit not necessarily the exact same text, but an accurate representation nonetheless)
>
> Raw textual WIT, in comparison, has basically none of those properties. A WIT "package" isn't really a thing, or more precisely, it is the specification for how WIT should be laid out on disk in a canonical form for processing into the binary format (which is what is actually published to the registry). See also the `wit` CLI, which is a prototype tool for interacting with a Wasm component registry.
I agree. I'll dig into the `wit` tool and warg packages. I suspect that the WIT binary data I saw is defined at the end of the Rust bindings source file, stored in a custom section of the Wasm component.
> > we don't need anything for the dependencies besides MAST roots and invocation method which we want to encode in the WIT package itself.
>
> That information can't be represented in WIT alone. We could maybe encode the MAST roots in the binary format, I'd need to confirm that, but I'm not aware of any means by which we could do so in the textual format of WIT.
Yes. Although I recall we discussed that we might be able to encode them as attributes in comments, but they would not be in the WIT binary anyway.
> For the foreseeable future, interfaces described in WIT will always be `call`ed. It won't be possible to express things like the standard library as a package. Package dependencies will always be true components, i.e. with shared-nothing semantics, and relying on the canonical ABI. For the time being, things like the standard library will have to be handled as native dependencies (e.g. Rust crates).
Sounds good. I'll remove the invocation method for imports/exports introduced in my first Wasm translation PR.
> [!IMPORTANT]
> The `[rodata]` section is only used for creating a package from a directory.
> It will never be present in a manifest extracted from a package, because the
> rodata will have been emitted in canonical on-disk form in the `rodata` directory.

---

I'm not sure I see the benefit of having the `[rodata]` section. I think we can get away with only specifying data segments on disk; the file can also be crafted manually by hand.

---

It serves an important purpose (primarily crucial for those defining packages by hand). The file on disk must be in canonical form, encoded such that it can be directly copied into Miden's word-addressable memory. It isn't sufficient to, as an example, take some file containing a bunch of string constants and put it in the `rodata` directory. That file must be encoded as some number of words, in the canonical binary representation for field elements. For data intended for a byte-addressable address space, this is done by slicing the data into 4 byte chunks, encoding each chunk as a field element, and then converting the field element into its canonical binary representation.
If you already have your data in canonical form, then yes, you don't need the `[rodata]` section (and in fact, you can omit it in favor of placing the files in the `rodata` directory instead). However, if you are creating a package, and what you have is that file of string constants I mentioned above, you're in trouble - there is no tooling that will do the conversion for you, and doing it yourself is error prone. The `[rodata]` section is specifically aimed at automating this conversion for you using the package tooling. The result is that you can maintain the sources for a package in human-friendly form, but easily produce the binary package with our tools.
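The conversion described above can be sketched in a few lines. Note this is an illustrative sketch only: the Goldilocks modulus and the little-endian `u64` element encoding are assumptions, not the authoritative canonical form.

```python
# Sketch of the rodata conversion: slice bytes into 4-byte chunks, interpret
# each chunk as a field element, and emit the element's canonical binary form.
# The modulus and the u64 little-endian encoding are assumptions for
# illustration, not the defined package format.
import struct

GOLDILOCKS_PRIME = 2**64 - 2**32 + 1  # assumed field modulus

def encode_rodata(data: bytes) -> bytes:
    """Encode arbitrary bytes as a sequence of canonical field elements."""
    out = bytearray()
    for i in range(0, len(data), 4):
        chunk = data[i:i + 4].ljust(4, b"\x00")  # zero-pad the final chunk
        value = int.from_bytes(chunk, "little")  # < 2**32, so always < prime
        out += struct.pack("<Q", value)          # assumed canonical u64 LE form
    return bytes(out)

encoded = encode_rodata(b"hello, miden!")  # 13 bytes -> 4 elements -> 32 bytes
```

Getting a detail like the zero-padding of the trailing chunk wrong by hand is exactly the kind of error the `[rodata]` section is meant to prevent.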
To be clear, the `[rodata]` section is only present when the manifest is being written by hand. When extracting a package to disk, regardless of how that package was defined, the manifest elides that section, and the canonical form is dumped to the `rodata` directory.

---

> It serves an important purpose (primarily crucial for those defining packages by hand). The file on disk must be in canonical form, encoded such that it can be directly copied into Miden's word-addressable memory. It isn't sufficient to, as an example, take some file containing a bunch of string constants and put it in the `rodata` directory. That file must be encoded as some number of words, in the canonical binary representation for field elements. For data intended for a byte-addressable address space, this is done by slicing the data into 4 byte chunks, encoding each chunk as a field element, and then converting the field element into its canonical binary representation.
>
> If you already have your data in canonical form, then yes, you don't need the `[rodata]` section (and in fact, you can omit it in favor of placing the files in the `rodata` directory instead). However, if you are creating a package, and what you have is that file of string constants I mentioned above, you're in trouble - there is no tooling that will do the conversion for you, and doing it yourself is error prone. The `[rodata]` section is specifically aimed at automating this conversion for you using the package tooling.
We could introduce a separate command in our cargo extension or even a new conversion tool for generating the binary files in canonical form.
> To be clear, the `[rodata]` section is only present when the manifest is being written by hand. When extracting a package to disk, regardless of how that package was defined, the manifest elides that section, and the canonical form is dumped to the `rodata` directory.
I think this adds extra complexity for us and our users. I prefer the manifest to be the same before/after processing.

---

> We could introduce a separate command in our cargo extension or even a new conversion tool for generating the binary files in canonical form.
I think the risk here is knowing that you have to do this at all, let alone how, or that we provide a tool to do it. I suspect many users aren't even going to realize/understand that the way they normally think about binary data simply does not translate to Miden's memory model. It's not even feasible for us to warn them about this, because we don't know anything about the data and how it is intended to be used.
> I think this adds extra complexity for us and our users. I prefer the manifest to be the same before/after processing.
The complexity exists one way or the other. Either we push it on to our users, or we provide a solution for it. In our case, we already solve this problem for memory in general in the compiler, so we are best equipped to handle this for them, and to ensure it's done correctly.
The tradeoff of the encoded manifest being different than the hand-written one is just that, a tradeoff. To be clear though, there is no technical reason that we can't extract to a form that is identical to that created by the user, just that it requires storing more metadata in the encoded package to do so, so that we can recover the original file names, and an additional post-processing step would be required to convert the data back to its original form.
The reason I chose not to do that here is twofold: First, it is not useful to the primary consumer of a package, the VM. Second, it is additional data to store for the case in which someone wishes to extract a package into its original source form. That doesn't mean it isn't useful, just that it felt like optimizing for an edge case.
> [!IMPORTANT]
> If there are dependencies specified in the WIT type definition file, then they
> must agree with those specified in the `[dependencies]` section. This will be checked when
> a package is created.

---

There is a `namespace` part in WIT package names (as in `namespace:package@version`) which is not addressed here. We could employ namespaces as well.

---

I was imagining the package `name` field would be the fully-qualified name, e.g. `namespace:package`, but I think you're on to something here, and that it would make sense to have all three as separate fields, i.e. `namespace`, `name`, and `version`.
I'm not yet sure how we want to make use of namespaces, whether we want to reserve some of them, etc., but I think it makes sense to formalize it as part of the fully-qualified package name, and reserve `core`, `std`, `miden`, and `mast` for now. Thoughts?

---

> I was imagining the package `name` field would be the fully-qualified name, e.g. `namespace:package`, but I think you're on to something here, and that it would make sense to have all three as separate fields, i.e. `namespace`, `name`, and `version`.
Actually, I think the `namespace:package` format for our `name` field should be enough. This approach is used in `cargo-component` for WIT dependencies.
> I'm not yet sure how we want to make use of namespaces, whether we want to reserve some of them, etc., but I think it makes sense to formalize it as part of the fully-qualified package name, and reserve `core`, `std`, `miden`, and `mast` for now. Thoughts?
I agree.
segment_data: [SegmentData; num_segments],
dependency_infos: [DependencyData; num_dependencies],
wit: Component,
mast: MastForest,

---

Should we consider compressing the binary data for `Component` and `MastForest`? It seems like the data savings might be noticeable. We could make the compression optional and hide it behind a flag(s) in the `features` bitflags.

---

Both of those structures have compact binary representations, so they should already be quite compact. That said, my intention is that the package format is purely a container, so compression can be applied to the container as a whole, using whatever compression algorithm is best suited to the situation.
This would be akin to tarballs, which are uncompressed by default, but can be compressed using a variety of different schemes (bzip2, gunzip, lzma, etc.), yet understood by a single tool (i.e. `tar`) simply using the file extension as a signal for which compression scheme was used. We don't necessarily need to go that far ourselves, but I think the idea would be to separate compression from the container format itself, so that a choice can be made for whether to compress, and how to do so, on a case-by-case basis.
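That separation can be sketched by treating the package bytes as opaque and compressing (or not) entirely outside the container format. The `MASP` magic bytes and the idea of a `.gz` suffix signaling the scheme are illustrative assumptions, not defined behavior.

```python
# Sketch of compression applied outside the container, as with tarballs:
# the compressor never needs to understand the package format, and a file
# extension (e.g. ".masp.gz") could signal the scheme. Names are illustrative.
import gzip

package_bytes = b"MASP" + bytes(1000)  # stand-in for an encoded Miden package

compressed = gzip.compress(package_bytes)
restored = gzip.decompress(compressed)
assert restored == package_bytes  # the container round-trips unchanged
```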

---

I think this is somewhat related to the comment I made about off-chain vs. on-chain formats. For the off-chain format, a general compression algorithm would probably work fine. But for the on-chain format (when we don't really care much about deserialization speed), we should be able to do much better than a generic algorithm would, because we can omit serialization of all intermediate hashes.

---

If it goes on-chain, we should squeeze every byte we can.

---

@bobbinth Can you elaborate on how we can omit serialization of the full MAST (at least the parts that are "new", i.e. do not reference externally-defined code)? My understanding is that having the MAST root is not equivalent to having the actual MAST, just that all you need is the MAST root to validate the entire tree. The package format is intended to solve the question of obtaining the tree in the first place.
That said, I think we can certainly use our own homegrown schemes for encoding to binary, but it would be hard to beat off-the-shelf schemes for packages, since the package itself contains a variety of different bits of data, some of which we don't control (i.e. rodata segments).

---

I wrote a part of the answer in #132 (review) - but to summarize: if our goal is to compress the binary as much as possible, there are a couple of things we can do differently:
- We can omit serializing all digests and instead serialize only basic blocks (i.e., span nodes) and some minimal metadata for all other nodes (i.e., on the order of a few bits per node). This would effectively reduce the size of the binary by close to $32 \cdot n$ bytes, where $n$ is the number of nodes in the MAST. It will also make deserialization much more computationally intensive - so, there are tradeoffs.
- We can make the format more "serial" - i.e., get rid of the ability to do arbitrary lookups into the serialized MAST. This should allow shaving off some bytes too, but the impact is more difficult to estimate.
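The first idea can be illustrated with a toy Merkle-style tree: serialize only the leaf payloads and recompute every internal digest during deserialization. This is only a sketch; `blake2b` stands in for the VM's actual hash function, and real MAST nodes are not simple binary-tree leaves.

```python
# Toy illustration of digest elision: store only leaf payloads (analogous to
# basic blocks) and recompute internal digests on load. blake2b is a stand-in
# for the hash the VM actually uses.
import hashlib

def digest(data: bytes) -> bytes:
    return hashlib.blake2b(data, digest_size=32).digest()

def root_from_leaves(leaves):
    """Recompute a binary tree of digests from leaf payloads alone."""
    level = [digest(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Four leaves stand in for basic blocks; the three internal 32-byte digests
# are never serialized (roughly 32 bytes saved per omitted node), at the cost
# of rehashing the whole structure on deserialization.
leaves = [b"block0", b"block1", b"block2", b"block3"]
root = root_from_leaves(leaves)
```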

---

Thank you! This looks great! So far, I've only skimmed through the document - but wanted to leave a few comments/questions already (it is also possible that the answers are already there, and I just missed them).
One general thing I wonder is whether we actually may need two separate formats:
- Off-chain format - this format would be meant to be consumed by the compiler and/or VM.
- On-chain format - this format would contain the minimum amount of data needed for the VM to execute the program.
This document would fall more into the "off-chain" category.
* A manifest that provides top-level metadata about the package
* The Merkelized Abstract Syntax Tree (MAST) which was produced by compiling the original
  source code. The MAST is what will be directly executed by the Miden VM at runtime.
* A WebAssembly Interface Types (WIT) component definition, which is used to describe
  the interfaces exported from the package, their type signatures, and other useful metadata
  about the structure of the package. See the [Type Descriptor](#type-descriptor)
  section for more details.
* Zero or more read-only data segments. See the [Read-Only Data Segments](#read-only-data-segments)
* One or more optional items that can be used with the package for development
  and debugging, the structure of which depends on the specific item:
  * Debug info
  * Documentation

---

From the standpoint of Miden VM, only the MAST and rodata sections (and maybe the manifest) are required to execute a program. Should WIT definitions be optional as well? I understand that without WIT definitions, the package is probably useless for the compiler, but the VM would just discard it even when present.

---

So that's true today, but in the future, particularly if we develop some runtime aspects of the Component Model, the definition of a component is essential for that.
More broadly though, the component definition (what you are referring to here as WIT, which is a bit of a misnomer, WIT is primarily the textual format for defining WebAssembly components) tells us what components a package provides, and how to instantiate them. For now, we don't have any runtime support for components, so this largely just tells us what the APIs are of the package in terms that enable interop with that package - but this will change as we support more and richer functionality.
In other words, I expect that eventually the VM will use a component definition to inform/manage certain aspects of instantiation/initialization of the components described by a package, even if that is not the case today.

---

A couple of general thoughts here:
First, I would rather define the format for how things work today (or will work in the very near future). If/when we update the VM to work differently, we can always update the format to match the new behavior.
Second, I think we have 3 separate use cases for the package format each of them with slightly different needs. The use cases are:
- The compiler - to use packages as dependencies in other packages (i.e., use account package when writing note scripts).
- The VM - to execute a program or to supply enough info to the VM to execute a program.
- The rollup - to build/validate rollup objects (e.g., notes/accounts).
In the context of the rollup, we have only 2 types of components: accounts and notes. For account components, the needs of each use case are roughly as follows (for brevity, I will not discuss note components here):
- The compiler needs to do 2 things:
  a. Output an account package (e.g., compiled from Rust) which will then be used by the VM/rollup. This needs to include the full MAST describing the account code, but doesn't really need higher-level metadata (e.g., WIT definitions).
  b. Use account packages as dependencies when writing note scripts. This requires full WIT definitions, but technically, does not need the full MAST of account code (we only need to know MAST roots of public account procedures).
- The VM needs to load the account component into its "object store" so that it can correctly resolve/execute calls to account procedures when executing note scripts. This includes loading both MAST data as well as advice data (e.g., to handle rodata).
  a. In debug mode, the VM would also need additional data such as procedure names, source locations, etc.
- The rollup needs to do 2 things:
  a. Compute the commitment to account code. This is computed by taking all public procedures of the account, putting their MAST roots into a Merkle tree, and using the root of the tree as the commitment.
  b. Be able to serialize the account component as compactly as possible (while still supporting point (a) above). This, for example, means stripping out full WIT definitions, package name and version, and maybe some other things.
If we want to use the same format to address all of the above goals, we either need to make more things optional and/or come up with some conventions.
For example, as mentioned above, for rollup purposes, we don't really need to have package names. So, we have 2 options:
- We can make package names optional.
- In the context of accounts, define the package name as the commitment to account code (using the Merkle tree as I described above). This way, the package name is implicit given the underlying code.
Alternatively, we could use different formats for different purposes. For example:
- The VM works only with MASTs according to the format described in #132 (in this case, rodata would need to move into the MAST definition, which may not be a bad idea in and of itself).
- The compiler works with the full package which includes package name, MAST, WIT definitions, debug info etc. (full component could also be loaded into the VM, in which case it would be able to provide richer diagnostics).
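The commitment computation described in (a), and the option of using it as an implicit package name, can be sketched roughly as follows. This is illustrative only: `blake2b` stands in for whatever hash the rollup actually uses, and the odd-level padding rule is an assumption.

```python
# Sketch of the account code commitment: Merkle-ize the MAST roots of the
# account's public procedures and take the root as the commitment. The hash
# (blake2b) and the zero-digest padding for odd levels are assumptions for
# illustration, not the rollup's actual scheme.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.blake2b(data, digest_size=32).digest()

def code_commitment(procedure_roots):
    level = list(procedure_roots)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(bytes(32))  # assumed padding rule for odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical procedure names, hashed to stand in for real MAST roots.
roots = [h(b"get_balance"), h(b"transfer"), h(b"auth")]
commitment = code_commitment(roots)
# Under option 2 above, the package "name" could be implicit, e.g. its hex form.
implicit_name = commitment.hex()
```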
|- lib/
   |- 0x00000000.mast
   |- 0x00000001.mast

---

Does `lib` basically contain a `MastForest`? That is, is each `.mast` file a MAST where none of the nodes (except for the root) are referenced from any other MAST trees?

---

Not exactly. Each `.mast` file corresponds to a MAST root, but that root might be referenced from other roots in the same package. The purpose of breaking them up this way is to:
1. Facilitate installation in the object store, de-duplicated by MAST root. When "installing" a package, the contents of `lib` are more or less inserted directly into the object store, except the parts which are already present in the store. For example, two helper functions, whether in the same package/project or not, which happen to compile to the same MAST root will only appear once in the object store (and the package, if applicable).
2. Allow loading just the parts of a MAST forest which are needed when the VM is loading code for a MAST root that is not in its procedure cache yet. The VM can directly load the MAST for that root from the object store this way. If the MAST references other MAST roots, those can either be loaded optimistically into the cache, or deferred until the code is actually being executed. The loader does not need to load/process the entire MAST forest of a package and then discard what is not needed.

In the binary form, it's all one big MAST forest; it's simply when extracted to disk that we break it up this way. Since a `MastForest` can represent both a single tree and a forest, the code for loading MAST, executing it, and then fetching the parts that aren't present in that `MastForest` is identical whether you loaded an entire package worth of code, or just code for a single procedure.
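The de-duplicated installation described in (1) behaves like a content-addressed store keyed by MAST root. A minimal sketch, where the class, its methods, and the `blake2b` keying are all illustrative rather than the VM's actual API:

```python
# Toy content-addressed "object store": each MAST artifact is keyed by its
# root digest, so identical procedures from different packages are stored
# exactly once. blake2b and all names here are illustrative stand-ins.
import hashlib

class ObjectStore:
    def __init__(self):
        self._objects = {}  # root digest -> serialized MAST bytes

    def install(self, mast_bytes: bytes):
        """Insert MAST keyed by root; returns (root, newly_inserted)."""
        root = hashlib.blake2b(mast_bytes, digest_size=32).digest()
        if root in self._objects:
            return root, False           # already present: nothing to write
        self._objects[root] = mast_bytes
        return root, True

store = ObjectStore()
_, fresh = store.install(b"helper procedure mast")
_, dup = store.install(b"helper procedure mast")  # same root, another package
```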

---

One thing that is still not clear to me is the relationship between the `lib` directory and the format described in #132.
It seems like the serialized version of `lib` would correspond to `EncodedMastForest`. But how do we break up a single `EncodedMastForest` into multiple `.mast` files (each of which, I'm assuming, is also serialized as an `EncodedMastForest`)?
```toml
[package]
name = "my-package"
version = "0.1.0"
miden-version = "0.8.0" # Optional
```
Why not make `miden-version` required?
We certainly could - I was thinking of it a bit like `rust-version`, where you can signal that a given package cannot be compiled/used by versions of the tooling earlier than a given version due to the use of some specific feature or implementation detail. That won't be the case for all packages, and it doesn't actually matter if the field is missing or incorrect, since ultimately what will happen is that the code will fail at runtime rather than us being able to raise an error at compile-time.

So I think it largely makes more sense for it to be optional, but I could see a case being made for it to be required.
As mentioned in one of the comments above, maybe this should go into the `.mast` file format.
```toml
[lib]
start = "0x..." # Optional
```
A couple of comments here:

Assuming that the `lib` directory contains a `MastForest`, would we not also need to define which of the MAST roots define the public interface for the library? I'm specifically thinking about account code in the context of the Miden rollup: we could have a `MastForest` with many trees (some representing commonly used procedures), but only some of them would be exported publicly (i.e., be a part of the account interface).

Also, I wonder if it would make sense to commit to the whole MAST forest somehow - similar to how `entrypoint` for executable programs commits to everything that can be executed by the program. The way we do it for accounts is by building a Merkle tree from the roots of all public procedures. The root of this Merkle tree then commits to the entire module.

Lastly, is `start` more for the future? Right now, Miden VM would not be able to process a package with a `start` program. Unless we make the assumption that the `start` program needs to be executed before any public procedure - but I'm not sure if that is a good assumption to make in the VM.
> Assuming that the lib directory contains a MastForest, would we not also need to define which of the MAST roots define the public interface for the library?

I alluded to this in my reply to your comment about the WIT/component definition - but this is exactly what the component definition tells us. That's not to say that we can't provide this information another way, but we essentially kill two birds with one stone.

> Also, I wonder if it would make sense to commit to the whole MAST forest somehow - similar to how entrypoint for executable programs commits to everything that can be executed by the program.

I don't think committing to the forest is necessary, since the MAST roots already provide that at a more granular level (and you likely only care about the parts of the package you actually use in practice). However, I do think there is benefit in committing to the package contents (or signing them and validating the signature), in order to ensure that when you fetch a package from some remote server (as an example), you are getting the exact same package you requested, and not one that has been modified in some subtle way. Maybe that's what you were getting at though.

> Lastly, is start more for the future? Right now, Miden VM would not be able to process a package with the start program.

There are a few things here:

1. The `start` function is exactly that - a procedure which is executed before any other code in the same component, and is responsible for one-time initialization for a given instance of the component (i.e. it is run in each new instance of the component). This function does not have access to any other code in the component while it is being executed. The compiler will enforce this in packages it produces, but for now the VM doesn't need to care about this.
2. This is primarily useful for the compiler at this point in time, when compiling an executable, so that it can inject calls to the `start` function of any packages it depends on in the compiler-generated setup code in the entrypoint. It also, for the time being, will do this in the prologue of procedures which are `call`-able (same as with rodata initialization). Ideally though, this will be an aspect of component initialization that will be handled by the VM when it creates a new instance of a component, but obviously we don't have that functionality yet.
3. In the near term, I think a good approach to handling this would be for us to use an undocumented feature flag in executable packages to indicate that an executable was produced by our compiler, and then raise an error when any package with the `start` field set is loaded/used while executing a program without that feature. That feels a bit hacky, but it allows us to tell users when they accidentally use a package in a way that violates its expectations. Ultimately, we would be able to just support these packages when we no longer need to rely on the compiler to handle this, so this would be trivially backwards-compatible.

For now, `start` is primarily required to support rodata initialization in libraries compiled from Rust (as well as any Rust code that causes the `start` function to be emitted in Wasm). When we are not producing an executable, we have nowhere to place this kind of code, and no way other than this to signal to other tools that such initialization is needed/required, and how to do so. The `start` field gives us this, and on the VM side, if we choose to do so, provides us a way to recognize when a given package cannot be supported/executed correctly (with something akin to what I described in point 3 above in place).
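The "run `start` once per instance, before any export" contract described above can be sketched as follows. This is a hypothetical illustration - `Instance`, its `memory` field, and the rodata payload are stand-ins, not the real VM or compiler types:

```rust
/// Hypothetical component instance: `start` must run exactly once,
/// before any exported procedure of this instance executes.
struct Instance {
    initialized: bool,
    memory: Vec<u8>, // stand-in for linear memory holding rodata
}

impl Instance {
    fn new() -> Self {
        Self { initialized: false, memory: vec![0; 16] }
    }

    /// Stand-in for the component's `start` procedure, here performing
    /// rodata initialization (writing constants into linear memory).
    fn start(&mut self) {
        self.memory[0..4].copy_from_slice(&[0xDE, 0xAD, 0xBE, 0xEF]);
    }

    /// Every exported entry point ensures `start` has run first - this is
    /// the role of the compiler-injected prologue described above.
    fn call_export(&mut self) -> &[u8] {
        if !self.initialized {
            self.start();
            self.initialized = true;
        }
        &self.memory[0..4]
    }
}

fn main() {
    let mut inst = Instance::new();
    // The rodata is visible on the first exported call, because `start`
    // was implicitly executed beforehand.
    assert_eq!(inst.call_export(), &[0xDE, 0xAD, 0xBE, 0xEF]);
    println!("ok");
}
```

If the VM eventually owns component instantiation, the `initialized` check moves out of every export prologue and into instance creation, which is the longer-term design mentioned above.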
```toml
[rodata]
# Write the contents of `assets/foo.json` into linear memory at address 0x0,
0x0 = { path = "assets/foo.json", encoding = "bytes" }
0xDEADBEEF = { content = "0x...", encoding = "words" }
```
You and @greenhat already kind of discussed this, but I also wonder if we can get away with encoding all the relevant data in the filenames while retaining the functionality that you described. For example, the `rodata` directory could look like this:

```
|- rodata/
   |- 0x0.json
   |- 0xDEADBEEF.bin
```

Here, the `.bin` extension would signify that the data is in the canonical form, and `.json` would signify that it is encoded as bytes.
I guess what I had in mind here was that a hand-maintained package could use a human-friendly layout and filenames, and yet be easily encoded to the canonical form, while retaining the ability to define data in both byte-addressable and word-addressable forms. Separating the metadata and the content is also more explicit, so we avoid any ambiguities.

To clarify what I mean, take for example a file named `0x0.json` - it doesn't tell you what is in that file, just the fact that it is JSON (and JSON has no support for comments, so you can't put a comment inside the file to describe what it is, though that's a detail specific to JSON in this case). It's also non-obvious that the filename is the address at which that data will be written. Further, `.bin` is not an uncommon file extension for binary data, so it could very well be that someone has a `.bin` file they want to store in `rodata`, but doesn't realize that this will unintentionally be treated as already in canonical form. Obviously we could use a different file extension, but the general point I'm trying to make is that there is a lot of implicit magic happening in that scenario that the manifest form makes explicit and non-magical.

That's not to say that we couldn't do it that way, but I do think there is value in being able to read the manifest and immediately understand what the contents of `rodata` are, and how they will be processed by the package tooling.

Since both you and @greenhat had similar questions, perhaps that is evidence that it is in fact useful/important to make it possible to extract a package in the form it was originally defined, i.e. encoding a package and then extracting it with an option like `--no-rodata-canonicalization` would give you back the exact same contents on disk as were used to create the package. I had specified in my initial draft that we would always encode the canonicalized form, since that's all the VM needs, and so we would drop the `[rodata]` metadata we don't need in the process, like the filename and encoding type, but there is no reason why we can't support both. Doing so requires preserving that metadata in the package so that we can recover the original filenames and contents, but there are no technical limitations.
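As a rough sketch of what canonicalizing an `encoding = "bytes"` rodata entry might look like: byte content is chunked and zero-padded into fixed-size words. Note that the word size and layout here are illustrative assumptions only - the real canonical word format is defined by the VM, not by this sketch:

```rust
/// Pack raw bytes into fixed-size "words", zero-padding the final partial
/// word. Illustrative only; the real canonical layout is VM-defined.
fn canonicalize_bytes(data: &[u8], word_size: usize) -> Vec<Vec<u8>> {
    data.chunks(word_size)
        .map(|chunk| {
            let mut word = chunk.to_vec();
            word.resize(word_size, 0); // zero-pad the final partial word
            word
        })
        .collect()
}

fn main() {
    // Five bytes packed into hypothetical 4-byte words:
    let words = canonicalize_bytes(&[1, 2, 3, 4, 5], 4);
    assert_eq!(words, vec![vec![1, 2, 3, 4], vec![5, 0, 0, 0]]);
    println!("ok");
}
```

Supporting `--no-rodata-canonicalization` would then amount to keeping the original `path`/`content` and `encoding` metadata alongside the packed form, so the transformation can be reversed on extraction.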
As mentioned in some previous comments, I'm actually wondering if we should put rodata (or more specifically, the canonicalized form of rodata) into the MAST. There are two arguments for this:

- If execution of a program requires rodata, providing just the MAST is not enough to execute the program, and so `.mast` files are not "self-sufficient". So, if we do want to make them self-sufficient (and this is something to discuss), then rodata should go there.
- Rodata is actually a special case of more general "advice data" which we may want to attach to a given MAST. This could be a way to specify that before we can execute this MAST, we need to ensure that the specified data is loaded into the advice provider.
#### Dependencies

The final section of the manifest is `[dependencies]`. This section specifies what dependencies this package has, and optionally, how to fetch them.
When we talk about dependencies, do we mean only other package dependencies? If so, I wonder if this would be sufficient.

For example, let's say that from my program I call the `sha256` procedure in the standard library. MAST for this procedure would not be in the `MastForest` of the package (in the MAST forest, we'd have a leaf representing an "external reference" with the MAST root of the `sha256` procedure). It may be nice to have the MAST root of the `sha256` procedure listed as a dependency. We can, of course, get this info by scanning the MAST forest and looking at all external reference nodes - but that's probably less convenient than having all of them listed explicitly.
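The scan described here is straightforward to sketch. The `Node` enum below is a hypothetical stand-in for the real MAST node types - the point is just that the implicit dependency set falls out of collecting the external-reference roots:

```rust
use std::collections::BTreeSet;

/// Hypothetical MAST node shape: a forest is a list of nodes, some of which
/// are "external references" to roots living in another package (e.g. stdlib).
enum Node {
    Local { id: u32 },
    External { root: [u8; 32] },
}

/// Collect the set of external MAST roots referenced by a forest - the
/// implicit dependency list that could instead be listed in the manifest.
fn external_roots(forest: &[Node]) -> BTreeSet<[u8; 32]> {
    forest
        .iter()
        .filter_map(|n| match n {
            Node::External { root } => Some(*root),
            Node::Local { .. } => None,
        })
        .collect()
}

fn main() {
    let forest = vec![
        Node::Local { id: 0 },
        Node::External { root: [7u8; 32] }, // e.g. a stdlib hash procedure
        Node::External { root: [7u8; 32] }, // duplicate refs collapse to one
    ];
    assert_eq!(external_roots(&forest).len(), 1);
    println!("ok");
}
```

Listing these roots explicitly in `[dependencies]` would let tooling validate them without deserializing the whole forest, which seems to be the convenience being argued for here.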
The standard library would in fact be a dependency of sorts for exactly this reason; I had expected that we would distribute it as a package just like any other library. Same with kernels - they can all be packages. The VM would essentially be shipped without any supporting code at all (i.e. no standard library, no kernel), as that would all be installed on an as-needed basis (and would be essential for supporting multiple versions of the standard library at the same time on the same VM).
As discussed offline recently, some libraries may need to have a special standing (at least in the context of the rollup). This is related to "partially-vendored" packages. For example, initially, we'll probably need to vendor all dependencies which are not a part of the Miden standard library or the Miden rollup library. But there would be no need to vendor these two libraries, as we can assume all users of the rollup will have access to all versions of them.

In the future, we may have an on-chain registry of packages, and in such a case, we can make packages progressively less and less vendored.
```toml
[dependencies]
foo = { version = "0.1.0", digest = "..." }
```
I wonder if specifying both the version and the digest is somewhat redundant - maybe it should be either one or the other? Basically, if we specify the digest, we don't really care about which version it is (or maybe we do?).
I don't think I was expecting both to be present at the same time, but I don't see any reason why we couldn't support looking up a package by digest alone.

I was initially expecting that a `version` would be used without a `digest`, and is equivalent to stipulating "as long as you have a package that matches this version requirement, I don't care what the digest is". However, after doing the assembler refactoring, I don't think it will ever be the case that we will be referencing code by anything other than MAST digest, so `version` is effectively meaningless (it is perhaps useful as a way of answering the question "what semantic version does this digest correspond to?", but that's about it, and could be provided via other means).

I'll mull this over a bit just to try and confirm that I'm not forgetting some detail that I had in mind when I drafted this, but I think we may be able to remove `version` from the package manifest.
This adds a new document to the appendix which describes my initial spec for the package format. I expect there will be some iteration on this, and am trying something a little different here by doing this proposal via our documentation, rather than via GitHub Issue or Discussion, the goal being to have our docs be an exact reflection of where we land on this, rather than something we write later on.
I'd suggest either building the docs locally to read it, or use GitHub's markdown previewer, it'll make reading it a lot more pleasant.
@bobbinth I added you as a reviewer because I want your feedback, and you'll need to sign off on this anyway. I should also note that this spec assumes the binary format for MAST is finalized, but I haven't yet submitted my proposal for its design. That said, its design does not impose any meaningful constraints on what I've laid out here.
The signature scheme I describe in this document is less for security than it is for authenticating that the package is an exact copy of what was packaged, and that it has not been tampered with. If there are more compact schemes with the same properties, I'm open to suggestions, but this is one I'm already familiar with, which is why I chose it.
The overall encoding is designed to balance speed, compactness, and arbitrary access to its contents. It should be fast to authenticate, validate, and query common properties of the package, to allow packages to be encoded in this form at rest without sacrificing performance.
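The verify-before-use shape of the authentication step described above can be sketched as follows. To be clear, `DefaultHasher` is NOT a cryptographic hash and is used here only to keep the example self-contained - the actual format would use a real signature scheme as discussed, and the package bytes are made up:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Placeholder digest over raw package bytes. Illustrative only:
/// `DefaultHasher` is not cryptographically secure.
fn digest(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

/// Recompute the digest of the fetched bytes and compare it against the
/// digest that shipped with (or was requested for) the package.
fn verify(package: &[u8], expected: u64) -> bool {
    digest(package) == expected
}

fn main() {
    let package = b"hypothetical package bytes".to_vec();
    let expected = digest(&package);
    assert!(verify(&package, expected));

    let mut tampered = package.clone();
    tampered[0] ^= 0xFF; // any modification changes the digest
    assert!(!verify(&tampered, expected));
    println!("ok");
}
```

The important property is that verification needs only a single pass over the raw bytes, so it can be done before any parsing, which fits the stated goal of fast authentication of packages at rest.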