docs: draft spec for mast binary format #132
Conversation
Thank you! This looks great as well. I've also just skimmed through the document for now, but left a few comments inline.
Looking great overall! Please check my notes.
```rust
pub struct OpBatch<const BATCH_SIZE: usize = 8> {
    ops: Vec<Operation>,
    groups: [Felt; BATCH_SIZE],
}
```
I cannot figure out the purpose of the `groups` field (especially the use of `Felt`). Could you please clarify this?
The primary reason for this is the way the VM decodes programs. Since `Felt` is the smallest unit the VM understands natively, we encode up to 9 operations into a single `Felt`, and we call these an "operation group". Then, up to 8 groups (i.e., a single batch) can be absorbed by a single permutation of the hash function we use (RPO), and the VM needs to execute additional operations between decoding consecutive batches. At a very high level, when we execute a basic block consisting of 3 batches, the VM actually executes the following operations:
```
SPAN
<operations from batch 0>
RESPAN
<operations from batch 1>
RESPAN
<operations from batch 2>
END
```
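To make the grouping concrete, here is a minimal sketch of the packing (the exact bit layout is an assumption for illustration, not the VM's actual encoding):

```rust
/// Packs up to 9 operations, at 7 bits each, into a single 64-bit value:
/// 9 * 7 = 63 bits, which always fits in a field element.
fn pack_group(opcodes: &[u8]) -> u64 {
    assert!(opcodes.len() <= 9, "an operation group holds at most 9 ops");
    let mut group = 0u64;
    for (i, &op) in opcodes.iter().enumerate() {
        assert!(op < 0x80, "opcodes must fit in 7 bits");
        group |= (op as u64) << (7 * i);
    }
    group
}
```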
Got it. Thanks for the detailed explanation!
I was also confused by this when reviewing the implementation of things in `miden-core`, but after re-reading things several times I was able to arrive at more or less what @bobbinth explained here. As an aside, your explanation here is actually better and more succinct than anything I came across; we might actually want to add it to the code in `miden-core` as a brief summary for future readers!
```rust
/// The offset in the trailing data section of `EncodedMastForest`
/// where the data for this node starts. If this node has no data,
/// then the value will always be zero. Otherwise, the data should
/// be decoded starting at this offset, the specifics of which
/// depend on the node type.
offset: DataOffset,
```
What if the node's data is the "first" in the buffer? It has a zero `offset`, but we treat it as "no data".
What actually determines whether the node has data or not is the `MastNodeType`. So even if `offset` is 0, if the node type has associated data, we will interpret `offset` as a valid offset.

The comment here is referring to the fact that if there is no associated data, the value will always be zero, regardless of the current offset in `data` where the next write will occur. For example, if the current offset in `data` is 128, and we're encoding a `MastNodeInfo` for a node that has no associated data, then we will not store 128 as the offset in the encoded `MastNodeInfo`; we will store 0.
In reality, either choice would be fine, but 0 has the smallest possible encoding, and I didn't want to waste bytes encoding an offset that is never used.
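A minimal sketch of that rule (the helper is hypothetical, just to spell out the logic):

```rust
/// If the node's type carries no associated data, the encoded offset is
/// always 0, regardless of where the next write into `data` would land;
/// the decoder never reads it in that case.
fn encoded_offset(has_data: bool, current_data_offset: u32) -> u32 {
    if has_data {
        current_data_offset // interpreted by the decoder as a valid offset
    } else {
        0 // smallest possible encoding for a value that is never used
    }
}
```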
The only thing to note here is that we want the `MastNodeInfo` type to be fixed-width, to enable indexing. We also want to keep the encoding compact but efficient. As a result, we are making a slight tradeoff with the encoding of this type to make traversal fast, by inlining references to other nodes. This means that padding bytes are introduced for node types which do not have such references, and which cannot take advantage of the padding to store a subset of their data. This is estimated to be a worthwhile tradeoff, but it isn't based on any benchmarks, just gut feeling.
Assuming it goes on-chain, I'd rather optimize it heavily for size. We can always offload slow ("analytical") processing to an indexer, but the on-chain format is almost set in stone and would be a constantly growing "headache". Keep in mind that code size in smart contracts tends to be rather small, so we are talking kilobytes, not megabytes.
My expectation is that we are going to be compressing the serialized binary output of the encoder, and decompressing prior to feeding it into the decoder. Thus making our binary format highly structured is going to make it easier to compress to a very tiny size.
However, I do have two other possible encoding schemes in mind, which are variations of this one that use different techniques for serialization/deserialization with different tradeoffs. I want to evaluate this one first, to see the sizes of the compressed binary output, before evaluating the others: they are more complex and thus produce smaller output, but that may make them less amenable to compression, and slower to encode/decode.
One of the small modifications we could easily make to this format though, is to delta-encode things like indices/offsets, as they are monotonically increasing. This could keep virtually all such indices/offsets as small as 1 byte for any given program. It also doesn't add much in terms of serialization/deserialization overhead.
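As a sketch of the idea (hypothetical helper, assuming offsets are monotonically increasing as stated above):

```rust
/// Delta-encodes a monotonically increasing sequence of offsets. Each
/// absolute offset is replaced with its distance from the previous one,
/// so most entries become small enough to fit in a single VInt64 byte.
fn delta_encode(offsets: &[u64]) -> Vec<u64> {
    let mut prev = 0;
    offsets
        .iter()
        .map(|&off| {
            let delta = off - prev; // safe: offsets never decrease
            prev = off;
            delta
        })
        .collect()
}
```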
I'm not particularly worried about padding bytes though, they essentially disappear under even the most trivial compression schemes. I'm much more interested in how to exploit the structure of MAST programs to make the encoding both small and compressible to the maximum degree possible.
Looks great! I don't think I can add anything to what was written above, but to me it looks awesome.
Force-pushed from `9dd74f4` to `6bbe51a`.
I've updated the draft document with the feedback everyone has provided so far. A few things were more questions/points of clarification, so I've left those comments unresolved for subsequent review. Let me know if you have any further feedback! I should also note that 0xPolygonMiden/miden-vm#1277 makes changes to Miden Assembly that will make transitioning to MAST as the binary format quite trivial (once the actual encoder/decoder is written), since all invocation targets (i.e. the callee of instructions like …
Looks great! Thank you again for such a comprehensive write-up! I left some comments inline - but most are minor (and some may be premature optimizations).

One thing I wanted to mention more specifically (and this is also referenced in one of the comments): I think we actually need 2 variants of the format: "fast" and "compact". The current spec defines the "fast" format - which I think is the right one to start with (the "compact" variant we can define much later).
The reason why I think we need to differentiate between the two explicitly is because they have different trust assumptions and structures. Specifically:
- The "fast" variant is trusted. That is, the deserializer would trust that the serializer encoded everything correctly. The main benefit of this is that we don't need to do any hashing during deserialization (e.g., we trust that digests included with each node are correct).
- The "compact" variant is un-trusted. That is, the deserializer would need to compute hashes of all nodes to verify that they are correct. This introduces significant computational overhead, but it does allow removing the
digest
field fromMastNodeInfo
(excluding theExternal
variant) - which, I expect, would reduce the size of the binary quite a bit (primarily because digests are basically uncompressible data).
The "compact" variant would be the one which would be distributed between parties which don't trust each other. But once I download the "compact" variant from someone, I can convert it to "fast" variant and store it on disc - and so the VM would work primarily (or maybe even exclusively) with the "fast" variant.
As mentioned above, we don't need to worry about the "compact" variant now, with the only exception being potential addition of a binary flag to indicate fast/compact encoding for a given file.
#### Variable-Width Integers

First, let's introduce a type that we will use throughout the rest of this spec, `VInt64`. It represents a variable-width, unsigned, little-endian 64-bit integer. It is encoded as one or more bytes, where each byte has 7 bits to represent the value, with the last bit used to signal when another byte is needed to represent the encoded value. In this case, unless otherwise noted, we treat it as an error if a `VInt64`-encoded integer would not fit in the integral type given.
I think the explanation of the `VInt64` is not exactly correct here (i.e., I don't believe it uses a "continuation" bit in each byte, but rather encodes the total number of bytes using the trailing zeros of the first byte).
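For reference, here's a minimal sketch of that scheme (the precise bit layout is my assumption for illustration, not taken from the actual implementation):

```rust
/// Decodes a prefix-varint: the count of trailing zero bits in the first
/// byte gives the number of additional bytes that follow. Returns the
/// decoded value and the total number of bytes consumed.
fn decode_vint64(bytes: &[u8]) -> Option<(u64, usize)> {
    let first = *bytes.first()?;
    let extra = first.trailing_zeros() as usize; // 0..=8 extra bytes
    if bytes.len() < 1 + extra {
        return None;
    }
    // For 0..=7 extra bytes, the first byte contributes its top 7 - extra
    // bits; a first byte of 0x00 means the full value is in the next 8 bytes.
    let mut value = if extra < 8 { (first as u64) >> (extra + 1) } else { 0 };
    let mut shift = if extra < 8 { 7 - extra } else { 0 };
    for &b in &bytes[1..1 + extra] {
        value |= (b as u64) << shift;
        shift += 8;
    }
    Some((value, 1 + extra))
}
```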
```rust
/// The format version, initially all zeros
///
/// If future modifications are made to this format, the version
/// should be incremented by 1. A version of 0b111 is reserved
/// for future extensions that require extending the version
/// field itself, but should be considered invalid for now.
version: [u8; 3],
```
We will probably need 2 variations of the format - one for "fast" encoding/decoding and another for "compact" encoding (see overall review comment). So, I'm wondering if, in addition to the `version` field, we also need some binary flag which indicates whether this file is "fast" or "compact".
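For illustration, such a flag could be as simple as this (names and bit assignment are hypothetical):

```rust
/// Hypothetical flags byte stored alongside `version` in the header.
pub struct FormatFlags(u8);

impl FormatFlags {
    /// Bit 0 distinguishes the two variants discussed in the review:
    /// 0 = "fast" (trusted), 1 = "compact" (untrusted).
    const COMPACT: u8 = 0b0000_0001;

    pub fn is_compact(&self) -> bool {
        self.0 & Self::COMPACT != 0
    }
}
```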
One other question re `version`: would this implicitly encode the VM version as well? Specifically, let's say we modify the instruction set in some way - would the format version be incremented? If not, would it make sense to add something like a `miden_version` field here (instead of it being in the package definition)?
```rust
/// There is one byte for every 8 nodes in the set, with the node
/// count rounded up to the nearest multiple of 8.
roots: [u8; (self.node_count / 8) + ((self.node_count % 8 > 0) as usize)],
```
I wonder if using 1 bit per node to indicate that a node is a root is a premature optimization at this point. We are basically assuming that the proportion of roots to nodes is going to be relatively high. But if, for example, we have 1000 nodes and only a dozen roots, recording them as a list of node indexes is going to be about 5x more compact (125 bytes in the case of 1 bit per node vs. ~25 bytes in the case of recording indexes for a dozen nodes).
Overall, I'm not opposed to this approach, but I do wonder if we should see some evidence first that this is indeed an improvement over the simpler approach.
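To make the arithmetic above concrete (helper names are just for illustration):

```rust
/// 1-bit-per-node bitmap: 1000 nodes -> 125 bytes.
fn bitmap_size(node_count: usize) -> usize {
    (node_count / 8) + ((node_count % 8 > 0) as usize)
}

/// Index list: 12 roots at ~2 bytes per VInt64-encoded index -> ~24 bytes.
fn index_list_size(root_count: usize, bytes_per_index: usize) -> usize {
    root_count * bytes_per_index
}
```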
```rust
#[repr(u8)]
pub enum MastNodeType {
    /// For more efficient decoding, and because of the
    /// frequency with which these node types appear, we
    /// directly represent the child indices for `Join`,
    /// `Split`, and `Loop` inline. This incurs 8 bytes
    /// of padding for other node types, except `Block`
    /// which takes advantage of this to also inline some
    /// of its data
```
Probably a premature optimization on my part, but I wonder if we should always encode `MastNodeType` as 8 bytes (rather than the current 9). Basically, the first 4 bits could be used to encode the variant, and then we could still encode 2 node indexes with 30 bits each. This does cap the maximum number of nodes at 2^30, though.

On the other hand, having `MastNodeType` aligned on 8-byte boundaries may be beneficial.
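A minimal sketch of this packing, under the assumed layout of a 4-bit tag in the top bits followed by two 30-bit indexes:

```rust
const INDEX_BITS: u32 = 30;
const INDEX_MASK: u64 = (1 << INDEX_BITS) - 1;

/// Packs a 4-bit variant tag and two 30-bit node indexes into one u64.
fn pack_node_type(tag: u8, left: u32, right: u32) -> u64 {
    debug_assert!(tag < 16);
    debug_assert!((left as u64) <= INDEX_MASK && (right as u64) <= INDEX_MASK);
    ((tag as u64) << 60) | ((left as u64) << INDEX_BITS) | (right as u64)
}

/// Inverse of `pack_node_type`.
fn unpack_node_type(word: u64) -> (u8, u32, u32) {
    let tag = (word >> 60) as u8;
    let left = ((word >> INDEX_BITS) & INDEX_MASK) as u32;
    let right = (word & INDEX_MASK) as u32;
    (tag, left, right)
}
```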
```rust
    Block {
        /// The maximum number of operations in a block
        /// is always smaller than `u8::MAX`
        len: u8
    },
```
Why is `u8::MAX` the maximum number of operations in a block? I don't think we have such a limit. I think here we could probably use `u32::MAX`.
```rust
    /// A simple operation, with its opcode encoded in a single byte
    Op { opcode: u8 },
    /// An operation with an immediate argument.
    ///
    /// Like `Op`, the opcode takes a single byte, and the immediate
    /// is encoded using a variable-width integer. The immediate must
    /// be converted to/from `Felt` when decoded.
    OpWithImmediate {
        opcode: u8,
        imm: VInt64,
    },
```
I wonder if this distinction is actually needed. By reading the opcode we could tell whether this operation has an immediate or not. Currently, the only operation which has an immediate is the `PUSH` operation.
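A sketch of what that would look like in the decoder (the `PUSH` opcode value is assumed for illustration, and `decode_vint64` refers to the sketch earlier in this thread):

```rust
const PUSH_OPCODE: u8 = 0x0B; // assumed value, for illustration only

fn opcode_has_immediate(opcode: u8) -> bool {
    opcode == PUSH_OPCODE // currently PUSH is the only op with an immediate
}

/// Decodes one operation, returning (opcode, immediate, bytes consumed).
fn decode_op(bytes: &[u8]) -> Option<(u8, Option<u64>, usize)> {
    let opcode = *bytes.first()?;
    if opcode_has_immediate(opcode) {
        let (imm, len) = decode_vint64(&bytes[1..])?;
        Some((opcode, Some(imm), 1 + len))
    } else {
        Some((opcode, None, 1))
    }
}
```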
```rust
/// Unlike most other type definitions here, this one
/// is not encoded using the usual Rust layout rules.
///
/// Specifically, we treat this as a variable-width type,
/// without the padding that would normally be introduced
/// to make smaller variants as large as the largest variant.
///
/// So each encoded operation starts with the tag byte, which
/// indicates which variant we're going to be parsing. We then
/// parse the data for that variant, and immediately look for
/// the next tag byte (if there are more ops to parse).
///
/// It is strictly for purposes of readability.
///
/// NOTE: Decorators are always associated with the next non-decorator
/// operation in the basic block being decoded.
#[repr(u8)]
pub enum EncodedOperation {
```
Maybe the comment explains exactly this, but I think we don't need a special enum to differentiate the different types of operations and decorators. Specifically, we just need 1 byte to encode both the tag and the opcode.

This is possible because operations require only 7 bits to encode. So, we can say that if the first bit of a byte is 0, the remaining 7 bits encode an opcode; if the first bit is 1, the remaining 7 bits encode the tag of a decorator.

This way, the vast majority of `EncodedOperation`s would require 1 byte to encode (rather than 2 bytes).
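A sketch of that single-byte scheme (taking "first bit" to mean the high bit; purely illustrative):

```rust
enum DecodedByte {
    Op(u8),        // 7-bit opcode
    Decorator(u8), // 7-bit decorator tag
}

/// High bit 0 => operation, high bit 1 => decorator; the low 7 bits
/// carry the opcode or decorator tag respectively.
fn classify(byte: u8) -> DecodedByte {
    if byte & 0x80 == 0 {
        DecodedByte::Op(byte & 0x7F)
    } else {
        DecodedByte::Decorator(byte & 0x7F)
    }
}
```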
One other thing I forgot to clarify: would things like source locations and other debug-related data go into the MAST binary, or into the package?
Actually, re-reading #129, I've answered my own question: debug data goes into the package but not into MAST.
This has been implemented in 0xPolygonMiden/miden-vm, more or less, so I'm closing this PR since it was primarily meant for drafting the initial spec.
This PR adds a new document which describes the new in-memory representation we've been discussing recently, as well as provides a draft specification for the MAST binary format. This isn't meant to be the canonical source for the specification or anything; I just found it convenient to draft it here, and figured it would be a good way to get some feedback before opening a more formal proposal in the `miden-vm` repo.

@bobbinth You're going to be the one most interested in what's here. Feel free to skip over the section that details aspects of the in-memory representation. Nothing has really changed there since our discussion over in the `miden-vm` repo, but I tried to write up a bit more of a summary/intro document to make sure I understood all the pieces more or less. The primary thing I'd like to get your opinion on is the specification of the binary format itself. It's not a formal spec, but it is precisely defined.

@greenhat Also interested in any notes you have; a lot of the details here are quite similar to the package format we're also discussing, so it should be pretty familiar already. The two formats are intended to have some synergistic properties, so that's part of the reason they are so similar.

Anyway, once you both feel like this is ready to propose in the `miden-vm` repo, I'll open an issue for it there. Consider this an early draft.