Update SSZ spec to use SOS style offset based layout. #787
Conversation
I'm inclined to favor this. BTW note that the ratio of added to removed lines in this PR overstates the complexity increase (i.e. it looks bad but actually the complexity increase is quite slight), because most of the PR just spends the time defining fixed size and variable size (whereas the existing spec simply gives handwavy definitions) and then providing an English translation of the code.
I've rebased this to include the
specs/simple-serialize.md (Outdated)

```
@@ -33,6 +33,8 @@ This is a **work in progress** describing typing, serialization and Merkleizatio

### Composite types

Composite types are limited to a maximum of `2**32 - 1` elements.
```
For Merkleisation purposes this is unnecessarily strict. We want this limit for lists (because of the length mixing), but not for containers and vectors. One example use case is that we want to represent the sparse Merkle trees in shards as vectors of length 2^256.
I feel like I had a reason for this but I'm failing to remember it. Removing unless I remember in the next few minutes.
Btw, I think the 3 binary sections can be replaced with two. This was pointed out by @Arachnid too on Twitter. I'll need to dig into it again a bit to figure out exactly how that was; I don't remember.
@karalabe I've been thinking this over too, though I can come up with reasons for both. I'm curious to hear what others think.
I've updated the written spec to only use two segments, with the first section being a combination of the serialized fixed-size values and the offsets.
@jannikluhn would you mind reviewing this? I'd especially like help making the implementation example as easy to read and understand as possible.
specs/simple-serialize.md (Outdated)

```python
assert not any(part is None for part in section_1_parts)
section_1 = ''.join(section_1_parts)

return ''.join([section_1, section_2])
```
Looks like we've got two options for the offset: It could be relative to the beginning of the whole string (as it is right now) or relative to the beginning of the second section. In my opinion, the first option is slightly more elegant (e.g. an offset of zero is meaningful), but the second slightly more practical (a minimal function that extracts a dynamic element doesn't have to know the size of the first section, even though it's a constant for a given type). I don't really have a preference either way.
I think that it MUST be relative to the beginning of the whole string. Lists of dynamic-size elements are why. When you have a list of dynamic-size elements and the serialized value is not the empty byte string, you only know one thing ahead of time.
Beyond that, you know nothing about how many elements there are or where the offsets end and the serialized elements start. Once you read the first offset, you can know two things.
You can only know this information if the offsets are from the beginning of the string.
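This reasoning can be sketched in Python (a hypothetical helper, not from the spec, assuming 4-byte little-endian offsets measured from the start of the whole string):

```python
def decode_variable_list(data: bytes, offset_size: int = 4) -> list:
    """Split the serialization of a list of variable-size elements
    into the raw byte string for each element."""
    if data == b'':
        return []
    # The one thing you know ahead of time: the string starts with the
    # first offset.  Because that offset points at the end of the offset
    # table, it also reveals how many offsets (elements) there are.
    first_offset = int.from_bytes(data[:offset_size], 'little')
    num_elements = first_offset // offset_size
    offsets = [
        int.from_bytes(data[i * offset_size:(i + 1) * offset_size], 'little')
        for i in range(num_elements)
    ]
    # Each element spans from its offset to the next offset (or the end).
    return [
        data[start:end]
        for start, end in zip(offsets, offsets[1:] + [len(data)])
    ]
```

Note that `num_elements` can only be derived this way because the offsets are measured from the beginning of the string; if they were relative to the second section, the first offset of such a list would always be zero and would tell you nothing about the count.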
Ah, you're right, didn't think of lists.
@karalabe would you mind giving this another 👀 now that it's been updated to interleave the fixed-size elements and the offsets? @vbuterin do you want to review this? What is the process for getting this merged? Should it be proposed on the broader Eth2.0 call, or is this something we can just move forward with?
specs/simple-serialize.md (Outdated)

```python
return "".join([serialize(element) for element in value])
```

* The first section is *fixed size* for all types, containing the concatenation of
The algorithm looks good to me! Seems reasonable to discuss on the next call.
specs/simple-serialize.md (Outdated)

```python
return "".join([serialize(element) for element in value])
```

* The first section is fixed-size for all types, containing the concatenation of
    - The serialized representation for each of the fixed-size elements from the value.
Did you consider putting offsets at their respective field locations? This maintains the fixed size of the fixed-size portion, but you'll find data about each field at the "natural" location.
Further, it allows you to extend the type with new fields more easily without breaking existing data: the new fixed-size fields don't change the location of the offsets, and in new data, the offsets jump over the new fields. A length prefix on the fixed-size section is the final piece needed, and a newer schema with more fields can still be used to read old data. This will be very handy when upgrading to eth 2.1.
Another common and useful thing for easy decoding: alignment. Tightly packed, or aligned to their size? The latter is better for mmap-style access if you're simply storing blobs somewhere; it's nice for flash memory and other resource-constrained devices. Alignment to size should be very cheap for eth2 - most fields are uint64 or bytes32 and thus already aligned, unless someone accidentally puts a bool in the wrong place, and then the whole object is ruined - having it in the spec makes it easier to rely upon, even though in practice we can use it for most existing types already.
Example: for an object whose fields are `int, list, int`, you'll encode: `fixed-length, int, offset, int, list-data`.
If you read the spec, you'll see it defines the encoding just as you suggest, with the offsets intermixed with the fixed data elements.
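As a sketch of that interleaved layout (illustrative names, not from the spec, assuming a `{uint32, bytes, uint32}` container and 4-byte little-endian offsets measured from the start of the serialization):

```python
def encode_int_list_int(a: int, items: bytes, b: int) -> bytes:
    """Interleave fixed-size fields with the offset of the one
    variable-size field; `items` is the already-serialized list data."""
    fixed_size = 4 + 4 + 4              # uint32 + offset + uint32
    offset = fixed_size                 # variable data starts right after
    return (
        a.to_bytes(4, 'little')         # fixed-size field, in place
        + offset.to_bytes(4, 'little')  # offset standing in for the list
        + b.to_bytes(4, 'little')       # fixed-size field, in place
        + items                         # variable-size data at the end
    )
```

The offset occupies the list's "natural" field position, so the fixed section stays fixed-size regardless of the list's length.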
Right, thanks. I got confused by the bullet points, taking them to mean that all offsets come after all fixed-length fields.
I do not see that the fixed-length portion is length-prefixed - can you think of any other way to handle the following situation: we want to extend `int, list, int` to `int, list, int, int` but use the new schema to read the old data?
If there's a variable-length field in there, a trick would be to look at the first offset. If there is none, however..
There could be a convention to have a schema id as the first field for objects that may need to be changed in the future (e.g. network messages). Then one could read the first int, choose the appropriate schema for the rest of the data, and continue as normal from there.
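A minimal sketch of that convention (everything here is hypothetical, not part of the spec):

```python
def decode_with_schema_id(data: bytes, schemas: dict):
    """Read a leading 4-byte little-endian schema id, then hand the
    rest of the payload to the decoder registered for that id."""
    schema_id = int.from_bytes(data[:4], 'little')
    decode = schemas[schema_id]   # choose the appropriate schema
    return decode(data[4:])       # continue as normal from there
```

Old and new schemas can then coexist as different entries in the registry, with no change to the serialization of the remaining fields.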
> can you think of any other way to handle the following situation: we want to extend `int, list, int` to `int, list, int, int` but use the new schema to read the old data.
My first instinct is to say this is not supported. My understanding is that we're addressing this at the wire protocol level, so that we don't need this kind of compatibility, and extending the schema would mean new networking protocol commands with new schemas.
> Right, thanks. I got confused by the bullet points, taking them to mean that all offsets come after all fixed-length fields.
This is good information. It suggests that the spec still needs some improvement to make this concept more clear. I'll take another stab at clarifying and cleaning up language tomorrow.
> There could be a convention to have a schema id

> at the wire protocol level
Yes, agree with @pipermerriam that's sufficient for breaking upgrades - I don't think we need anything in this part of the spec for that kind. By breaking, I mean that it's essentially a mechanism for introducing new protocols, which may or may not be similar to existing ones.
However, the introduction of a length field would allow forwards- and backwards-compatible upgrades - upgrades where we don't need to break existing readers - incredibly useful for long-lived formats (think x86 machine code encoding level longevity). The length then works a little bit like a smart version - new clients use it to distinguish between new and old data, while old clients continue to work without needing a rewrite. The cost is fairly low (4 bytes, or whatever) - it doesn't change streaming or any other properties of the encoding.
The idea is not novel in any way - it's very commonly used in C structs, for example when creating public APIs with the ability to upgrade internal details in the future.
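A sketch of how such a length prefix would let a newer reader handle old data (hypothetical; not part of this PR):

```python
def read_old_with_new_schema(data: bytes, known_fixed_len: int):
    """A 4-byte little-endian prefix records how long the fixed section
    actually is, so a newer reader can tell old data (shorter fixed
    section, missing the new fields) from new data."""
    actual_fixed_len = int.from_bytes(data[:4], 'little')
    fixed = data[4:4 + actual_fixed_len]
    heap = data[4 + actual_fixed_len:]
    # Fields beyond the recorded length simply don't exist in old data.
    has_new_fields = actual_fixed_len >= known_fixed_len
    return fixed, heap, has_new_fields
```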
@arnetheduck I understand that use case but I think I have a strong-ish case against it.
If the length prefix, rather than the schema, dictates the length of the first section, there can be a region of bytes between the end of the first section an old reader knows about and the beginning of the second section which is undefined and allowed to be anything. It's unclear to me how you would account for this in the trie hash. It means that anyone who handles the V2 version of those objects would have to somehow know how to hash them as both versions and would likely need to be able to reference them with either hash.
This outlines why I dislike this approach. It reminds me of this: https://linux-audit.com/linux-history-how-dot-files-became-hidden-files/
A seemingly innocuous choice can have significant unforeseen downstream effects. I'm 👎 but totally open to being convinced otherwise, though you would have to convince me of the following two things.

- There is a compelling reason why the wire protocol level mechanisms for upgrading would not be sufficient.
- There is a solution for the trie hashing problem which doesn't introduce required extra complexity into every SSZ library.
- I believe SSZ usage, if done well, will leak outside of this phase0 spec and into other areas of ETH.
- One of the reasons to use SOS that's been mentioned is to access data from contracts efficiently. These are generally hard to upgrade and I believe would benefit from this capability - the ability to keep reading certain blobs without needing to be upgraded.
- Passive listeners and tooling that operate in a read-only fashion on the bytes they know about already degrade gracefully.
- Nodes / entities that filter and forward based on the content of some fields, and that don't need access to all data.
- This is an interesting point, thanks! I think there are at least some mitigating factors - mainly that core clients that produce, verify, and above all sign messages will necessarily need to be upgraded to understand what they're signing - thus the bytes can generally not be anything - they need to be something credible in order to be accepted by the network as a whole. How to do the hashing in case you actually need to do that, hm, that would be trickier indeed. Might not be worth the cost, like you say.
Can this be merged?
What's the status of an implementation in pyssz, or of a minimal implementation here similar to https://github.com/ethereum/eth2.0-specs/blob/dev/utils/phase0/minimal_ssz.py? We need an implementation for this to get into a release so we can generate the appropriate test vectors upon release.
That's because my commits got nuked 😂. Please check my cleanup didn't break things.
@JustinDrake it looks like your commits removed the actual text that describes the layout. I'll see about pushing that back in on Monday.
To me the text was more confusing than helpful. The code is pretty readable now, and the layout is described in the code comments.
Also bear in mind that we will shortly make the SSZ spec executable, similar to the main phase 0 document. In that context, text is mostly limited to code comments.
Hi guys, a question here. Why do we discard the concept of a length prefix in this proposal? I think it should be helpful when we decode all variable-size types. Besides, what's the default zero value for variable-size types?
@rjl493456442 the length prefix was dropped because it isn't needed. Variable-size types either:
@JustinDrake I'm a bit surprised at this. I can understand wanting to re-word it in a manner that is easier to understand, but fully removing any textual description of the layout feels like it will result in added confusion. Defining the standard as code means that the standard is the functionality of that code, along with any bugs or edge cases it happens to not handle correctly. This is what the bitcoin protocol was for a long time (and maybe still is???). I respect that this is a much simpler specification, but I'm 👎 on this approach as it feels like a poor way to define a spec.
I think this is a slightly different case in that the spec is written as code, but code that is as clearly written as possible. Clients will contain much more optimized code that will have to handle a lot more special cases.
What was wrong?
The current SSZ specification uses a sequential layout for serialization. In contexts where computation is expensive such as the EVM, it can be valuable to have capabilities for quick indexing into serialized data without decoding the full message.
The current SSZ spec allows for sub-linear access times but is still probably 2-3x worse than it could be.
How it was fixed.
@karalabe created SOS which can be found here: https://gist.github.com/karalabe/3a25832b1413ee98daad9f0c47be3632
It is designed to lay out the serialized values in such a way that requires minimal decoding to access inner elements.
Old SSZ Example

Under the old/existing SSZ layout, the serialized representation of a value with the type `{uint32, bytes, bool}` encodes the value `{1234, b'unicorns', True}` as:

```
0x
11000000          # length prefix for the container (17 bytes)
d2040000          # uint32: 1234
08000000          # length prefix for the bytes value (8 bytes)
756e69636f726e73  # b'unicorns'
01                # bool: True
```
New SSZ example using SOS style layout

Under the new SOS style layout, the serialized value is laid out as two sections:

```
<uint32-bytes><offset-for-bytes><bool-bytes>
<literal-bytes>
```

The new encoded value `{1234, b'unicorns', True}` under this schema, broken down into its individual components, would be:

```
0x
d2040000          # uint32: 1234
09000000          # offset of the bytes data (9, from the start of the string)
01                # bool: True
756e69636f726e73  # b'unicorns'
```
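The cheap field access this layout buys can be illustrated against the worked example (a sketch, assuming the `{uint32, bytes, bool}` schema and 4-byte little-endian offsets measured from the start of the string):

```python
def extract_bytes_field(data: bytes) -> bytes:
    """Read only the `bytes` field of a serialized {uint32, bytes, bool}
    value under the SOS layout, without decoding the other fields."""
    # The offset sits at a fixed position: right after the 4-byte uint32.
    start = int.from_bytes(data[4:8], 'little')
    # The bytes field is the last (only) variable field, so it runs to the end.
    return data[start:]
```

No length prefixes need to be walked: the reader jumps straight to the offset's known position, then straight to the data.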
Why this is better

Encoding and decoding should have very similar, if not the same, computational complexity. Both the old and new scheme require knowledge of the length of a serialized value to encode, thus requiring that you first encode the value, hold it in memory, and compute the length before you can commit to the full serialized representation.

What makes this better is that it has a guaranteed upper bound of `O(log(N))` complexity for accessing nested elements. This optimization allows environments with limited computational resources to more easily handle these data structures. There is also a minor efficiency gain in not requiring size prefixes on top-level objects.