
Update SSZ spec to use SOS style offset based layout. #787

Merged

Conversation

@pipermerriam (Member) commented Mar 15, 2019:

What was wrong?

The current SSZ specification uses a sequential layout for serialization. In contexts where computation is expensive, such as the EVM, it can be valuable to be able to index quickly into serialized data without decoding the full message.

The current SSZ spec allows for sub-linear access times but is still probably 2-3x worse than it could be.

How it was fixed.

@karalabe created SOS, which can be found here: https://gist.github.com/karalabe/3a25832b1413ee98daad9f0c47be3632

It is designed to lay out the serialized values in a way that requires minimal decoding to access inner elements.

Old SSZ Example

Under the old/existing SSZ layout, the serialized representation of a value with the type {uint32, bytes, bool} would be arranged as:

```
<total-length-prefix><uint32-bytes><length-prefix><literal-bytes><bool-byte>
```

The encoded value {1234, b'unicorns', True} under this schema would be:

```
0x11000000d204000008000000756e69636f726e7301
```

Broken down into the individual components:

```
total-length-prefix  uint32-bytes  bytes-length-prefix  literal-bytes     bool-byte
11000000             d2040000      08000000             756e69636f726e73  01
```
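
For concreteness, here is a minimal Python sketch that reproduces the bytes above under the old layout (the function name `old_ssz_encode` and the per-field packing are illustrative assumptions, not the spec's API):

```python
import struct

def old_ssz_encode(number: int, data: bytes, flag: bool) -> bytes:
    # Serialize the fields sequentially, with a length prefix on the
    # variable-size bytes field and a total-length prefix on the whole value.
    body = (
        struct.pack("<I", number)            # uint32, little-endian
        + struct.pack("<I", len(data))       # length prefix for the bytes field
        + data                               # literal bytes
        + (b"\x01" if flag else b"\x00")     # bool byte
    )
    return struct.pack("<I", len(body)) + body

assert old_ssz_encode(1234, b"unicorns", True).hex() == (
    "11000000d204000008000000756e69636f726e7301"
)
```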

New SSZ example using SOS style layout

Under the new SOS style layout, the serialized value would be laid out as:

```
<uint32-bytes><offset-for-bytes><bool-byte><literal-bytes>
```

Seen as its two sections, it would be:

```
section-1                                    section-2
<uint32-bytes><offset-for-bytes><bool-byte>  <literal-bytes>
```

The new encoded value {1234, b'unicorns', True} under this schema would be:

```
0xd20400000900000001756e69636f726e73
```

Broken down into the individual components:

```
uint32-bytes  offset-for-bytes  bool-byte  literal-bytes
d2040000      09000000          01         756e69636f726e73
```
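
And a matching sketch for the new layout, hard-coded to this one {uint32, bytes, bool} schema (`sos_encode` is an illustrative name; a real implementation would compute the section sizes generically):

```python
import struct

def sos_encode(number: int, data: bytes, flag: bool) -> bytes:
    # Section 1 is fixed-size: the uint32, a 4-byte offset standing in for
    # the variable-size bytes field, and the bool byte.
    section_1_size = 4 + 4 + 1
    section_1 = (
        struct.pack("<I", number)
        + struct.pack("<I", section_1_size)  # offset: data starts right after section 1
        + (b"\x01" if flag else b"\x00")
    )
    # Section 2 is the concatenation of the variable-size parts.
    return section_1 + data

assert sos_encode(1234, b"unicorns", True).hex() == (
    "d20400000900000001756e69636f726e73"
)
```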

Why this is better

Encoding and decoding should have very similar, if not the same, computational complexity. Both the old and new schemes require knowledge of the length of a serialized value to encode it, meaning you must first encode the value, hold it in memory, and compute the length before you can commit to the full serialized representation.

What makes this better is that it guarantees an upper bound of O(log(N)) complexity for accessing nested elements. This allows environments with limited computational resources to handle these data structures more easily.
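
To illustrate the cheap indexing, here is a sketch that extracts the variable-size field from the example above using only its offset (`read_bytes_field` is hypothetical, with field positions hard-coded for this one schema):

```python
import struct

def read_bytes_field(serialized: bytes) -> bytes:
    # The offset sits at byte 4, right after the uint32, and counts from
    # the start of the serialization to the start of the literal bytes.
    (offset,) = struct.unpack_from("<I", serialized, 4)
    # The bytes field is the only variable-size element, so it runs to the end.
    return serialized[offset:]

assert read_bytes_field(
    bytes.fromhex("d20400000900000001756e69636f726e73")
) == b"unicorns"
```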

There is also a minor efficiency gain in not requiring size prefixes on top-level objects.

@pipermerriam changed the title from "Update SSZ spec to use SOS style offset based layout." to "[WIP] Update SSZ spec to use SOS style offset based layout." on Mar 15, 2019
@vbuterin (Contributor):

I'm inclined to favor this. BTW note that the ratio of added to removed lines in this PR overstates the complexity increase (i.e. it looks bad, but the actual complexity increase is quite slight), because most of the PR just spends time defining fixed size and variable size (whereas the existing spec simply gives handwavy definitions) and then providing an English translation of the code.

@pipermerriam force-pushed the piper/add-sos-style-offset-serialization-to-ssz branch from dfa68c1 to d8b0d32 on March 18, 2019 14:57
@pipermerriam changed the title from "[WIP] Update SSZ spec to use SOS style offset based layout." to "Update SSZ spec to use SOS style offset based layout." on Mar 18, 2019
@pipermerriam force-pushed the piper/add-sos-style-offset-serialization-to-ssz branch from d8b0d32 to 87f8e05 on March 18, 2019 14:59
@pipermerriam (Member, Author):

I've rebased this to include the tuple -> vector rename, remove the duplicate definition of fixed/variable size, and clean up some of the language.

```
@@ -33,6 +33,8 @@ This is a **work in progress** describing typing, serialization and Merkleizatio

### Composite types

Composite types are limited to a maximum of `2**32 - 1` elements.
```
Collaborator:

For Merkleisation purposes this is unnecessarily strict. We want this limit for lists (because of the length mixing), but not for containers and vectors. One example use case is that we want to represent the sparse Merkle trees in shards as vectors of length 2^256.

Member Author:

I feel like I had a reason for this but I'm failing to remember it. Removing it unless I remember in the next few minutes.

@karalabe (Member) commented Mar 18, 2019:

Btw, I think the 3 binary sections can be replaced with two. This was pointed out by @Arachnid on Twitter too. I'll need to dig into it again a bit to figure out exactly how that worked; I don't remember.

@pipermerriam (Member, Author):

@karalabe I've been thinking this over too, though I can come up with reasons for both. I'm curious to hear what others think.

  • Why keep 3 sections
    • Maybe easier to explain at the spec level, keeping offsets and data distinct.
  • Why go to 2 sections
    • Can be viewed as more pure, eliminating an unnecessary extra section that only holds offsets.
    • Might be slightly more efficient to index into the data, since you don't have to parse the types to determine the appropriate location of a dynamic-length value's offset (the most compelling argument I see).

@pipermerriam force-pushed the piper/add-sos-style-offset-serialization-to-ssz branch from ca9cb8f to d21e781 on March 18, 2019 16:26
@pipermerriam (Member, Author):

I've updated the written spec to use only two sections, with the first section being a combination of the serialized fixed-size values and the offsets.

@pipermerriam (Member, Author):

@jannikluhn would you mind reviewing this? I'd especially like help making the implementation example as easy to read and understand as possible.

```python
# All offset placeholders must have been filled in before the two
# sections are joined.
assert not any(part is None for part in section_1_parts)
section_1 = ''.join(section_1_parts)

return ''.join([section_1, section_2])
```
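
For reviewers' context, here is a self-contained sketch of the two-section assembly that the excerpt above concludes (the `(is_fixed_size, serialized_bytes)` input representation and the name `serialize_two_sections` are assumptions for this sketch, not the spec's actual code):

```python
import struct

def serialize_two_sections(parts):
    # `parts` is a list of (is_fixed_size, serialized_bytes) pairs; in the
    # real spec these come from recursively serializing each element.
    # Section 1's size is known up front: fixed-size parts keep their full
    # serialization, variable-size parts contribute a 4-byte offset.
    section_1_size = sum(len(ser) if fixed else 4 for fixed, ser in parts)

    section_1_parts = []
    section_2_parts = []
    offset = section_1_size  # offsets count from the start of the string
    for fixed, ser in parts:
        if fixed:
            section_1_parts.append(ser)
        else:
            section_1_parts.append(struct.pack("<I", offset))
            section_2_parts.append(ser)
            offset += len(ser)

    assert not any(part is None for part in section_1_parts)
    return b"".join(section_1_parts) + b"".join(section_2_parts)

# The {uint32, bytes, bool} value from the PR description:
encoded = serialize_two_sections([
    (True, struct.pack("<I", 1234)),
    (False, b"unicorns"),
    (True, b"\x01"),
])
assert encoded.hex() == "d20400000900000001756e69636f726e73"
```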

@jannikluhn (Contributor):

Looks like we've got two options for the offset: It could be relative to the beginning of the whole string (as it is right now) or relative to the beginning of the second section. In my opinion, the first option is slightly more elegant (e.g. an offset of zero is meaningful), but the second slightly more practical (a minimal function that extracts a dynamic element doesn't have to know the size of the first section, even though it's a constant for a given type). I don't really have a preference either way.

@pipermerriam (Member, Author):

> Looks like we've got two options for the offset: It could be relative to the beginning of the whole string (as it is right now) or relative to the beginning of the second section.

I think that it MUST be relative to the beginning of the whole string. Lists of dynamic-size elements are why.

When you have a list of dynamic-size elements and the serialized value is not the empty byte string, you only know one thing ahead of time:

  • The first 4 bytes must be a serialized offset.

Beyond that you know nothing about how many elements there are or where the offsets end and the serialized elements start. Once you read the first offset, you know two things:

  • Where the first serialized element starts.
  • How many elements are in the list (since the first offset also denotes where the offsets end).

You can only know this information if the offsets are relative to the beginning of the string.
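
A minimal sketch of that decoding step, assuming a list whose elements are all variable-size, so the leading region consists purely of 4-byte offsets (`decode_dynamic_list` is a hypothetical helper):

```python
def decode_dynamic_list(serialized: bytes) -> list:
    # Offsets count from the start of the string, so the first offset also
    # marks where the offsets end, which fixes the element count.
    if not serialized:
        return []
    first_offset = int.from_bytes(serialized[:4], "little")
    count = first_offset // 4
    starts = [int.from_bytes(serialized[4 * i:4 * i + 4], "little")
              for i in range(count)]
    ends = starts[1:] + [len(serialized)]
    return [serialized[start:end] for start, end in zip(starts, ends)]

# Two variable-size elements, b"ab" and b"cde": offsets 8 and 10.
assert decode_dynamic_list(
    bytes.fromhex("080000000a0000006162636465")
) == [b"ab", b"cde"]
```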

@jannikluhn (Contributor):

> I think that it MUST be relative to the beginning of the whole string. Lists of dynamic-size elements are why.

Ah, you're right, didn't think of lists.

@pipermerriam (Member, Author):

@karalabe would you mind giving this another 👀 now that it's been updated to interleave the fixed-size elements and the offsets?

@vbuterin do you want to review this?

What is the process for getting this merged? Should it be proposed on the broader Eth 2.0 call, or is this something we can just move forward with?

@hwwhww added the scope:SSZ Simple Serialize label on Mar 21, 2019
```python
return "".join([serialize(element) for element in value])
```
* The first section is *fixed size* for all types, containing the concatenation of


@vbuterin (Contributor):

The algorithm looks good to me! Seems reasonable to discuss on the next call.

return "".join([serialize(element) for element in value])
```
* The first section is fixed-size for all types, containing the concatenation of
- The serialized representation for each of the fixed-size elements from the value.
Contributor:

Did you consider putting offsets at their respective field locations? This maintains the fixed size of the fixed-size portion, but you'll find data about each field at the "natural" location.

Further, it allows you to extend it with new fields more easily without breaking existing data: the new fixed-size fields don't change the location of the offsets, and in new data, the offsets jump over the new fields. A length prefix for the fixed-size section is the final piece needed, and a newer schema with more fields can still be used to read old data. This will be very handy when upgrading to eth 2.1.

Another common and useful thing for easy decoding: alignment. Tightly packed, or aligned to their size? The latter is better for mmap-style access if you're simply storing blobs somewhere; it's nice for flash memory and other resource-constrained devices. Alignment to size should be very cheap for eth2 (most fields are uint64 or bytes32 and thus already aligned) unless someone accidentally puts a bool in the wrong place, and then the whole object is ruined. Having it in the spec makes it easier to rely upon, even though in practice we can use it for most existing types already.

Contributor:

Example: for an object whose fields are int, list, int, you'll encode: fixed-length, int, offset, int, list-data.

Member Author:

If you read the spec you'll see it defines the encoding just as you suggest, with the offsets intermixed with the fixed data elements.

Contributor:

Right, thanks. I got confused by the bullet points, taking them to mean that all offsets come after all fixed-length fields.

I do not see that the fixed-length portion is length-prefixed. Can you think of any other way to handle the following situation: we want to extend int, list, int to int, list, int, int, but use the new schema to read the old data?

If there's a variable-length field in there, a trick would be to look at the first offset. If there is none, however...

Contributor:

There could be a convention to have a schema id as the first field for objects that may need to be changed in the future (e.g. network messages). Then one could read the first int, choose the appropriate schema for the rest of the data, and continue as normal from there.

Member Author:

> Can you think of any other way to handle the following situation: we want to extend int, list, int to int, list, int, int, but use the new schema to read the old data?

My first instinct is to say this is not supported. My understanding is that we're addressing this at the wire protocol level so that we don't need this kind of compatibility and extending the schema would mean new networking protocol commands with new schemas.

> Right, thanks. I got confused by the bullet points, taking them to mean that all offsets come after all fixed-length fields.

This is good information. It suggests that the spec still needs some improvement to make this concept clearer. I'll take another stab at clarifying and cleaning up the language tomorrow.

Contributor:

> There could be a convention to have a schema id

> at the wire protocol level

Yes, I agree with @pipermerriam that that's sufficient for breaking upgrades; I don't think we need anything in this part of the spec for that kind. By breaking, I mean that it's essentially a mechanism for introducing new protocols, which may or may not be similar to existing ones.

However, the introduction of a length field would allow forwards- and backwards-compatible upgrades, upgrades where we don't need to break existing readers, which is incredibly useful for long-lived formats (think x86 machine code encoding levels of longevity). The length then works a little bit like a smart version: new clients use it to distinguish between new and old data, while old clients continue to work without needing a rewrite. The cost is fairly low (4 bytes, or whatever) and it doesn't change streaming or any other properties of the encoding.

The idea is not novel in any way; it's very commonly used in C structs, for example, when creating public APIs with the ability to upgrade internal details in the future.

Member Author:

@arnetheduck I understand that use case but I think I have a strong-ish case against it.

If the length prefix dictates the length of the first section, there can be a run of bytes between the end of the first section and the beginning of the second section which is undefined and allowed to be anything. It's unclear to me how you would account for this in the trie hash. It means that anyone who handles the V2 version of those objects would have to somehow know how to hash them as both versions, and would likely need to be able to reference them with either hash.

This outlines why I dislike this approach. It reminds me of this: https://linux-audit.com/linux-history-how-dot-files-became-hidden-files/

A seemingly innocuous choice can have significant unforeseen downstream effects. I'm 👎 but totally open to being convinced otherwise, though you would have to convince me of the following two things:

  1. There is a compelling reason why the wire-protocol-level mechanisms for upgrading would not be sufficient.
  2. There is a solution for the trie hashing problem which doesn't introduce required extra complexity into every SSZ library.

Contributor:

  1. I believe SSZ usage, if done well, will leak outside of this phase 0 spec and into other areas of ETH.
    • One of the reasons to use SOS that's been mentioned is to access data from contracts efficiently. Contracts are generally hard to upgrade and I believe would benefit from this capability: the ability to keep reading certain blobs without needing to be upgraded.
    • Passive listeners and tooling that operate in a read-only fashion on the bytes they know about already degrade gracefully.
    • Nodes / entities that filter and forward based on the content of some fields don't need access to all data.
  2. This is an interesting point, thanks! I think there are at least some mitigating factors, mainly that core clients that produce, verify, and above all sign messages will necessarily need to be upgraded to understand what they're signing; thus the bytes can generally not be anything, they need to be something credible in order to be accepted by the network as a whole. How to do the hashing in case you actually need to do that, hm, that would be trickier indeed. Might not be worth the cost, like you say.

@pipermerriam force-pushed the piper/add-sos-style-offset-serialization-to-ssz branch 3 times, most recently from f0cfc9d to f96bd33 on March 28, 2019 14:24
@pipermerriam (Member, Author) left a comment:

Can this be merged?

@djrtwo (Contributor) commented Apr 5, 2019:

What's the status of an implementation in pyssz, or as a minimal implementation here similar to https://github.com/ethereum/eth2.0-specs/blob/dev/utils/phase0/minimal_ssz.py?

We need an implementation for this to get into a release so we can generate the appropriate test vectors upon release.

@JustinDrake (Collaborator):

> Actually @JustinDrake I'm not seeing which cleanups you are referring to

That's because my commits got nuked 😂. Please check my cleanup didn't break things.

@pipermerriam (Member, Author):

@JustinDrake it looks like your commits removed the actual text that describes the layout. I'll see about pushing that back in on Monday.

@JustinDrake (Collaborator):

To me the text was more confusing than helpful. The code is pretty readable now, and the layout is described in the code comments:

  • Section 1: Serialisation of fixed parts interleaved with 4-byte offsets for the serialisations of variable parts
  • Section 2: Serialisation of variable parts

Also bear in mind that we will shortly make the SSZ spec executable, similar to the main phase 0 document. In that context, text is mostly limited to code comments.

@rjl493456442 (Member) commented Apr 15, 2019:

Hi guys, a question here. Why do we discard the concept of a length prefix in this proposal? I think it should be helpful when we decode all variable-size types.

Besides, what's the default zero value for variable-size types?

@pipermerriam (Member, Author):

@rjl493456442 the length prefix was dropped because it isn't needed. Variable-size types either:

  • have a known length, namely the length of the encoded bytes, when dealt with individually; or
  • have their length defined by the adjacent offset values when embedded in another serialized object (see the sketch below).
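
For illustration, here is a sketch of that second case (`element_length` is a hypothetical helper, not part of the spec):

```python
def element_length(offsets, i, total_size):
    # The i-th variable-size element runs from its offset to the next
    # offset, or to the end of the serialization for the last element.
    end = offsets[i + 1] if i + 1 < len(offsets) else total_size
    return end - offsets[i]

# In the {uint32, bytes, bool} example above: one offset (9) and a
# 17-byte serialization, so the bytes field is 8 bytes (b"unicorns").
assert element_length([9], 0, 17) == 8
```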

@pipermerriam (Member, Author):

@JustinDrake I'm a bit surprised at this.

I can understand wanting to re-word it in a manner that is easier to understand, but fully removing any textual description of the layout feels like it will result in added confusion. Defining the standard as code means that the standard is the functionality of that code, along with any bugs or edge cases it happens to not handle correctly. This is what the bitcoin protocol was for a long time (and maybe still is???). I respect that this is a much simpler specification, but I'm 👎 on this approach as it feels like a poor way to define a spec.

@dankrad (Contributor) commented Apr 16, 2019:

> I can understand wanting to re-word it in a manner that is easier to understand, but fully removing any textual description of the layout feels like it will result in added confusion. Defining the standard as code means that the standard is the functionality of that code, along with any bugs or edge cases it happens to not handle correctly. This is what the bitcoin protocol was for a long time (and maybe still is???). I respect that this is a much simpler specification, but I'm -1 on this approach as it feels like a poor way to define a spec.

I think this is a slightly different case, in that the spec is written as code, but code that is as clearly written as possible. Clients will contain much more optimized code that will have to handle a lot more special cases. However, I agree that in many places the spec would benefit from a bit more verbosity.
