
Update SSZ spec to use SOS style offset based layout. #787

Merged

Conversation

@pipermerriam (Member) commented Mar 15, 2019:

What was wrong?

The current SSZ specification uses a sequential layout for serialization. In contexts where computation is expensive, such as the EVM, it can be valuable to be able to index quickly into serialized data without decoding the full message.

The current SSZ spec allows for sub-linear access times but is still probably 2-3x worse than it could be.

How it was fixed.

@karalabe created SOS, which can be found here: https://gist.github.com/karalabe/3a25832b1413ee98daad9f0c47be3632

It is designed to lay out the serialized values in a way that requires minimal decoding to access inner elements.

Old SSZ Example

Under the old/existing SSZ layout, the serialized representation of a value with the type {uint32, bytes, bool} would be arranged as:

```
<total-length-prefix><uint32-bytes><length-prefix><literal-bytes><bool-byte>
```

The encoded value {1234, b'unicorns', True} under this schema would be:

```
0x11000000d204000008000000756e69636f726e7301
```

Broken down into the individual components:

```
total-length-prefix  uint32-bytes  bytes-length-prefix  literal-bytes     bool-byte
11000000             d2040000      08000000             756e69636f726e73  01
```
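
For concreteness, here is a minimal Python sketch that reproduces the bytes above under the old layout (the function name `old_ssz_encode` and the per-field packing are illustrative assumptions, not the spec's API):

```python
import struct

def old_ssz_encode(number: int, data: bytes, flag: bool) -> bytes:
    # Serialize the fields sequentially, with a length prefix on the
    # variable-size bytes field and a total-length prefix on the whole value.
    body = (
        struct.pack("<I", number)            # uint32, little-endian
        + struct.pack("<I", len(data))       # length prefix for the bytes field
        + data                               # literal bytes
        + (b"\x01" if flag else b"\x00")     # bool byte
    )
    return struct.pack("<I", len(body)) + body

assert old_ssz_encode(1234, b"unicorns", True).hex() == (
    "11000000d204000008000000756e69636f726e7301"
)
```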

New SSZ example using SOS style layout

Under the new SOS style layout, the serialized value would be laid out as:

```
<uint32-bytes><offset-for-bytes><bool-byte><literal-bytes>
```

Seen as its two sections, it would be:

```
section-1                                    section-2
<uint32-bytes><offset-for-bytes><bool-byte>  <literal-bytes>
```

The new encoded value {1234, b'unicorns', True} under this schema would be:

```
0xd20400000900000001756e69636f726e73
```

Broken down into the individual components:

```
uint32-bytes  offset-for-bytes  bool-byte  literal-bytes
d2040000      09000000          01         756e69636f726e73
```
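
And a matching sketch for the new layout, hard-coded to this one {uint32, bytes, bool} schema (`sos_encode` is an illustrative name; a real implementation would compute the section sizes generically):

```python
import struct

def sos_encode(number: int, data: bytes, flag: bool) -> bytes:
    # Section 1 is fixed-size: the uint32, a 4-byte offset standing in for
    # the variable-size bytes field, and the bool byte.
    section_1_size = 4 + 4 + 1
    section_1 = (
        struct.pack("<I", number)
        + struct.pack("<I", section_1_size)  # offset: data starts right after section 1
        + (b"\x01" if flag else b"\x00")
    )
    # Section 2 is the concatenation of the variable-size parts.
    return section_1 + data

assert sos_encode(1234, b"unicorns", True).hex() == (
    "d20400000900000001756e69636f726e73"
)
```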

Why this is better

Encoding and decoding should have very similar, if not the same, computational complexity. Both the old and new schemes require knowledge of the length of a serialized value to encode it, meaning you must first encode the value, hold it in memory, and compute the length before you can commit to the full serialized representation.

What makes this better is that it guarantees an upper bound of O(log(N)) complexity for accessing nested elements. This allows environments with limited computational resources to handle these data structures more easily.
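
To illustrate the cheap indexing, here is a sketch that extracts the variable-size field from the example above using only its offset (`read_bytes_field` is hypothetical, with field positions hard-coded for this one schema):

```python
import struct

def read_bytes_field(serialized: bytes) -> bytes:
    # The offset sits at byte 4, right after the uint32, and counts from
    # the start of the serialization to the start of the literal bytes.
    (offset,) = struct.unpack_from("<I", serialized, 4)
    # The bytes field is the only variable-size element, so it runs to the end.
    return serialized[offset:]

assert read_bytes_field(
    bytes.fromhex("d20400000900000001756e69636f726e73")
) == b"unicorns"
```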

There is also a minor efficiency gain in not requiring size prefixes on top-level objects.

@pipermerriam changed the title from "Update SSZ spec to use SOS style offset based layout." to "[WIP] Update SSZ spec to use SOS style offset based layout." on Mar 15, 2019
@vbuterin (Contributor):

I'm inclined to favor this. BTW note that the ratio of added to removed lines in this PR overstates the complexity increase (i.e. it looks bad, but the actual complexity increase is quite slight), because most of the PR just spends time defining fixed size and variable size (whereas the existing spec simply gives handwavy definitions) and then providing an English translation of the code.

@pipermerriam force-pushed the piper/add-sos-style-offset-serialization-to-ssz branch from dfa68c1 to d8b0d32 on March 18, 2019 14:57
@pipermerriam changed the title from "[WIP] Update SSZ spec to use SOS style offset based layout." to "Update SSZ spec to use SOS style offset based layout." on Mar 18, 2019
@pipermerriam force-pushed the piper/add-sos-style-offset-serialization-to-ssz branch from d8b0d32 to 87f8e05 on March 18, 2019 14:59
@pipermerriam (Member, Author):

I've rebased this to include the tuple -> vector rename, remove the duplicate definition of fixed/variable size, and clean up some of the language.

```
@@ -33,6 +33,8 @@ This is a **work in progress** describing typing, serialization and Merkleizatio

### Composite types

Composite types are limited to a maximum of `2**32 - 1` elements.
```
Collaborator:

For Merkleisation purposes this is unnecessarily strict. We want this limit for lists (because of the length mixing), but not for containers and vectors. One example use case is that we want to represent the sparse Merkle trees in shards as vectors of length 2^256.

Member Author:

I feel like I had a reason for this but I'm failing to remember it. Removing it unless I remember in the next few minutes.

@karalabe (Member) commented Mar 18, 2019:

Btw, I think the 3 binary sections can be replaced with two. This was pointed out by @Arachnid on Twitter too. I'll need to dig into it again a bit to figure out exactly how that worked; I don't remember.

@pipermerriam (Member, Author):

@karalabe I've been thinking this over too, though I can come up with reasons for both. I'm curious to hear what others think.

  • Why keep 3 sections
    • Maybe easier to explain at the spec level, keeping offsets and data distinct.
  • Why go to 2 sections
    • Can be viewed as more pure, eliminating an unnecessary extra section that only holds offsets.
    • Might be slightly more efficient to index into the data, since you don't have to parse the types to determine the appropriate location of a dynamic-length value's offset (the most compelling argument I see).

@pipermerriam force-pushed the piper/add-sos-style-offset-serialization-to-ssz branch from ca9cb8f to d21e781 on March 18, 2019 16:26
@pipermerriam (Member, Author):

I've updated the written spec to use only two sections, with the first section being a combination of the serialized fixed-size values and the offsets.

@pipermerriam (Member, Author):

@jannikluhn would you mind reviewing this? I'd especially like help making the implementation example as easy to read and understand as possible.

```python
# All offset placeholders must have been filled in before the two
# sections are joined.
assert not any(part is None for part in section_1_parts)
section_1 = ''.join(section_1_parts)

return ''.join([section_1, section_2])
```
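
For reviewers' context, here is a self-contained sketch of the two-section assembly that the excerpt above concludes (the `(is_fixed_size, serialized_bytes)` input representation and the name `serialize_two_sections` are assumptions for this sketch, not the spec's actual code):

```python
import struct

def serialize_two_sections(parts):
    # `parts` is a list of (is_fixed_size, serialized_bytes) pairs; in the
    # real spec these come from recursively serializing each element.
    # Section 1's size is known up front: fixed-size parts keep their full
    # serialization, variable-size parts contribute a 4-byte offset.
    section_1_size = sum(len(ser) if fixed else 4 for fixed, ser in parts)

    section_1_parts = []
    section_2_parts = []
    offset = section_1_size  # offsets count from the start of the string
    for fixed, ser in parts:
        if fixed:
            section_1_parts.append(ser)
        else:
            section_1_parts.append(struct.pack("<I", offset))
            section_2_parts.append(ser)
            offset += len(ser)

    assert not any(part is None for part in section_1_parts)
    return b"".join(section_1_parts) + b"".join(section_2_parts)

# The {uint32, bytes, bool} value from the PR description:
encoded = serialize_two_sections([
    (True, struct.pack("<I", 1234)),
    (False, b"unicorns"),
    (True, b"\x01"),
])
assert encoded.hex() == "d20400000900000001756e69636f726e73"
```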

@jannikluhn (Contributor):

Looks like we've got two options for the offset: It could be relative to the beginning of the whole string (as it is right now) or relative to the beginning of the second section. In my opinion, the first option is slightly more elegant (e.g. an offset of zero is meaningful), but the second slightly more practical (a minimal function that extracts a dynamic element doesn't have to know the size of the first section, even though it's a constant for a given type). I don't really have a preference either way.

@pipermerriam (Member, Author):

> Looks like we've got two options for the offset: It could be relative to the beginning of the whole string (as it is right now) or relative to the beginning of the second section.

I think that it MUST be relative to the beginning of the whole string. Lists of dynamic-size elements are why.

When you have a list of dynamic-size elements and the serialized value is not the empty byte string, you only know one thing ahead of time:

  • The first 4 bytes must be a serialized offset.

Beyond that you know nothing about how many elements there are or where the offsets end and the serialized elements start. Once you read the first offset, you know two things:

  • Where the first serialized element starts.
  • How many elements are in the list (since the first offset also denotes where the offsets end).

You can only know this information if the offsets are relative to the beginning of the string.
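
A minimal sketch of that decoding step, assuming a list whose elements are all variable-size, so the leading region consists purely of 4-byte offsets (`decode_dynamic_list` is a hypothetical helper):

```python
def decode_dynamic_list(serialized: bytes) -> list:
    # Offsets count from the start of the string, so the first offset also
    # marks where the offsets end, which fixes the element count.
    if not serialized:
        return []
    first_offset = int.from_bytes(serialized[:4], "little")
    count = first_offset // 4
    starts = [int.from_bytes(serialized[4 * i:4 * i + 4], "little")
              for i in range(count)]
    ends = starts[1:] + [len(serialized)]
    return [serialized[start:end] for start, end in zip(starts, ends)]

# Two variable-size elements, b"ab" and b"cde": offsets 8 and 10.
assert decode_dynamic_list(
    bytes.fromhex("080000000a0000006162636465")
) == [b"ab", b"cde"]
```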

@jannikluhn (Contributor):

> I think that it MUST be relative to the beginning of the whole string. Lists of dynamic-size elements are why.

Ah, you're right, didn't think of lists.

@pipermerriam (Member, Author):

@karalabe would you mind giving this another 👀 now that it's been updated to interleave the fixed-size elements and the offsets?

@vbuterin do you want to review this?

What is the process for getting this merged? Should it be proposed on the broader Eth 2.0 call, or is this something we can just move forward with?

@hwwhww added the scope:SSZ Simple Serialize label on Mar 21, 2019
```python
return "".join([serialize(element) for element in value])
```
* The first section is *fixed size* for all types, containing the concatenation of


@vbuterin (Contributor):

The algorithm looks good to me! Seems reasonable to discuss on the next call.

return "".join([serialize(element) for element in value])
```
* The first section is fixed-size for all types, containing the concatenation of
- The serialized representation for each of the fixed-size elements from the value.
Contributor:

Did you consider putting offsets at their respective field locations? This maintains the fixed size of the fixed-size portion, but you'll find data about each field at the "natural" location.

Further, it allows you to extend it with new fields more easily without breaking existing data: the new fixed-size fields don't change the location of the offsets, and in new data, the offsets jump over the new fields. A length prefix for the fixed-size section is the final piece needed, and a newer schema with more fields can still be used to read old data. This will be very handy when upgrading to eth 2.1.

Another common and useful thing for easy decoding: alignment. Tightly packed, or aligned to their size? The latter is better for mmap-style access if you're simply storing blobs somewhere; it's nice for flash memory and other resource-constrained devices. Alignment to size should be very cheap for eth2 (most fields are uint64 or bytes32 and thus already aligned) unless someone accidentally puts a bool in the wrong place, and then the whole object is ruined. Having it in the spec makes it easier to rely upon, even though in practice we can use it for most existing types already.

Contributor:

Example: for an object whose fields are int, list, int, you'll encode: fixed-length, int, offset, int, list-data.

Member Author:

If you read the spec you'll see it defines the encoding just as you suggest, with the offsets intermixed with the fixed data elements.

Contributor:

Right, thanks. I got confused by the bullet points, taking them to mean that all offsets come after all fixed-length fields.

I do not see that the fixed-length portion is length-prefixed. Can you think of any other way to handle the following situation: we want to extend int, list, int to int, list, int, int, but use the new schema to read the old data?

If there's a variable-length field in there, a trick would be to look at the first offset. If there is none, however...

Contributor:

There could be a convention to have a schema id as the first field for objects that may need to be changed in the future (e.g. network messages). Then one could read the first int, choose the appropriate schema for the rest of the data, and continue as normal from there.

Member Author:

> Can you think of any other way to handle the following situation: we want to extend int, list, int to int, list, int, int, but use the new schema to read the old data?

My first instinct is to say this is not supported. My understanding is that we're addressing this at the wire protocol level so that we don't need this kind of compatibility and extending the schema would mean new networking protocol commands with new schemas.

> Right, thanks. I got confused by the bullet points, taking them to mean that all offsets come after all fixed-length fields.

This is good information. It suggests that the spec still needs some improvement to make this concept clearer. I'll take another stab at clarifying and cleaning up the language tomorrow.

Contributor:

> There could be a convention to have a schema id

> at the wire protocol level

Yes, I agree with @pipermerriam that that's sufficient for breaking upgrades; I don't think we need anything in this part of the spec for that kind. By breaking, I mean that it's essentially a mechanism for introducing new protocols, which may or may not be similar to existing ones.

However, the introduction of a length field would allow forwards- and backwards-compatible upgrades, upgrades where we don't need to break existing readers, which is incredibly useful for long-lived formats (think x86 machine code encoding levels of longevity). The length then works a little bit like a smart version: new clients use it to distinguish between new and old data, while old clients continue to work without needing a rewrite. The cost is fairly low (4 bytes, or whatever) and it doesn't change streaming or any other properties of the encoding.

The idea is not novel in any way; it's very commonly used in C structs, for example, when creating public APIs with the ability to upgrade internal details in the future.

Member Author:

@arnetheduck I understand that use case but I think I have a strong-ish case against it.

If the length prefix dictates the length of the first section, there can be a run of bytes between the end of the first section and the beginning of the second section which is undefined and allowed to be anything. It's unclear to me how you would account for this in the trie hash. It means that anyone who handles the V2 version of those objects would have to somehow know how to hash them as both versions, and would likely need to be able to reference them with either hash.

This outlines why I dislike this approach. It reminds me of this: https://linux-audit.com/linux-history-how-dot-files-became-hidden-files/

A seemingly innocuous choice can have significant unforeseen downstream effects. I'm 👎 but totally open to being convinced otherwise, though you would have to convince me of the following two things:

  1. There is a compelling reason why the wire-protocol-level mechanisms for upgrading would not be sufficient.
  2. There is a solution for the trie hashing problem which doesn't introduce required extra complexity into every SSZ library.

Contributor:

  1. I believe SSZ usage, if done well, will leak outside of this phase 0 spec and into other areas of ETH.
    • One of the reasons to use SOS that's been mentioned is to access data from contracts efficiently. Contracts are generally hard to upgrade and I believe would benefit from this capability: the ability to keep reading certain blobs without needing to be upgraded.
    • Passive listeners and tooling that operate in a read-only fashion on the bytes they know about already degrade gracefully.
    • Nodes / entities that filter and forward based on the content of some fields don't need access to all data.
  2. This is an interesting point, thanks! I think there are at least some mitigating factors, mainly that core clients that produce, verify, and above all sign messages will necessarily need to be upgraded to understand what they're signing; thus the bytes can generally not be anything, they need to be something credible in order to be accepted by the network as a whole. How to do the hashing in case you actually need to do that, hm, that would be trickier indeed. Might not be worth the cost, like you say.

@pipermerriam force-pushed the piper/add-sos-style-offset-serialization-to-ssz branch 3 times, most recently from f0cfc9d to f96bd33 on March 28, 2019 14:24
@pipermerriam (Member, Author) left a comment:

Can this be merged?

@djrtwo (Contributor) commented Apr 5, 2019:

What's the status of an implementation in pyssz, or as a minimal implementation here similar to https://github.com/ethereum/eth2.0-specs/blob/dev/utils/phase0/minimal_ssz.py?

We need an implementation for this to get into a release so we can generate the appropriate test vectors upon release.

@JustinDrake (Collaborator):

> Actually @JustinDrake I'm not seeing which cleanups you are referring to

That's because my commits got nuked 😂. Please check my cleanup didn't break things.

@pipermerriam (Member, Author):

@JustinDrake it looks like your commits removed the actual text that describes the layout. I'll see about pushing that back in on Monday.

@JustinDrake (Collaborator):

To me the text was more confusing than helpful. The code is pretty readable now, and the layout is described in the code comments:

  • Section 1: Serialisation of fixed parts interleaved with 4-byte offsets for the serialisations of variable parts
  • Section 2: Serialisation of variable parts

Also bear in mind that we will shortly make the SSZ spec executable, similar to the main phase 0 document. In that context, text is mostly limited to code comments.

@rjl493456442 (Member) commented Apr 15, 2019:

Hi guys, a question here. Why do we discard the concept of a length prefix in this proposal? I think it should be helpful when we decode all variable-size types.

Besides, what's the default zero value for variable-size types?

@pipermerriam (Member, Author):

@rjl493456442 the length prefix was dropped because it isn't needed. Variable-size types either:

  • have a known length, namely the length of the encoded bytes, when dealt with individually; or
  • have their length defined by the adjacent offset values when embedded in another serialized object (see the sketch below).
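
For illustration, here is a sketch of that second case (`element_length` is a hypothetical helper, not part of the spec):

```python
def element_length(offsets, i, total_size):
    # The i-th variable-size element runs from its offset to the next
    # offset, or to the end of the serialization for the last element.
    end = offsets[i + 1] if i + 1 < len(offsets) else total_size
    return end - offsets[i]

# In the {uint32, bytes, bool} example above: one offset (9) and a
# 17-byte serialization, so the bytes field is 8 bytes (b"unicorns").
assert element_length([9], 0, 17) == 8
```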

@pipermerriam (Member, Author):

@JustinDrake I'm a bit surprised at this.

I can understand wanting to re-word it in a manner that is easier to understand, but fully removing any textual description of the layout feels like it will result in added confusion. Defining the standard as code means that the standard is the functionality of that code, along with any bugs or edge cases it happens to not handle correctly. This is what the bitcoin protocol was for a long time (and maybe still is???). I respect that this is a much simpler specification, but I'm 👎 on this approach as it feels like a poor way to define a spec.

@dankrad (Contributor) commented Apr 16, 2019:

> I can understand wanting to re-word it in a manner that is easier to understand, but fully removing any textual description of the layout feels like it will result in added confusion. Defining the standard as code means that the standard is the functionality of that code, along with any bugs or edge cases it happens to not handle correctly. This is what the bitcoin protocol was for a long time (and maybe still is???). I respect that this is a much simpler specification, but I'm -1 on this approach as it feels like a poor way to define a spec.

I think this is a slightly different case, in that the spec is written as code, but code that is as clearly written as possible. Clients will contain much more optimized code that will have to handle a lot more special cases. However, I agree that in many places the spec would benefit from a bit more verbosity.
