simple_serialise - not possible to parse the structure without type information #94

Open · AlexeyAkhunov opened this issue Sep 13, 2018 · 17 comments

@AlexeyAkhunov commented Sep 13, 2018

Due to the special encoding of 'hash32' and 'address' without any prefix, it is not possible to parse simple_serialise messages without knowing the type. This might seem acceptable, but it precludes writing generic parsers that do not depend on the type being parsed. The consequence is that any tools/libraries that inspect and analyse traffic have to be type-dependent and must be updated whenever the types change. That creates unnecessarily tight coupling between the serialisation format and the tooling.

Also, the 'list' and 'type' encodings look like length-prefixed blobs; without type information it is impossible to tell how many elements/fields they contain or where the boundaries between those elements/fields lie.

RLP does not have this drawback, for example. But it has its own issues, of course.

Another concern is that a serialised message can mean different things depending on the schema. That is more of a safety/security concern: if we run two slightly incompatible versions of the software, we can get very hard-to-find bugs due to misinterpreted messages. Generally, there should be high assurance that after a schema change, messages produced under a different schema (or at least under the previous version of the schema) will no longer parse successfully.
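
A minimal Python sketch of the ambiguity described above (the field layouts are made up for illustration, not actual Eth 2.0 schemas):

```python
# The same 52 bytes "parse" cleanly under two different schemas, because
# 'hash32' (32 bytes) and 'address' (20 bytes) carry no tag or prefix.
payload = bytes(32) + bytes(20)  # 52 opaque bytes off the wire

# Schema A: {'state_root': 'hash32', 'coinbase': 'address'}
state_root, coinbase = payload[:32], payload[32:]

# Schema B: {'sender': 'address', 'code_hash': 'hash32'}
sender, code_hash = payload[:20], payload[20:]

# Both interpretations consume all 52 bytes without error; only out-of-band
# type information says which one was intended.
```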

@paulhauner commented Sep 14, 2018

Hi @AlexeyAkhunov, thanks for the input.

I think it's fine that you cannot parse an SSZ message without already knowing its intended type.

We're sending Blocks and AttestationRecords across the network, so there aren't many different types we need to accommodate. Also, the nature of this application (a blockchain) means that all nodes already have very strong agreement on the data types and serialization format. Any change to the types or their serialization format is a "big deal" that all nodes must explicitly agree upon -- if a node changes either of these at any point, it will hash to different values and we will lose consensus. This is distinct from something like protobuf on some RPC, where it's fine to have different byte-level representations across versions as long as they describe the same object.

The message on the network is going to have some byte(s) to identify the message type as a block, attestation record, etc. Off the top of my head, accidentally mislabeling one of these objects and having it de-serialize as a valid, consensus-influencing object of a different type would be either impossible or similar in difficulty to finding several hash collisions.

I think it's important to note that there are at least two different requirements for serialization we're discussing:

  • Consensus serialization: how to serialize the bytes that get hashed and signed.
  • p2p serialization: how to serialize the bytes that get sent over the network.

What I have been discussing relates to consensus serialization. I think you're discussing p2p serialization @AlexeyAkhunov, which could be distinct from consensus serialization, but I don't see many gains in that -- especially considering that one of the first things a node will do with a block when it receives it is hash it according to its consensus serialization.
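
A minimal sketch of that flow in Python (the one-byte type framing and its value are hypothetical, not a specified format), assuming the p2p payload is exactly the consensus serialization:

```python
import hashlib

MSG_BLOCK = 0x01  # hypothetical message-type byte

def handle_message(wire_bytes: bytes) -> bytes:
    """Strip the type byte and hash the SSZ payload exactly as received."""
    msg_type, ssz_payload = wire_bytes[0], wire_bytes[1:]
    if msg_type != MSG_BLOCK:
        raise ValueError("unknown message type")
    # No re-serialization step: the received bytes are the consensus bytes.
    return hashlib.sha3_256(ssz_payload).digest()
```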

There's also a serialization format that will be involved with the RPC endpoints in the node, and I think that should be something like protobuf. However, that's not really being discussed at this stage.

@hwwhww added the ssz label Sep 14, 2018
@jannikluhn (Collaborator)

> We're sending Blocks and AttestationRecords across the network, so there aren't many different types we need to accommodate.

This is true for now, but probably not in the long term. Additional types, off the top of my head:

  • EVM Transactions
  • Potentially transactions targeting other execution engines
  • Shard blocks
  • Shard headers
  • State requests and responses for light clients
  • Proof of custody challenges and responses

@mratsim (Contributor) commented Sep 14, 2018

The alternatives we considered all have their trade-offs.

I'm not familiar with the field, but network analysis tools usually allow plugins, so we could add this to a bounty/wishlist: SSZ plugins for common open-source network analysis tools.

I.e. have a robust base layer and build tooling on top.

@AlexeyAkhunov (Author)

@paulhauner Thank you for your comment! I had the impression from the Eth 2.0 call on 13.09.2018 that simple_serialise had been proposed for p2p serialisation (since p2p serialisation was on the agenda).

As for consensus serialisation (serialising things before they get hashed and signed -- and I assume you are saying that deserialisation will not actually happen), one of the main requirements there is to avoid excessive dynamic memory allocation during serialisation. To achieve that, serialisers might want to compute the resulting serialisation size efficiently before starting to serialise. Alternatively, we could use sponge-based hash functions (like Keccak) that can consume streams of data as input. In that case, prepending the serialisation with a length prefix actually makes things harder, because you cannot start streaming until you have computed the whole serialised string.

Therefore, if simple_serialise is to be used for one-way serialisation (input to hash functions, so no deserialisation is required), I suggest dropping the length prefixes, because that would allow streaming input into sponge hash functions like Keccak.
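
A minimal sketch of the difference, using Python's SHA3-256 as a stand-in for a Keccak-family sponge (the 4-byte prefix width is illustrative):

```python
import hashlib

def hash_fields_streaming(fields):
    """Without a length prefix, each field is absorbed as soon as it exists."""
    h = hashlib.sha3_256()
    for field in fields:      # fixed-width fields, e.g. hash32, address
        h.update(field)       # no buffering of the full serialisation
    return h.digest()

def hash_fields_with_prefix(fields):
    """With a leading length prefix, everything must be buffered first."""
    body = b"".join(fields)                # buffer the whole message...
    prefix = len(body).to_bytes(4, "big")  # ...to learn its total length
    return hashlib.sha3_256(prefix + body).digest()
```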

@AlexeyAkhunov (Author)

@mratsim Yes, I agree: making the schema mandatory for serialisation/deserialisation has many problematic consequences. It is not memory efficient, as you said; it precludes generic analysis tools and libraries; it introduces potential for misinterpretation; and it makes backwards compatibility much harder to reason about. It also mingles the syntax and semantics of messages into one whole, inseparable into different layers of encoders/decoders.

@AlexeyAkhunov (Author)

RLP is actually a good serialisation format for wire encoding. I do not agree that it is overly complicated -- though I have seen some implementations that make it look complicated.

I do agree that RLP is not great for generating input to hashing. Therefore, I suggest using RLP for wire messages, but producing the inputs for hashes as simple serialisation streams without length prefixes (because length prefixes force us to buffer the input instead of simply starting to hash it). Since no one is going to undo the hashes and deserialise, length prefixes are not needed there.
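
For reference, a minimal RLP encoder sketch (byte strings and nested lists only); it shows why the wire encoding stays self-describing: every item carries its own length, so a generic parser can walk the structure without a schema:

```python
def rlp_encode(item):
    """Encode bytes or (nested) lists of bytes per the RLP rules."""
    if isinstance(item, bytes):
        if len(item) == 1 and item[0] < 0x80:
            return item                          # single small byte: as-is
        return _prefixed(item, 0x80)
    if isinstance(item, list):
        body = b"".join(rlp_encode(x) for x in item)
        return _prefixed(body, 0xc0)
    raise TypeError("RLP items are bytes or lists")

def _prefixed(body, offset):
    if len(body) <= 55:                          # short form
        return bytes([offset + len(body)]) + body
    n = len(body).to_bytes((len(body).bit_length() + 7) // 8, "big")
    return bytes([offset + 55 + len(n)]) + n + body
```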

@hwwhww (Contributor) commented Sep 18, 2018

@paulhauner Do you have a benchmark of SSZ vs. RLP?

@mkalinin

Do we want to support two serializations? Is the benefit of introducing e.g. SSZ big enough to justify maintaining both RLP and SSZ, or Protobuf and SSZ?

@paulhauner

@jannikluhn good point about future data types. Thanks.

@AlexeyAkhunov no worries, happy to talk about this. With regard to the SSZ length prefixes, I don't see a need to remove them before we hash. I think the entire SSZ byte array should be hashed.

@hwwhww we updated the serialization_sandbox notes with RLP message sizes :)

@mkalinin I am in favor of using SSZ for both p2p serialization and consensus serialization because it's pretty small, and I think one of the very first things we want to do with a block/attestation record when we get it off the wire is hash it. Having two formats would require some translation. I'm totally open to discussion on this, though.

@mkalinin

We are currently using RLP as it has already been implemented in EthereumJ. I agree that RLP is overcomplicated in terms of implementation, and I am open to changing it. But supporting several serializations makes no sense to me; it sounds even more complicated than implementing RLP.
So, it's OK to change the serialization algorithm, but the new one should handle all the cases and replace RLP all over the codebase. IMHO, if something we want to switch to doesn't fit some of the cases, then we should consider another algorithm.

@paulhauner

@mkalinin I totally agree.

@paulhauner

Hi @AlexeyAkhunov, I didn't make the connection between you and TurboGeth earlier. Knowing this, I'll give your proposal much more thought, as your knowledge of the subject is more extensive than mine.

@jannikluhn (Collaborator)

The two arguments in favor of using the same serialization format for networking and hashing are:

  • simplicity: two serialization formats are more complicated than one
  • performance: if the same format is used, a received packet can be hashed (almost) immediately, without an additional deserialize-and-reserialize step, which is non-negligible for less performant languages such as Python

I'm wondering if both arguments still hold if we go with the libp2p daemon: I assume libp2p has its own transport serialization format, and since the daemon is implemented in Go, that should not be our bottleneck either. It's also fully abstracted away from us, so it doesn't affect the simplicity of the Eth 2.0 client itself.

So the relevant question becomes: which serialization format do/should we use to communicate with the daemon? There we have different trade-offs, of course; in particular, we don't need to care about size as much, since those messages aren't transmitted over the network. @raulk, has this been specified already?

@mkalinin commented Oct 1, 2018

@AlexeyAkhunov could you please outline your thoughts on what requirements both algorithms should satisfy? Maybe you even have some algorithms we could take a look at? It would help us understand why, from your point of view, SSZ is not suitable for network serialization, and it would ground further discussion of alternatives.

@raulk commented Oct 1, 2018

@jannikluhn libp2p is agnostic to the wire format of messages. The control API of the daemon does use protobuf, but that should be well encapsulated inside the binding, e.g. py-libp2p-binding. In other words, protobuf is only used to exchange commands and control signals between the binding and the daemon: "open connection", "open stream", "new incoming stream", DHT operations, etc.

The application itself (Eth2.0 implementations) can select whatever wire format is best for them. The daemon just pipes raw reads and writes done on the local unix socket representing the stream, into reads and writes on the underlying libp2p stream.

@jannikluhn (Collaborator)

Thanks for clearing that up, @raulk! Disregard my post then, as we still have to decide on a wire format ourselves.

@fjl commented Nov 9, 2018

IMHO the network should relay consensus objects without re-encoding them. These objects can be represented as byte arrays in any serialization format. A network message defined as a protocol buffer can include a 'block' object as a blob.

To solve the issue that started this discussion, it would be best to define consensus objects as an ssz (or rlp) blob with a known header including a version number, e.g. a block would be encoded as "block-v1-..." with ... being the ssz blob.
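
A minimal sketch of such an envelope (the header layout just follows the "block-v1-..." example above; nothing here is specified):

```python
def wrap(kind: str, version: int, ssz_blob: bytes) -> bytes:
    """Prefix an opaque ssz blob with a self-describing type/version header."""
    return f"{kind}-v{version}-".encode() + ssz_blob

def unwrap(msg: bytes):
    kind, _, rest = msg.partition(b"-v")
    version, _, blob = rest.partition(b"-")
    return kind.decode(), int(version), blob

# Relays and analysis tools can now identify the object without its schema.
assert unwrap(wrap("block", 1, bytes(52))) == ("block", 1, bytes(52))
```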
