Encoding/decoding spec (rfc) #1
Comments
Looks good overall! Just a few thoughts:
@lightclient a few comments on data / file size now that I've done some measurements.
Overall I would mildly lean toward variable block counts, as it leads to fewer and uniformly-sized files, but I'm OK going with either approach if there is a preference for uniform.
@henridf it might be worth posting in the eth r&d discord for some opinions. Maybe even asking on ACD. It's one of those simple decisions that could have a relatively big impact, so we may as well get as many opinions on it as possible. I still don't think I have a strong opinion on it.
Do you know what format geth stores them in on disk? Last fall I did a
Again, an important decision, but I don't think I lean strongly in either direction. I sympathize most with minimizing the size so that the constraints we impose on memory aren't too large. But memory gets cheaper by the day and blocks bigger by the year, so it may be preferable to be more future-proof and err on the larger side. Would also be a good discord / ACD question.
I agree this is an important decision deserving broader exposure. I'd like to make a little progress on the accumulator and verification to see if that informs anything. (I'll be starting on that next, now that encoding is mostly implemented at https://github.com/henridf/eip44s-proto/tree/wip-v2.)
Yep, the vast majority are in the freezer, which uses snappy compression for these tables (https://github.com/ethereum/go-ethereum/blob/997f1c4f0abcd78f645e6e7ced6db4b42ad59c9d/core/rawdb/schema.go#L133). From compressing a few history files with gzip, I saw a 2.5x-ish compression gain, while geth inspect shows 335 GB as of recently. So the numbers line up pretty well.
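(For anyone who wants to reproduce that measurement, this is roughly all it takes; the file name below is hypothetical:)

```python
import gzip
import os
import shutil

src = "blocks-000123.ssz"  # an SSZ-encoded history/archive file
with open(src, "rb") as f_in, gzip.open(src + ".gz", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

ratio = os.path.getsize(src) / os.path.getsize(src + ".gz")
print(f"gzip compression ratio: {ratio:.2f}x")  # ~2.5x was observed on a few history files (see above)
```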
Makes sense! Thanks
Another argument (from @karalabe, relayed by @lightclient) for variable block counts and fixed file sizes: it avoids situations where a chain with many, many empty blocks leads to needless additional files.
An update / correction on this: with the revamped layout, it would be feasible to write a custom ssz streaming decoder for the archive files. Of course this is some one-off work wrt e.g. using the fastssz-generated marshalling, but once the format is settled it would be worthwhile. Most importantly, we probably don't need to worry so much about erring on the side of small files.
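(For context, a rough sketch of the streaming idea, assuming the file's top level is an SSZ list of variable-size entries; this is an illustration, not the decoder referred to above:)

```python
import struct


def iter_ssz_list(f, total_length: int):
    """Yield the elements of an SSZ List of variable-size items one at a time.

    `f` is a binary file object positioned at the start of the encoded list and
    `total_length` is its byte length, so the whole file never has to be held
    in memory at once.
    """
    if total_length == 0:
        return
    # Variable-size SSZ lists start with a table of 4-byte little-endian offsets.
    first_offset = struct.unpack("<I", f.read(4))[0]
    count = first_offset // 4
    offsets = [first_offset]
    for _ in range(count - 1):
        offsets.append(struct.unpack("<I", f.read(4))[0])
    offsets.append(total_length)
    for i in range(count):
        yield f.read(offsets[i + 1] - offsets[i])
```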
Outline for "EIP-4444 for pow blocks" (https://notes.ethereum.org/@ralexstokes/BJWd8saB9)
In summary, the proposal is to derive from the Beacon Chain's ExecutionPayload encoding to store blocks, and to extend it with uncle headers and a new ReceiptsPayload encoding (defined below) to store receipts. A double-batched accumulator is used to provide one root per block archive file.

Block archive file
A block archive file consists of a header and a body, which are individually encoded and concatenated to form the file.
Note: initially, both the header fields and the block were in the same container. Splitting them out has two advantages:
File encoding and decoding
The encoding tool will take a stream of exported blocks/receipts (RLP-encoded), encode each into the SSZ representations described below, and periodically "flush" a batch into a new ssz-encoded archive file. A simple approach would be to flush every N blocks, but it may be more practical to flush based on the size of the accumulated data, in order to keep the number of files small. In particular, if we set a constant number of blocks per file, we would have to choose that constant based on recent blocks, which are "large" (in order to reach a target file size). That would result in lots of smaller files in the earlier part of the chain history, where blocks are much smaller. I will experiment with this and propose something along with concrete numbers.
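As an illustration of the size-based approach, here is a minimal sketch; the helper names and the target size are hypothetical, not part of the proposal:

```python
TARGET_FILE_SIZE = 1 * 1024**3  # illustrative target: ~1 GiB of encoded data per file


def write_archives(entries, encode_ssz, flush):
    """Batch SSZ-encoded entries into archive files of roughly uniform size.

    `entries` yields blocks/receipts already converted to the SSZ containers
    described below; `encode_ssz` serializes one entry; `flush` writes a
    finished batch out as a new archive file. All three are assumed helpers.
    """
    batch, batch_size = [], 0
    for entry in entries:
        encoded = encode_ssz(entry)
        batch.append(encoded)
        batch_size += len(encoded)
        if batch_size >= TARGET_FILE_SIZE:
            flush(batch)  # size-based, so early (small) blocks still fill whole files
            batch, batch_size = [], 0
    if batch:
        flush(batch)  # final, possibly smaller, file
```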
Double-batched accumulator

The encoding tool is also responsible for computing and outputting the hash tree roots along the way. There is one root per archive file, which can be used as a proof for an entire file, but also to build granular proofs reaching all the way down to individual block values.
It does this via a double-batched accumulator similar to the beacon chain's BeaconState.historical_roots. The one difference is that each successive historical root does not necessarily accumulate the same (constant) number of block roots.
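For concreteness, a minimal sketch of that bookkeeping; `hash_tree_root` is assumed to come from an SSZ library (as in the consensus specs) and is not defined here:

```python
class DoubleBatchedAccumulator:
    """Collect block roots per archive file and fold each batch into one root."""

    def __init__(self):
        self.historical_roots = []  # one root per archive file written so far
        self.pending_roots = []     # block roots for the file currently being built

    def add_block_root(self, root: bytes) -> None:
        self.pending_roots.append(root)

    def flush_file(self) -> bytes:
        # Unlike BeaconState.historical_roots, the batch size here is not constant.
        file_root = hash_tree_root(self.pending_roots)  # assumed SSZ merkleization
        self.historical_roots.append(file_root)
        self.pending_roots = []
        return file_root
```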
Block

(This section is derived from the beacon chain's ExecutionPayload, though the resulting structs are different. Notably, uncles and receipts are present here.)

Custom types
Transaction         ByteList[MAX_BYTES_PER_TRANSACTION]
ExecutionAddress    Bytes20
Constants
MAX_BYTES_PER_TRANSACTION       uint64(2**30) (= 1,073,741,824)
MAX_TRANSACTIONS_PER_PAYLOAD    uint64(2**20) (= 1,048,576)
MAX_UNCLES_PER_PAYLOAD          10 (??)
BYTES_PER_LOGS_BLOOM            uint64(2**8) (= 256)
MAX_EXTRA_DATA_BYTES            2**5 (= 32)

Receipts payload
Constants
MAX_TOPICS_PER_LOG      uint8(2**2) (= 4)
MAX_LOGS_PER_PAYLOAD    uint64(2**20) (= 1,048,576)
BYTES_PER_LOGS_BLOOM    uint64(2**8) (= 256)
MAX_LOG_DATA_BYTES      uint64(2**22) (= 4,194,304)