---
eip: 2926
title: Chunk-Based Code Merkleization
author: Sina Mahmoodi (@s1na), Alex Beregszaszi (@axic)
discussions-to: https://ethereum-magicians.org/t/eip-2926-chunk-based-code-merkleization/4555
status: Draft
type: Standards Track
category: Core
created: 2020-08-25
requires: 161, 170, 2584
---

## Abstract

Code merkleization, along with the binarification of the trie and a gas cost increase for state-accessing opcodes, is considered one of the main levers for decreasing block witness sizes in stateless or partially stateless Eth1x roadmaps. Here we specify a fixed-size chunk approach to code merkleization, outline how the transition of existing contracts to this model would look, and pose some questions to be considered.

## Motivation

Bytecode is currently the [second-largest contributor](https://github.com/mandrigin/ethereum-mainnet-bin-tries-data) to block witness size, after the proof hashes. Transitioning the trie from hexary to binary reduces the hash section of the witness by 3x, thereby making code the largest contributor. By breaking contract code into chunks and committing to those chunks in a Merkle tree, stateless clients would only need the chunks that were touched during a given transaction to execute it.

## Specification

This specification assumes that [EIP-2584](https://eips.ethereum.org/EIPS/eip-2584) is deployed; the merkleization rules and gas costs are proposed accordingly. What follows is structured in three parts:

1. How a given contract code is split into chunks and then merkleized
2. How to merkleize all existing contract codes during a hardfork
3. What EVM changes are required to account for the new way of representing code

### Constants

- `CHUNK_SIZE`: 32 (bytes)
- `KEY_LENGTH`: 4 (bytes)
- `METADATA_KEY`: `0xffffffff`
- `METADATA_VERSION`: 0
- `EMPTY_CODE_ROOT`: `0xc5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470` (==`keccak256('')`)
- `HF_BLOCK_NUMBER`: to be defined

### Code merkleization

The procedure for merkleizing a contract's code starts with splitting it into chunks. To do this, assume a function that takes the bytecode as input and returns a list of chunks as output, where each chunk is a tuple `(startOffset, firstInstructionOffset, codeChunk)`; a sketch of such a function is given after the two passes below. *Note that `startOffset` corresponds to the absolute program counter, and `firstInstructionOffset` is the offset within the chunk to the first instruction, as opposed to data.* This function runs in two passes:

- **First Pass**: Divide the bytecode into parts of `CHUNK_SIZE` bytes each. The last part is allowed to be shorter. For each chunk, create the tuple, leaving `firstInstructionOffset` as `0`, and add the tuple to an array `chunks`. Also set `chunkCount` to the total number of chunks.

- **Second Pass**: Scan the bytecode for instructions with immediates or multi-byte instructions (currently only `PUSHN`, opcodes `0x60 .. 0x7f`). Let `PC` be the current opcode position and `N` the number of immediate bytes following, where `N > 0`. Then:

- **If** `(PC + N) % CHUNK_SIZE > PC % CHUNK_SIZE` (i.e. instruction does not cross chunk boundary), **continue** to next instruction.
- **Else**, compute the index of the chunk after the current instruction: `chunkIdx = Math.floor((PC + N + 1) / CHUNK_SIZE)`.
  - **If** `chunkIdx + 1 >= chunkCount` then **break** the loop. This handles the following case: in malformed bytecode, or in the "data section" at the end of a Solidity contract, there could be a byte *resembling* an instruction whose immediates would exceed the bytecode length.
- **Else**, assuming `nextChunk = chunks[chunkIdx]`, assign `(PC + N + 1) - nextChunk.startOffset` to `nextChunk.firstInstructionOffset`.

The `firstInstructionOffset` field allows safe jumpdest analysis when a client doesn't have all the chunks, e.g. a stateless client receiving block witnesses.
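
The following non-normative sketch in Python illustrates the two-pass chunking procedure described above (constant and field names follow this EIP; this is an illustration, not a reference implementation):

```python
from dataclasses import dataclass
from typing import List

CHUNK_SIZE = 32
PUSH1, PUSH32 = 0x60, 0x7F  # currently the only instructions with immediates

@dataclass
class Chunk:
    start_offset: int              # absolute PC of the chunk's first byte
    first_instruction_offset: int  # offset of the first instruction (vs. data) within the chunk
    code: bytes

def chunkify(code: bytes) -> List[Chunk]:
    # First pass: split into CHUNK_SIZE-byte parts; the last part may be shorter.
    chunks = [Chunk(i, 0, code[i:i + CHUNK_SIZE])
              for i in range(0, len(code), CHUNK_SIZE)]
    chunk_count = len(chunks)

    # Second pass: find PUSH immediates that cross a chunk boundary and set
    # firstInstructionOffset of the chunk they spill into.
    pc = 0
    while pc < len(code):
        op = code[pc]
        n = op - PUSH1 + 1 if PUSH1 <= op <= PUSH32 else 0  # number of immediate bytes
        if n > 0 and (pc + n) % CHUNK_SIZE <= pc % CHUNK_SIZE:
            # The immediate data crosses into a later chunk.
            chunk_idx = (pc + n + 1) // CHUNK_SIZE
            if chunk_idx + 1 >= chunk_count:
                # Malformed code or a trailing data section: stop, as per the spec.
                break
            next_chunk = chunks[chunk_idx]
            next_chunk.first_instruction_offset = (pc + n + 1) - next_chunk.start_offset
        pc += 1 + n
    return chunks
```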

To build a Merkle Patricia Tree we iterate `chunks` and process each entry, `C`, as follows:

- Let the MPT key `k` be `C.startOffset`, encoded as big-endian and zero-padded on the left to `KEY_LENGTH` bytes. Note that the code trie is not vulnerable to grinding attacks, and therefore the key is not hashed as in the account or storage trie.
- Let the MPT value `v` be `RLP([C.firstInstructionOffset, C.code])`.
- Insert `(k, v)` into the `codeTrie`.

We further insert a metadata leaf into `codeTrie`, with the key being `METADATA_KEY` and value being `RLP([METADATA_VERSION, codeHash, codeLength])`. This metadata is added mostly to facilitate a cheaper implementation of `EXTCODESIZE` and `EXTCODEHASH`. It includes a version to allow for differentiating between bytecode types (e.g. [EVM1.5/EIP-615](https://eips.ethereum.org/EIPS/eip-615), [EIP-2315](https://eips.ethereum.org/EIPS/eip-2315), etc.) or code merkleization schemes (or merkleization settings, such as larger `CHUNK_SIZE`) in the future.
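
For illustration, the following non-normative sketch builds the key/value pairs described above from the `Chunk` objects produced by the chunking sketch; `rlp_encode` and `keccak256` are stand-in callables for the client's own RLP and Keccak-256 routines (assumed parameters, not a specific library API):

```python
from typing import Callable, Dict, List

KEY_LENGTH = 4
METADATA_KEY = bytes.fromhex("ffffffff")
METADATA_VERSION = 0

def code_trie_leaves(code: bytes,
                     chunks: List["Chunk"],
                     rlp_encode: Callable[[list], bytes],
                     keccak256: Callable[[bytes], bytes]) -> Dict[bytes, bytes]:
    leaves: Dict[bytes, bytes] = {}
    for c in chunks:
        # Key: startOffset as big-endian, zero-padded on the left to KEY_LENGTH bytes.
        key = c.start_offset.to_bytes(KEY_LENGTH, "big")
        # Value: RLP([firstInstructionOffset, code chunk]).
        leaves[key] = rlp_encode([c.first_instruction_offset, c.code])
    # Metadata leaf, mainly to serve EXTCODESIZE and EXTCODEHASH cheaply.
    leaves[METADATA_KEY] = rlp_encode([METADATA_VERSION, keccak256(code), len(code)])
    return leaves  # to be inserted into codeTrie and merkleized per EIP-2584
```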

The root of this `codeTrie` is computed according to the [EIP-2584](https://eips.ethereum.org/EIPS/eip-2584) rules (similar to the account trie) and stored in the account record, replacing the current `codeHash` field (henceforth called `codeRoot`). For accounts with empty code (including externally owned accounts, aka EoAs), `codeRoot` has the value `EMPTY_CODE_ROOT` (as opposed to the root hash of an empty `codeTrie`) to maintain backwards compatibility.

### Updating existing code (transition process)

The transition process involves reading all contracts in the state and applying the above procedure to them. A benchmark showing how long this process will take is still pending, but intuitively it should take longer than the time between two blocks (on the order of hours). Hence we recommend that clients pre-process the changes before the EIP is activated.

Code has the nice property that it is (mostly) static. Therefore clients can maintain a mapping of `[accountAddress -> codeRoot]` which stores the results for the contracts they have already merkleized. During this pre-computation phase, whenever a new contract is created its `codeRoot` is computed, and added to the mapping. Whenever a contract self-destructs, its corresponding entry is removed.
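
A minimal sketch of such a pre-computation cache is given below; `merkleize_code` stands for the full chunking and trie-building procedure from the Specification (a hypothetical callable, not a defined API):

```python
from typing import Callable, Dict

class CodeRootCache:
    """Maintains the [accountAddress -> codeRoot] mapping during pre-computation."""

    def __init__(self, merkleize_code: Callable[[bytes], bytes]):
        self.merkleize_code = merkleize_code      # chunk + merkleize, returns codeRoot
        self.code_roots: Dict[bytes, bytes] = {}  # accountAddress -> codeRoot

    def on_contract_created(self, address: bytes, code: bytes) -> None:
        self.code_roots[address] = self.merkleize_code(code)

    def on_selfdestruct(self, address: bytes) -> None:
        self.code_roots.pop(address, None)
```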

At block `HF_BLOCK_NUMBER`, when the EIP is activated, and before executing any transaction, the client must update the account record of every contract with non-empty code in the state, replacing the `codeHash` field with the pre-computed `codeRoot`. *EoAs and accounts with empty code will keep their `codeHash` value as `codeRoot`.*

### Opcode and gas cost changes

How clients store the merkleized code is not directly part of consensus. However, the gas cost analysis in this section assumes a specific way of storing it: clients merkleize the code only once, during contract creation, to compute `codeRoot`, and then discard the chunks. They instead store the full bytecode as well as its hash and size in the database. Otherwise the gas scheme would have to be modified to charge per chunk accessed, or to conservatively charge for traversing the whole code trie. We believe per-chunk metering for calls would be more easily solved by witness metering in the stateless model.

#### Contract creation cost

Contract creation, whether initiated via `CREATE`, `CREATE2`, or an external transaction, executes an *initcode*, and the bytes returned by that initcode become the account code stored in the state. As of now there is a charge of 200 gas per byte of stored code. *We consider this cost to represent the long-term code storage cost, which could be reviewed under the "Stateless Ethereum" paradigm.*

We introduce an additional charge to account for the chunking and merkleization cost: `(ceil(len(code) / CHUNK_SIZE) + 1) * COST_PER_CHUNK`, where `COST_PER_CHUNK` is **TBD**. We also add a new fixed charge, `COST_OF_METADATA`, which equals the cost of `rlpEncode([0, codeHash, codeSize])`.

The exact value of `COST_PER_CHUNK` is to be determined, but we estimate it will be on the order of `CHUNKING_COST + (2 * HASHING_COST) + (ceil(log2(NUM_CHUNKS)) * TREE_BUILDING_COST)`, where `HASHING_COST` is 36 (`Gkeccak256 + Gkeccak256_word`), and we do not yet have estimates for `CHUNKING_COST` and `TREE_BUILDING_COST`. The hashing cost could be reduced if the [EIP-2666](https://github.com/ethereum/EIPs/pull/2666) proposal is accepted.

Preliminary cost comparison with `CHUNKING_COST=60`, `TREE_BUILDING_COST=30` and `COST_OF_METADATA=20` (note these values are *not* informed by benchmarks):

| | Pre-EIP | Post-EIP | Increase |
|-------------|---------|----------|----------|
| 1 byte | 200 | 353 | 76.50% |
| 16000 bytes | 3200000 | 3401422 | 6.29% |
| 24576 bytes | 4915200 | 5247428 | 6.75% |
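
As a worked, non-normative example, the two larger rows of the table can be reproduced from the formulas above using the same assumed constants (`CODE_DEPOSIT_COST` is a name introduced here for the existing 200 gas per byte charge):

```python
import math

CHUNK_SIZE = 32
CODE_DEPOSIT_COST = 200    # existing per-byte code deposit charge
HASHING_COST = 36          # Gkeccak256 + Gkeccak256_word
CHUNKING_COST = 60         # assumed, not benchmarked
TREE_BUILDING_COST = 30    # assumed, not benchmarked
COST_OF_METADATA = 20      # assumed, not benchmarked

def creation_code_cost(code_len: int) -> int:
    num_chunks = math.ceil(code_len / CHUNK_SIZE)
    cost_per_chunk = (CHUNKING_COST + 2 * HASHING_COST
                      + math.ceil(math.log2(num_chunks)) * TREE_BUILDING_COST)
    merkleization_cost = (num_chunks + 1) * cost_per_chunk + COST_OF_METADATA
    return code_len * CODE_DEPOSIT_COST + merkleization_cost

print(creation_code_cost(16000))  # 3401422 (pre-EIP: 3200000)
print(creation_code_cost(24576))  # 5247428 (pre-EIP: 4915200)
```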

#### CALL, CALLCODE, DELEGATECALL, STATICCALL, EXTCODESIZE, EXTCODEHASH, and EXTCODECOPY

We expect no gas cost change. Clients have to perform the same lookup as before.

#### CODECOPY and CODESIZE

There should be no change to `CODECOPY` and `CODESIZE`, because the code is already in memory, and the loading cost is paid by the transaction/call which initiated it.

### Other remarks

[EIP-161](https://eips.ethereum.org/EIPS/eip-161) defines:

> An account is considered empty when it has no code and zero nonce and zero balance.

We replace this statement with:

> An account is considered empty when its `codeRoot` equals `EMPTY_CODE_ROOT` and has zero nonce and zero balance.

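For clarity, a minimal sketch of the amended emptiness check (the `account` field names are illustrative, not a specific client's data model):

```python
EMPTY_CODE_ROOT = bytes.fromhex(
    "c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470")

def is_empty(account) -> bool:
    # Post-EIP emptiness check: code emptiness is judged via codeRoot.
    return (account.code_root == EMPTY_CODE_ROOT
            and account.nonce == 0
            and account.balance == 0)
```
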
## Rationale

### Hexary vs binary trie

The Ethereum mainnet state is currently encoded in a hexary Merkle Patricia Tree. As part of the Eth1x roadmap, a transition to a [binary trie](https://ethresear.ch/t/binary-trie-format/7621) has been [investigated](https://medium.com/@mandrigin/stateless-ethereum-binary-tries-experiment-b2c035497768) with the goal of reducing witness sizes. Because code chunks are also stored in the trie, this EIP would benefit from the witness-size reduction offered by binarification. Therefore we have decided to explicitly list [EIP-2584](https://eips.ethereum.org/EIPS/eip-2584) as a requirement of this change. Note that if code merkleization is included in a hard fork beforehand, then all code would have to be re-merkleized after the binary transition.

### Rationale behind chunk size

The current recommended chunk size of 32 bytes has been selected based on a few observations. Smaller chunks are more efficient (i.e. have higher [chunk utilization](https://ethresear.ch/t/some-quick-numbers-on-code-merkelization/7260/3)), but incur a larger hash overhead (i.e. number of hashes as part of the proof) due to a higher trie depth. Larger chunks are less efficient, but incur less hash overhead. We plan to run a larger experiment comparing various chunk sizes to arrive at a final recommendation.

### Different chunking logic

We have considered an alternative way of packaging chunks, where each chunk is prepended with its `chunkLength` and only contains complete instructions (i.e. any multi-byte instruction not fitting within `CHUNK_SIZE` would be deferred to the next chunk).

This approach has downsides compared to the one specified:
1) Requires a larger `CHUNK_SIZE` -- at least 33 bytes to accommodate the `PUSH32` instruction.
2) It is more wasteful. For example, `DUP1 PUSH32 <32-byte payload>` would be encoded as two chunks: the first chunk contains only `DUP1`, and the second contains only the `PUSH32` instruction with its payload.
3) The number of chunks cannot be derived trivially from the code length and would have to be stored explicitly in the metadata.

Additionally, we have reviewed several other options (basic-block-based chunking, Solidity subroutines (which require determining the control flow), EIP-2315 subroutines). This EIP, however, focuses only on the chunk-based option.

### The use of RLP

We decided on RLP because it is used extensively in Ethereum's data structures. Should there be a push for a different serialisation format, this proposal could easily be amended to use it. It should be noted that some have proposed (on the [Stateless Ethereum Call #8](https://github.com/ethereum/pm/issues/199)) considering a different format at the same time binarification is applied to the chain.

### Alternative to the metadata

Instead of encoding `codeHash` and `codeSize` in the metadata, they could be made part of the account. In our opinion, the metadata is the more concise option, because EoAs do not need these fields; making them part of the account would require either additional logic (to omit those fields for EoAs) or additional computation (to include them when merkleizing the account).

### Versioning

This proposal includes a version field (`METADATA_VERSION`) in the metadata, which can be used to distinguish between different merkleization rules as well as different kinds of bytecode.

An alternative option would be to move this versioning to the account level: either following [EIP-1702](https://eips.ethereum.org/EIPS/eip-1702), or by adding an `accountKind` field (with potential options: `eoa`, `merkleized_evm_chunk32`, `merkleized_eip2315_chunk64`, etc.) as the first member of the account. One benefit this could provide is omitting `codeHash` for EoAs.

### The keys in the code trie (and `KEY_LENGTH`)

As explained in the specification above, the keys in the code trie are bytecode offsets. Since valid bytecode offsets start at `0`, there is no natural way to represent a "special key" needed for the metadata.

Even though the current bytecode limit is `0x6000` (as per [EIP-170](https://eips.ethereum.org/EIPS/eip-170)), we have decided on a hard cap of `2^32-2` (influenced by [EIP-1985](https://eips.ethereum.org/EIPS/eip-1985)), which makes the proposal compatible with proposals increasing, or "lifting", the code size limit.

Therefore it is safe to use `0xffffffff` as the key for the metadata.

### Index-based alternative for the keys in the code trie

Since the chunks are fixed-size, knowing a chunk's index is enough to compute its `startOffset = chunk_idx * CHUNK_SIZE`. Therefore it is not strictly necessary to encode this information in the key. We are reviewing this alternative both for simplicity and for potentially lower tree-building costs. By using indices as keys, the structure of the trie is known from the number of chunks alone, which means less analysis. Additionally, the metadata could be included at index 0, removing the need for a large, fixed-size, zero-padded key.

### Alternate values of codeRoot for EoAs

This proposal changes the meaning of the fourth field (`codeHash`) of the account. Prior to this change, that field represents the Keccak-256 hash of the bytecode, which for EoAs is logically the hash of empty input.

Since `codeHash` is replaced with `codeRoot`, the root hash of the code trie, the value for EoAs would be different under the new rules: namely, the root of `codeTrie(metadata=[codeHash=keccak256(''), codeSize=0])`. An alternative would be to simply use the hash of an empty trie, or, to avoid introducing yet another constant (the result of the above), to use `codeRoot = 0` for EoAs.

However, we wanted to maintain compatibility with [EIP-1052](https://eips.ethereum.org/EIPS/eip-1052) and decided to keep matters simple by using the hash of empty input (`c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470`) for EoAs.

## Backwards Compatibility

From the perspective of contracts, the design aims to be transparent, with the exception of changes in gas costs.

Outside of the interface presented for contracts, this proposal introduces major changes to the way contract code is stored, and needs a substantial modification of the Ethereum state. Therefore it can only be introduced via a hard fork.

## Test Cases

TBD

Show the `codeRoot` for the following cases:

1. `code=''`
2. `code='PUSH1(0) DUP1 REVERT'`
3. `code='PUSH32(-1)'` (data passing through a chunk boundary)

## Implementation

The chunking and merkleization logic is implemented in TypeScript [here](https://github.com/ewasm/biturbo/blob/merklize-mainnet-blocks/src/relayer/bytecode.ts#L172) and in Python [here](https://github.com/hugo-dc/code-chunks/blob/master/main.py). Please note that neither of these examples currently uses a binary tree.

## Security Considerations

TBA

## Copyright
Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).
