
Generating data availability roots from shard block roots #1438

Open
vbuterin opened this issue Oct 19, 2019 · 2 comments

@vbuterin (Contributor) commented Oct 19, 2019

In a beacon block, we have N data roots r[1] ... r[N] (in the worst case, N = 256), each of which represents data d[1] ... d[N] of different sizes: on average 128 kB, in the worst case 512 kB. We want to combine these roots into a combined root R = f(r[1] ... r[N]) which is a Merkle root of some kind of concatenation of d[1] ... d[N].

The main desiderata are:

  • R can be computed from r[1] ... r[N] either directly or with only a small amount of auxiliary data
  • Minimal "wasted space" in R (consider that if we force all r[i] to have the same height, on average there will be 75% wasted space)
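
As a quick illustration of the wasted-space point, using the average and worst-case sizes above and padding every d[i] up to the worst-case size:

AVG_SIZE = 128 * 1024  # average shard data size (from above)
MAX_SIZE = 512 * 1024  # worst-case shard data size (from above)

# Forcing every r[i] to the height of a MAX_SIZE tree means an average-sized
# d[i] fills only a quarter of the leaves under its root:
wasted_fraction = 1 - AVG_SIZE / MAX_SIZE
print(wasted_fraction)  # 0.75, i.e. the 75% average wasted space mentioned above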

Possibilities I see:

  1. Allow r[i] to have a different height depending on its size, so there is ~30% wasted space on average (49% maximum). Then create R by combining all r[i] values of each size, adding a bit of padding if needed, and then combining the different sizes. The main weakness is that variable heights are not SSZ-friendly.
  2. Expose a maximum of 4 data sub-roots, where r[i] = root(r[i][0], r[i][1], r[i][2], r[i][3]). This reduces wasted space even further and is cleaner (including being SSZ-compliant), but increases the number of hashes per block from 1 to 1.5 on average (4 maximum), increasing beacon chain load by 1.2x in the average case and 2.33x in the worst case (see the sketch after the example algorithm below).
  3. Expose 3 data sub-roots, where r[i][0] is the root of the first half of the data if the data is more than half the max size and otherwise zero, r[i][1] is the root of the first quarter of the data not covered by r[i][0], and r[i][2] is the root of the first eighth of the data not covered by r[i][1]. This reduces wasted space heavily (maybe ~10% wasted), but increases beacon chain load ~1.89x and is more complicated (including the non-SSZ-friendliness from (1)).

For (1) here is an example algorithm:

# `List`, `Hash` and `hash` below follow the spec conventions (32-byte hash type, SHA-256)
def combine_roots(leaf_roots: List[Hash], sizes: List[int]) -> Hash:
    roots: List[Hash] = []    # working set of subtree roots at the current level
    cur_size = 1              # data size covered by one subtree at the current level
    zero_hash = b'\x00' * 32  # zero-subtree root at the current level
    while cur_size < max(sizes) * 2 or len(roots) > 1:
        # Pull in the leaf roots whose data size places them at this level
        roots += [leaf_roots[i] for i in range(len(leaf_roots)) if cur_size // 2 < sizes[i] <= cur_size]
        # Pad to an even count with this level's zero root, then merge pairs to go up one level
        if len(roots) % 2:
            roots.append(zero_hash)
        roots = [hash(roots[i] + roots[i + 1]) for i in range(0, len(roots), 2)]
        zero_hash = hash(zero_hash + zero_hash)
        cur_size *= 2
    return roots[0]
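
For comparison, here is a minimal sketch of what (2) could look like, using the same List/Hash/hash conventions as above and assuming each sub-root covers up to a quarter of the worst-case data size; zero_subtree_root is a hypothetical helper returning the root of an all-zero subtree of the given size:

QUARTER_SIZE = 512 * 1024 // 4  # each sub-root covers up to a quarter of the worst-case size

def data_root_from_subroots(subroots: List[Hash]) -> Hash:
    # r[i] = root(r[i][0], r[i][1], r[i][2], r[i][3]); only the 1-4 sub-roots the data
    # actually needs are exposed, and the remaining quarters are zero subtrees.
    assert 1 <= len(subroots) <= 4
    padded = subroots + [zero_subtree_root(QUARTER_SIZE)] * (4 - len(subroots))  # hypothetical helper
    return hash(hash(padded[0] + padded[1]) + hash(padded[2] + padded[3]))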

For (2), here are concrete estimates of data overhead. With 64 shards, a MAX_CATCHUP_RATIO of 4, and 4x slack for different attestations, there are a maximum of 1024 data roots in a block, plus 1024 state roots and 1024 lengths (that's 32 + 32 + 8 = 72 kB). Proposal (2) would increase the former to 1536 data roots in the average case (48 + 32 + 8 = 88 kB) and 4096 in the worst case (128 + 32 + 8 = 168 kB). Note that average-case figures are expected to be ~8x less than these worst-case figures.
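
Spelling out the arithmetic behind those figures (numbers taken directly from the estimates above):

MAX_ROOTS = 64 * 4 * 4          # 64 shards * MAX_CATCHUP_RATIO of 4 * 4x attestation slack = 1024
ROOT_SIZE, LENGTH_SIZE = 32, 8  # bytes per root / per length

status_quo = MAX_ROOTS * (ROOT_SIZE + ROOT_SIZE + LENGTH_SIZE)                       # data roots + state roots + lengths
avg_case = int(MAX_ROOTS * 1.5) * ROOT_SIZE + MAX_ROOTS * (ROOT_SIZE + LENGTH_SIZE)  # 1.5 data roots per item
worst_case = (MAX_ROOTS * 4) * ROOT_SIZE + MAX_ROOTS * (ROOT_SIZE + LENGTH_SIZE)     # 4 data roots per item

print(status_quo // 1024, avg_case // 1024, worst_case // 1024)  # 72 88 168 (kB)
print(avg_case / status_quo, worst_case / status_quo)            # ~1.22x and ~2.33x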

I am currently favoring (2) but open to other ideas.

@dankrad (Contributor) commented Oct 20, 2019

I think (2) sounds very reasonable. Could we save a lot of beacon chain load by making the target shard block size just a little bit smaller than 128 kB, say something in the range of 100-120 kB? If we expect that the distribution around the target does not have a huge variance, then this would make far fewer of the second data roots necessary (not sure if there is an obvious way to get the expected distribution from EIP1559?)

@vbuterin (Contributor, Author) commented:

I do expect the distribution to have a considerable variance! Sure, if you assume it's a normal distribution: on average a block has ~120 txs, so the standard deviation should be ~10; but that ignores uneven sizes of transactions, multi-transactions, etc. If I had to guess, I'd expect the 25-75 confidence interval to be ~0.6-1.5 times the average.
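
A quick sanity check on the ~10 figure, assuming it comes from treating the per-block transaction count as roughly Poisson (sqrt-of-count noise; an assumption, not something stated above):

import math

avg_txs = 120
count_std = math.sqrt(avg_txs)  # ~11, close to the ~10 quoted above
print(count_std / avg_txs)      # ~0.09: count noise alone is a small relative spread,
                                # so most of the variance would come from uneven tx sizes etc.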
