
Dynamic state snapshots #20152
Merged: 28 commits, Mar 23, 2020
Conversation

@karalabe (Member) commented Oct 4, 2019

Note, this PR is semi-experimental work. All code included has been extensively tested on live nodes, but it is very, very sensitive code. As such, the PR hides the included logic behind the --snapshot flag. We've decided to merge it to get the code onto master, as it's already closing in on the 6-month development mark.


This PR creates a secondary data structure for storing the Ethereum state, called a snapshot. This snapshot is special as it dynamically follows the chain and can also handle small-ish reorgs:

  • At the very bottom, the snapshot consists of a disk layer, which is essentially a semi-recent full flat dump of the account and storage contents. This is stored in LevelDB as a <hash> -> <account> mapping for the account trie and an <account-hash><slot-hash> -> <slot-value> mapping for the storage tries. The layout permits fast iteration over the accounts and storage, which will be used for a new sync algorithm (see the key-layout sketch after this list).
  • Above the disk layer there is a tree of in-memory diff layers that each represent one block's worth of state mutations. Every time a new block is processed, it is linked on top of the existing diff tree, and the bottom layers are flattened together to keep the maximum tree depth reasonable. At the very bottom, the first diff layer acts as an accumulator which only gets flattened into the disk layer when it outgrows its memory allowance. This is done mostly to avoid thrashing LevelDB.
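For concreteness, a minimal sketch of the flat key scheme described above. The prefix bytes are placeholders, not the exact constants from core/rawdb (which provides helpers such as ReadAccountSnapshot and SnapshotStoragePrefix):

package main

import "fmt"

// Placeholder prefixes; the real values live in core/rawdb.
var (
    accountSnapshotPrefix = []byte("a")
    storageSnapshotPrefix = []byte("o")
)

// accountSnapshotKey maps an account hash to its flat account entry.
func accountSnapshotKey(accountHash [32]byte) []byte {
    return append(append([]byte{}, accountSnapshotPrefix...), accountHash[:]...)
}

// storageSnapshotKey maps (account hash, slot hash) to a flat storage entry.
// Prefixing with the account hash keeps all slots of one account contiguous,
// which is what makes sequential iteration cheap.
func storageSnapshotKey(accountHash, slotHash [32]byte) []byte {
    key := append(append([]byte{}, storageSnapshotPrefix...), accountHash[:]...)
    return append(key, slotHash[:]...)
}

func main() {
    var acc, slot [32]byte
    acc[0], slot[0] = 0xaa, 0x01
    fmt.Printf("account key: %x\n", accountSnapshotKey(acc))
    fmt.Printf("storage key: %x\n", storageSnapshotKey(acc, slot))
}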

The snapshot can be built fully online, during the live operation of a Geth node. This is harder than it seems, because rebuilding the snapshot for mainnet takes 9 hours, during which in-memory garbage collection will long since have deleted the state needed for a single consistent capture.

  • The PR achieves this by gradually iterating the state tries and maintaining a marker for the account/storage slot position up to which the snapshot has already been generated. Every time a new block is executed, state mutations prior to the marker get applied directly (the ones after it get discarded) and the snapshot builder switches to iterating the new root hash (a minimal sketch of this marker check follows below).
  • To handle reorgs, the builder operates on HEAD-128 and is capable of suspending/resuming if a state is missing (a restart will only write out some tries, not everything cached in memory).
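A minimal sketch of the marker check mentioned above, assuming a lexicographically ordered generator position (the helper name is made up for illustration):

package main

import (
    "bytes"
    "fmt"
)

// applyBelowMarker decides whether a flat-state mutation for key must be
// persisted: everything at or before the generator marker has already been
// emitted, so it has to be updated; everything after it can be discarded,
// since the generator will reach that position later anyway.
func applyBelowMarker(genMarker, key []byte) bool {
    if genMarker == nil {
        return true // generation finished, the whole keyspace is covered
    }
    return bytes.Compare(key, genMarker) <= 0
}

func main() {
    marker := []byte{0x80} // hypothetical current generator position
    fmt.Println(applyBelowMarker(marker, []byte{0x10})) // true: already generated
    fmt.Println(applyBelowMarker(marker, []byte{0xf0})) // false: not yet reached
}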

The benefit of the snapshot is that it acts as an acceleration structure for state accesses:

  • Instead of doing O(log N) disk reads (plus LevelDB overhead) to access an account or storage slot, the snapshot can provide direct, O(1) access time. This should be a small improvement in block processing and a huge improvement in eth_call evaluations (see the layered-lookup sketch after this list).
  • The snapshot supports account and storage iteration at O(1) complexity per entry plus sequential disk access, which should enable remote nodes to retrieve state data significantly more cheaply than before (the sort order is the state trie leaf order, so responses can directly be assembled into tries too).
  • The presence of the snapshot can also enable more exotic use cases such as deleting and rebuilding the entire state trie (guerilla pruning) as well as building an alternative state trie format (e.g. binary vs. hexary), which might be needed in the future.
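A toy sketch of the layered lookup, using assumed names rather than the PR's actual types: a read checks the newest diff layer, walks down through parents, and finally falls through to the flat disk layer with a single key-value read.

package main

import "fmt"

type layer interface {
    account(hash string) ([]byte, bool)
}

// diskLayer stands in for the flat LevelDB dump.
type diskLayer struct{ flat map[string][]byte }

func (dl *diskLayer) account(hash string) ([]byte, bool) {
    blob, ok := dl.flat[hash]
    return blob, ok
}

// diffLayer holds one block's worth of mutations on top of a parent layer.
type diffLayer struct {
    parent   layer
    accounts map[string][]byte
}

func (dl *diffLayer) account(hash string) ([]byte, bool) {
    if blob, ok := dl.accounts[hash]; ok {
        return blob, true
    }
    return dl.parent.account(hash) // not touched in this block, ask the layer below
}

func main() {
    disk := &diskLayer{flat: map[string][]byte{"0xaaa": []byte("old")}}
    head := &diffLayer{parent: disk, accounts: map[string][]byte{"0xbbb": []byte("new")}}
    fmt.Println(head.account("0xbbb")) // served by the diff layer
    fmt.Println(head.account("0xaaa")) // falls through to the disk layer
}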

The downside of the snapshot is that the raw account and storage data is essentially duplicated. In the case of mainnet, this means an extra 15GB of SSD space used.

}
// Cache doesn't contain account, pull from disk and cache for later
blob := rawdb.ReadAccountSnapshot(dl.db, hash)
dl.cache.Set(key, blob)
Contributor:
I'm torn on whether we really should cache nil-items here...

@karalabe (Member Author) commented Oct 10, 2019 via email

@holiman (Contributor) commented Oct 12, 2019

Another couple of points,

  • Right now, each block modifies roughly 2000 items (accounts + storage), and we have 128 layers.
  • Each layer represents one block, whereas the bottom layer (before disk) represents 200 blocks.
  • That means we'll have 128 layers of size N, and 1 layer of size 200 * N.
  • Eventually, we'll flush the bottom-most layer, and at that point we essentially have 128 blocks in memory (going down from e.g. 350).

Now, if we have M bytes of memory available, it seems to me that it would make more sense to have a gradual slope of memory usage, instead of N, N, ... 200 * N. Instead have N, N, 2N, 2N, 4N ... 64N (totalling 254N in this example; see the quick calculation below). The consequence would be the following:

  • Upside: When flushing the lowest layer, we'd not lose the majority of blocks from memory, only a smaller portion. So smaller fluctuations in performance.
  • Upside: When accessing an item, we'd not iterate through 128 layers, only ~14.
  • Downside: When reorging, or accessing a particular block state, we'd potentially have to re-execute some blocks (maybe up to 63). However, if tuned well, this would not happen on mainnet.
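A quick back-of-the-envelope comparison of the two layouts, in units of N (one block's worth of mutations); the numbers are only meant to illustrate the totals quoted above:

package main

import "fmt"

func main() {
    // Current layout: 128 diff layers of size N plus a ~200N accumulator.
    flat := 128 + 200

    // Proposed slope: layer sizes N, N, 2N, 2N, 4N, 4N, ..., 64N, 64N.
    slope, size := 0, 1
    for i := 0; i < 14; i++ {
        slope += size
        if i%2 == 1 {
            size *= 2
        }
    }
    fmt.Println("flat layout:  ", flat, "N")  // 328 N
    fmt.Println("sloped layout:", slope, "N") // 254 N
}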

dl.lock.Lock()
defer dl.lock.Unlock()

dl.parent = parent.flatten()
Member Author:
We might need some smarter locking here. If the parents have side branches, those don't get locked by the child's lock.

// Snapshot represents the functionality supported by a snapshot storage layer.
type Snapshot interface {
// Info returns the block number and root hash for which this snapshot was made.
Info() (uint64, common.Hash)
Contributor:
It would be 'nicer' if Info returned the span of blocks that a snapshot represents, and not just the last block.

Member Author:
I'm actually thinking of nuking the whole number tracking. Clique gets messy because roots are not unique across blocks. We could make it root + number, but that would entail statedb needing to add block numbers to every single method, which gets messy fast.

Alternatively, we can just not care about block number and instead just track child -> parent hierarchies. This would mean that we could end up with a lot more state "cached" in memory than 128 if there are ranges of empty clique blocks, but those wouldn't add any new structs, just keep the existing ones around for longer, so should be fine.
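A bare-bones sketch of that child -> parent tracking idea (illustrative names, not the PR's types): layers are keyed purely by state root, and each diff layer just remembers its parent root, with no block numbers involved.

package main

import "fmt"

type rootHash string // stands in for common.Hash

type diffLayer struct {
    root   rootHash
    parent rootHash
}

type tree struct {
    layers map[rootHash]*diffLayer
}

// add links a new block's diff layer on top of an existing parent root. With
// Clique, repeated roots for empty blocks simply map onto the same entry, so
// no extra structs are created, they just stay around for longer.
func (t *tree) add(r, parent rootHash) {
    t.layers[r] = &diffLayer{root: r, parent: parent}
}

// path walks the child -> parent chain until an unknown root (the disk layer).
func (t *tree) path(r rootHash) []rootHash {
    var out []rootHash
    for {
        layer, ok := t.layers[r]
        if !ok {
            return out
        }
        out = append(out, r)
        r = layer.parent
    }
}

func main() {
    t := &tree{layers: map[rootHash]*diffLayer{}}
    t.add("r1", "disk")
    t.add("r2", "r1")
    fmt.Println(t.path("r2")) // [r2 r1]
}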

func (dl *diffLayer) Journal() error {
dl.lock.RLock()
defer dl.lock.RUnlock()

Contributor:
Perhaps we should check dl.stale here and error out if set

Contributor:
actually, probably better to do it in journal, since one of the parents might be stale, not just this level

Comment on lines 232 to 234
// If we still have diff layers below, recurse
if parent, ok := diff.parent.(*diffLayer); ok {
return st.cap(parent, layers-1, memory)
Contributor:
Now that it's standalone and not internal to a difflayer, would be nicer to iterate instead of recurse, imo
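For illustration, a self-contained toy (assumed types) of the iterative walk being suggested: follow parent pointers in a loop instead of recursing one call per diff layer.

package main

import "fmt"

type diffLayer struct {
    root   string
    parent *diffLayer // nil stands in for the disk layer
}

// bottomDiff descends at most `layers` levels and returns the layer reached,
// replacing the recursive st.cap(parent, layers-1, memory) call chain with a loop.
func bottomDiff(diff *diffLayer, layers int) *diffLayer {
    for ; layers > 0 && diff.parent != nil; layers-- {
        diff = diff.parent
    }
    return diff
}

func main() {
    l1 := &diffLayer{root: "r1"}
    l2 := &diffLayer{root: "r2", parent: l1}
    l3 := &diffLayer{root: "r3", parent: l2}
    fmt.Println(bottomDiff(l3, 2).root) // r1
}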

}
writer = file
}
// Everything below was journalled, persist this layer too
@holiman (Contributor) commented Oct 23, 2019:
Suggested change
-// Everything below was journalled, persist this layer too
+if dl.stale {
+    return nil, ErrSnapshotStale
+}
+// Everything below was journalled, persist this layer too

}
// If we haven't reached the bottom yet, journal the parent first
if writer == nil {
file, err := dl.parent.(*diffLayer).journal()
Contributor:
Right now, we obtain the read lock in Journal -- but here we're calling the parent journal directly, bypassing the lock-taking.
We should remove the locking from Journal and do it in this method instead, right after the parent is done writing.
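A self-contained sketch (toy types, not the PR's code) of the locking order proposed here: journal the parent without holding this layer's lock, then take the read lock only while writing this layer's own entries.

package main

import (
    "bytes"
    "fmt"
    "sync"
)

type diffLayer struct {
    lock     sync.RWMutex
    parent   *diffLayer // nil stands in for the disk layer
    accounts map[string][]byte
}

func (dl *diffLayer) journal(buf *bytes.Buffer) error {
    // Journal the parent first, without holding our own lock;
    // each layer locks itself only for its own write.
    if dl.parent != nil {
        if err := dl.parent.journal(buf); err != nil {
            return err
        }
    }
    // Only now take the read lock, just for writing this layer's data.
    dl.lock.RLock()
    defer dl.lock.RUnlock()
    for hash, blob := range dl.accounts {
        fmt.Fprintf(buf, "%s=%x\n", hash, blob)
    }
    return nil
}

func main() {
    parent := &diffLayer{accounts: map[string][]byte{"a1": {0x01}}}
    child := &diffLayer{parent: parent, accounts: map[string][]byte{"a2": {0x02}}}
    var buf bytes.Buffer
    if err := child.journal(&buf); err != nil {
        panic(err)
    }
    fmt.Print(buf.String())
}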


// If the layer is being generated, ensure the requested hash has already been
// covered by the generator.
if dl.genMarker != nil && bytes.Compare(key, dl.genMarker) > 0 {
Contributor:
This seems to work, but would be more obvious with the following change:

Suggested change
-if dl.genMarker != nil && bytes.Compare(key, dl.genMarker) > 0 {
+if dl.genMarker != nil && bytes.Compare(accountHash[:], dl.genMarker) > 0 {

core/state/snapshot/disklayer_generate.go (outdated, resolved)
@holiman (Contributor) commented Nov 27, 2019

I don't understand... At the end of this method, we return a new diskLayer. Who remembers these accounts that were left stranded in this layer and not copied to disk?

Answering myself: the new diskLayer doesn't contain it, so the caller will later have to resolve it from the trie, which is fine... I think?


speed := done/uint64(time.Since(gs.start)/time.Millisecond+1) + 1 // +1s to avoid division by zero
ctx = append(ctx, []interface{}{
"eta", common.PrettyDuration(time.Duration(left/speed) * time.Millisecond),
Contributor:
Could you add a percentage too? I think that's a pretty user-friendly thing to have. Basically 100 * binary.BigEndian.Uint32(marker[:4]) / uint32(-1).
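A small sketch of that suggestion, treating the first four bytes of the generation marker as a position within the 2^32 keyspace (names are illustrative):

package main

import (
    "encoding/binary"
    "fmt"
)

// progressPercent converts a generator marker into a rough completion
// percentage, relying on account hashes being evenly distributed over the keyspace.
func progressPercent(marker []byte) float64 {
    if len(marker) < 4 {
        return 0
    }
    return 100 * float64(binary.BigEndian.Uint32(marker[:4])) / float64(^uint32(0))
}

func main() {
    marker := []byte{0x80, 0x00, 0x00, 0x00} // hypothetical halfway marker
    fmt.Printf("generated: %.2f%%\n", progressPercent(marker))
}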

@@ -468,6 +466,10 @@ func (s *StateDB) updateStateObject(obj *stateObject) {

// If state snapshotting is active, cache the data til commit
if s.snap != nil {
// If the account is an empty resurrection, unmark the storage nil-ness
if storage, ok := s.snapStorage[obj.addrHash]; storage == nil && ok {
Contributor:
I'm not convinced this is correct. There are a couple of things that can happen:

  • Old code (before CREATE2): A contract with storage is selfdestructed in tx n. In tx n+1, someone sends a wei to the address, and the account is recreated. The desired end-state is that the storage has become nil and the account exists.

  • New code, with CREATE2: A contract is killed in tx n. In tx n+1, the contract is recreated, and the initcode sets new storage slots. So the old storage slots are all cleared, and there are now new storage slots set. We need to handle this (we don't currently; the nil-marker convention involved is sketched below).
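For reference, a toy model (assumed names, not the PR's code) of the nil-marker convention the quoted check relies on: a nil entry in snapStorage means "this account's storage was wiped", a missing entry means "nothing recorded yet", and a resurrection must replace the nil marker before new slots are written.

package main

import "fmt"

func main() {
    snapStorage := map[string]map[string][]byte{}

    // Selfdestruct in tx n: mark the whole storage as deleted.
    snapStorage["0xdead"] = nil

    // Resurrection in tx n+1: unmark the nil-ness before recording new slots,
    // mirroring the check in updateStateObject above.
    if storage, ok := snapStorage["0xdead"]; storage == nil && ok {
        snapStorage["0xdead"] = map[string][]byte{}
    }
    snapStorage["0xdead"]["slot1"] = []byte{0x01}
    fmt.Println(snapStorage["0xdead"]) // map[slot1:[1]]
}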

@gballet (Member) left a comment:
LGTM, a couple comments here and there, as well as a question about future-proofing the PR.

Comment on lines +101 to +109
Account(hash common.Hash) (*Account, error)

// AccountRLP directly retrieves the account RLP associated with a particular
// hash in the snapshot slim data format.
AccountRLP(hash common.Hash) ([]byte, error)

// Storage directly retrieves the storage data associated with a particular hash,
// within a particular account.
Storage(accountHash, storageHash common.Hash) ([]byte, error)
Member:

This isn't a showstopper for merging the PR, merely a question regarding future evolutions of Ethereum: there is an EIP (haven't found it yet) that suggests merging the account and storage tries. This kept recurring in stateless 1.x discussions. It would be a good idea to have a more generic method like GetBlobAtHash(f) interface{}, taking a function f to deserialize the blob.
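A rough sketch of what such a generic accessor could look like (hypothetical, not part of this PR): the layer hands back the raw blob at a hash and the caller supplies the deserializer, so a future merged account/storage trie would not need new interface methods.

package main

import (
    "errors"
    "fmt"
)

// hash stands in for common.Hash in this self-contained sketch.
type hash [32]byte

// blobSnapshot is a hypothetical generic layer interface.
type blobSnapshot interface {
    GetBlobAtHash(h hash, decode func([]byte) (interface{}, error)) (interface{}, error)
}

type memSnapshot struct{ blobs map[hash][]byte }

var _ blobSnapshot = (*memSnapshot)(nil)

func (s *memSnapshot) GetBlobAtHash(h hash, decode func([]byte) (interface{}, error)) (interface{}, error) {
    blob, ok := s.blobs[h]
    if !ok {
        return nil, errors.New("not found")
    }
    return decode(blob)
}

func main() {
    var h hash
    snap := &memSnapshot{blobs: map[hash][]byte{h: []byte("rlp-bytes")}}
    // The caller decides how to interpret the blob (e.g. RLP-decode an account).
    out, err := snap.GetBlobAtHash(h, func(b []byte) (interface{}, error) { return string(b), nil })
    fmt.Println(out, err)
}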

//
// Note, the method is an internal helper to avoid type switching between the
// disk and diff layers. There is no locking involved.
Parent() snapshot
Member:
if it's an internal helper, then it shouldn't be public

// The goal of a state snapshot is twofold: to allow direct access to account and
// storage data to avoid expensive multi-level trie lookups; and to allow sorted,
// cheap iteration of the account/storage tries for sync aid.
type Tree struct {
Member:
Change name to match comment, or vice-versa.

@karalabe karalabe modified the milestones: 1.9.12, 1.9.13 Mar 16, 2020
* core/state/snapshot/iterator: fix two disk iterator flaws

* core/rawdb: change SnapshotStoragePrefix to avoid prefix collision with preimagePrefix
@karalabe karalabe merged commit 613af7c into ethereum:master Mar 23, 2020
@fruor commented Apr 17, 2020

As it's now released, wouldn't it make sense to update the Command Line Options wiki? The --snapshot switch is completely missing there

@karalabe (Member Author):
@fruor No, the feature will be on by default soon-ish. We just don't yet want people to use it, as it might still change a bit. It's there if someone wants to test it (e.g. our benchmarker runs).
