
Dynamic state snapshots #20152
Merged: 28 commits, Mar 23, 2020
Conversation

@karalabe (Member) commented Oct 4, 2019

Note, this PR is semi-experimental work. All code included has been extensively tested on live nodes, but it is very, very sensitive code. As such, the PR hides the included logic behind the --snapshot flag. We've decided to merge it to get the code onto master, as it's already closing in on the 6-month development mark.


This PR creates a secondary data structure for storing the Ethereum state, called a snapshot. This snapshot is special as it dynamically follows the chain and can also handle small-ish reorgs:

  • At the very bottom, the snapshot consists of a disk layer, which is essentially a semi-recent full flat dump of the account and storage contents. This is stored in LevelDB as a <hash> -> <account> mapping for the account trie and an <account-hash><slot-hash> -> <slot-value> mapping for the storage tries. The layout permits fast iteration over the accounts and storage, which will be used for a new sync algorithm (see the key-layout sketch after this list).
  • Above the disk layer there is a tree of in-memory diff layers that each represent one block's worth of state mutations. Every time a new block is processed, it is linked on top of the existing diff tree, and the bottom layers are flattened together to keep the maximum tree depth reasonable. At the very bottom, the first diff layer acts as an accumulator which only gets flattened into the disk layer when it outgrows its memory allowance. This is done mostly to avoid thrashing LevelDB.
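For concreteness, a minimal sketch of the flat key scheme described above. The prefix bytes are placeholders, not the exact constants from core/rawdb (which provides helpers such as ReadAccountSnapshot and SnapshotStoragePrefix):

package main

import "fmt"

// Placeholder prefixes; the real values live in core/rawdb.
var (
    accountSnapshotPrefix = []byte("a")
    storageSnapshotPrefix = []byte("o")
)

// accountSnapshotKey maps an account hash to its flat account entry.
func accountSnapshotKey(accountHash [32]byte) []byte {
    return append(append([]byte{}, accountSnapshotPrefix...), accountHash[:]...)
}

// storageSnapshotKey maps (account hash, slot hash) to a flat storage entry.
// Prefixing with the account hash keeps all slots of one account contiguous,
// which is what makes sequential iteration cheap.
func storageSnapshotKey(accountHash, slotHash [32]byte) []byte {
    key := append(append([]byte{}, storageSnapshotPrefix...), accountHash[:]...)
    return append(key, slotHash[:]...)
}

func main() {
    var acc, slot [32]byte
    acc[0], slot[0] = 0xaa, 0x01
    fmt.Printf("account key: %x\n", accountSnapshotKey(acc))
    fmt.Printf("storage key: %x\n", storageSnapshotKey(acc, slot))
}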

The snapshot can be built fully online, during the live operation of a Geth node. This is harder than it seems, because rebuilding the snapshot for mainnet takes 9 hours, during which in-memory garbage collection will long since have deleted the state needed for a single consistent capture.

  • The PR achieves this by gradually iterating the state tries and maintaining a marker for the account/storage slot position up to which the snapshot has already been generated. Every time a new block is executed, state mutations prior to the marker get applied directly (the ones after it get discarded) and the snapshot builder switches to iterating the new root hash (a minimal sketch of this marker check follows below).
  • To handle reorgs, the builder operates on HEAD-128 and is capable of suspending/resuming if a state is missing (a restart will only write out some tries, not everything cached in memory).
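A minimal sketch of the marker check mentioned above, assuming a lexicographically ordered generator position (the helper name is made up for illustration):

package main

import (
    "bytes"
    "fmt"
)

// applyBelowMarker decides whether a flat-state mutation for key must be
// persisted: everything at or before the generator marker has already been
// emitted, so it has to be updated; everything after it can be discarded,
// since the generator will reach that position later anyway.
func applyBelowMarker(genMarker, key []byte) bool {
    if genMarker == nil {
        return true // generation finished, the whole keyspace is covered
    }
    return bytes.Compare(key, genMarker) <= 0
}

func main() {
    marker := []byte{0x80} // hypothetical current generator position
    fmt.Println(applyBelowMarker(marker, []byte{0x10})) // true: already generated
    fmt.Println(applyBelowMarker(marker, []byte{0xf0})) // false: not yet reached
}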

The benefit of the snapshot is that it acts as an acceleration structure for state accesses:

  • Instead of doing O(log N) disk reads (plus LevelDB overhead) to access an account or storage slot, the snapshot can provide direct, O(1) access time. This should be a small improvement in block processing and a huge improvement in eth_call evaluations (see the layered-lookup sketch after this list).
  • The snapshot supports account and storage iteration at O(1) complexity per entry plus sequential disk access, which should enable remote nodes to retrieve state data significantly more cheaply than before (the sort order is the state trie leaf order, so responses can directly be assembled into tries too).
  • The presence of the snapshot can also enable more exotic use cases such as deleting and rebuilding the entire state trie (guerilla pruning) as well as building an alternative state trie format (e.g. binary vs. hexary), which might be needed in the future.
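A toy sketch of the layered lookup, using assumed names rather than the PR's actual types: a read checks the newest diff layer, walks down through parents, and finally falls through to the flat disk layer with a single key-value read.

package main

import "fmt"

type layer interface {
    account(hash string) ([]byte, bool)
}

// diskLayer stands in for the flat LevelDB dump.
type diskLayer struct{ flat map[string][]byte }

func (dl *diskLayer) account(hash string) ([]byte, bool) {
    blob, ok := dl.flat[hash]
    return blob, ok
}

// diffLayer holds one block's worth of mutations on top of a parent layer.
type diffLayer struct {
    parent   layer
    accounts map[string][]byte
}

func (dl *diffLayer) account(hash string) ([]byte, bool) {
    if blob, ok := dl.accounts[hash]; ok {
        return blob, true
    }
    return dl.parent.account(hash) // not touched in this block, ask the layer below
}

func main() {
    disk := &diskLayer{flat: map[string][]byte{"0xaaa": []byte("old")}}
    head := &diffLayer{parent: disk, accounts: map[string][]byte{"0xbbb": []byte("new")}}
    fmt.Println(head.account("0xbbb")) // served by the diff layer
    fmt.Println(head.account("0xaaa")) // falls through to the disk layer
}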

The downside of the snapshot is that the raw account and storage data is essentially duplicated. In the case of mainnet, this means an extra 15GB of SSD space used.

}
// Cache doesn't contain account, pull from disk and cache for later
blob := rawdb.ReadAccountSnapshot(dl.db, hash)
dl.cache.Set(key, blob)
Contributor:
I'm torn on whether we really should cache nil-items here...

@karalabe (Member Author) commented Oct 10, 2019 via email

@holiman (Contributor) commented Oct 12, 2019

Another couple of points,

  • Right now, each block modifies roughly 2000 items (accounts + storage), and we have 128 layers.
  • Each layer represents one block, whereas the bottom layer (before disk) represents 200 blocks.
  • That means we'll have 128 layers of size N, and 1 layer of size 200 * N.
  • Eventually, we'll flush the bottom-most layer, and at that point we essentially have 128 blocks in memory (going down from e.g. 350).

Now, if we have M bytes of memory available, it seems to me that it would make more sense to have a gradual slope of memory usage, instead of N, N, ... 200 * N. Instead have N, N, 2N, 2N, 4N ... 64N (totalling 254N in this example; see the quick calculation below). The consequence would be the following:

  • Upside: When flushing the lowest layer, we'd not lose the majority of blocks from memory, only a smaller portion. So smaller fluctuations in performance.
  • Upside: When accessing an item, we'd not iterate through 128 layers, only ~14.
  • Downside: When reorging, or accessing a particular block state, we'd potentially have to re-execute some blocks (maybe up to 63). However, if tuned well, this would not happen on mainnet.
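A quick back-of-the-envelope comparison of the two layouts, in units of N (one block's worth of mutations); the numbers are only meant to illustrate the totals quoted above:

package main

import "fmt"

func main() {
    // Current layout: 128 diff layers of size N plus a ~200N accumulator.
    flat := 128 + 200

    // Proposed slope: layer sizes N, N, 2N, 2N, 4N, 4N, ..., 64N, 64N.
    slope, size := 0, 1
    for i := 0; i < 14; i++ {
        slope += size
        if i%2 == 1 {
            size *= 2
        }
    }
    fmt.Println("flat layout:  ", flat, "N")  // 328 N
    fmt.Println("sloped layout:", slope, "N") // 254 N
}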

dl.lock.Lock()
defer dl.lock.Unlock()

dl.parent = parent.flatten()
Member Author:
We might need some smarter locking here. If the parents have side branches, those don't get locked by the child's lock.

// Snapshot represents the functionality supported by a snapshot storage layer.
type Snapshot interface {
// Info returns the block number and root hash for which this snapshot was made.
Info() (uint64, common.Hash)
Contributor:
It would be 'nicer' if Info returned the span of blocks that a snapshot represents, and not just the last block.

Member Author:
I'm actually thinking of nuking the whole number tracking. Clique gets messy because roots are not unique across blocks. We could make it root + number, but that would entail statedb needing to add block numbers to every single method, which gets messy fast.

Alternatively, we can just not care about block number and instead just track child -> parent hierarchies. This would mean that we could end up with a lot more state "cached" in memory than 128 if there are ranges of empty clique blocks, but those wouldn't add any new structs, just keep the existing ones around for longer, so should be fine.
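A bare-bones sketch of that child -> parent tracking idea (illustrative names, not the PR's types): layers are keyed purely by state root, and each diff layer just remembers its parent root, with no block numbers involved.

package main

import "fmt"

type rootHash string // stands in for common.Hash

type diffLayer struct {
    root   rootHash
    parent rootHash
}

type tree struct {
    layers map[rootHash]*diffLayer
}

// add links a new block's diff layer on top of an existing parent root. With
// Clique, repeated roots for empty blocks simply map onto the same entry, so
// no extra structs are created, they just stay around for longer.
func (t *tree) add(r, parent rootHash) {
    t.layers[r] = &diffLayer{root: r, parent: parent}
}

// path walks the child -> parent chain until an unknown root (the disk layer).
func (t *tree) path(r rootHash) []rootHash {
    var out []rootHash
    for {
        layer, ok := t.layers[r]
        if !ok {
            return out
        }
        out = append(out, r)
        r = layer.parent
    }
}

func main() {
    t := &tree{layers: map[rootHash]*diffLayer{}}
    t.add("r1", "disk")
    t.add("r2", "r1")
    fmt.Println(t.path("r2")) // [r2 r1]
}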

func (dl *diffLayer) Journal() error {
dl.lock.RLock()
defer dl.lock.RUnlock()

Contributor:
Perhaps we should check dl.stale here and error out if set

Contributor:
actually, probably better to do it in journal, since one of the parents might be stale, not just this level

Comment on lines 232 to 234
// If we still have diff layers below, recurse
if parent, ok := diff.parent.(*diffLayer); ok {
return st.cap(parent, layers-1, memory)
Contributor:
Now that it's standalone and not internal to a difflayer, would be nicer to iterate instead of recurse, imo
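For illustration, a self-contained toy (assumed types) of the iterative walk being suggested: follow parent pointers in a loop instead of recursing one call per diff layer.

package main

import "fmt"

type diffLayer struct {
    root   string
    parent *diffLayer // nil stands in for the disk layer
}

// bottomDiff descends at most `layers` levels and returns the layer reached,
// replacing the recursive st.cap(parent, layers-1, memory) call chain with a loop.
func bottomDiff(diff *diffLayer, layers int) *diffLayer {
    for ; layers > 0 && diff.parent != nil; layers-- {
        diff = diff.parent
    }
    return diff
}

func main() {
    l1 := &diffLayer{root: "r1"}
    l2 := &diffLayer{root: "r2", parent: l1}
    l3 := &diffLayer{root: "r3", parent: l2}
    fmt.Println(bottomDiff(l3, 2).root) // r1
}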

}
writer = file
}
// Everything below was journalled, persist this layer too
@holiman (Contributor) commented Oct 23, 2019:
Suggested change
-// Everything below was journalled, persist this layer too
+if dl.stale {
+    return nil, ErrSnapshotStale
+}
+// Everything below was journalled, persist this layer too

}
// If we haven't reached the bottom yet, journal the parent first
if writer == nil {
file, err := dl.parent.(*diffLayer).journal()
Contributor:
Right now, we obtain the read lock in Journal -- but here we're calling the parent journal directly, bypassing the lock-taking.
We should remove the locking from Journal and do it in this method instead, right after the parent is done writing.
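A self-contained sketch (toy types, not the PR's code) of the locking order proposed here: journal the parent without holding this layer's lock, then take the read lock only while writing this layer's own entries.

package main

import (
    "bytes"
    "fmt"
    "sync"
)

type diffLayer struct {
    lock     sync.RWMutex
    parent   *diffLayer // nil stands in for the disk layer
    accounts map[string][]byte
}

func (dl *diffLayer) journal(buf *bytes.Buffer) error {
    // Journal the parent first, without holding our own lock;
    // each layer locks itself only for its own write.
    if dl.parent != nil {
        if err := dl.parent.journal(buf); err != nil {
            return err
        }
    }
    // Only now take the read lock, just for writing this layer's data.
    dl.lock.RLock()
    defer dl.lock.RUnlock()
    for hash, blob := range dl.accounts {
        fmt.Fprintf(buf, "%s=%x\n", hash, blob)
    }
    return nil
}

func main() {
    parent := &diffLayer{accounts: map[string][]byte{"a1": {0x01}}}
    child := &diffLayer{parent: parent, accounts: map[string][]byte{"a2": {0x02}}}
    var buf bytes.Buffer
    if err := child.journal(&buf); err != nil {
        panic(err)
    }
    fmt.Print(buf.String())
}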


// If the layer is being generated, ensure the requested hash has already been
// covered by the generator.
if dl.genMarker != nil && bytes.Compare(key, dl.genMarker) > 0 {
Contributor:
This seems to work, but would be more obvious with the following change:

Suggested change
-if dl.genMarker != nil && bytes.Compare(key, dl.genMarker) > 0 {
+if dl.genMarker != nil && bytes.Compare(accountHash[:], dl.genMarker) > 0 {

core/state/snapshot/disklayer_generate.go (outdated, resolved)
@holiman (Contributor) commented Nov 27, 2019

I don't understand... At the end of this method, we return a new diskLayer. Who remembers these accounts that were left stranded in this layer and not copied to disk?

Answering myself: the new diskLayer doesn't contain it, so the caller will later have to resolve it from the trie, which is fine... I think?


speed := done/uint64(time.Since(gs.start)/time.Millisecond+1) + 1 // +1s to avoid division by zero
ctx = append(ctx, []interface{}{
"eta", common.PrettyDuration(time.Duration(left/speed) * time.Millisecond),
Contributor:
Could you add a percentage too? I think that's a pretty user-friendly thing to have. Basically 100 * binary.BigEndian.Uint32(marker[:4]) / uint32(-1).
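A small sketch of that suggestion, treating the first four bytes of the generation marker as a position within the 2^32 keyspace (names are illustrative):

package main

import (
    "encoding/binary"
    "fmt"
)

// progressPercent converts a generator marker into a rough completion
// percentage, relying on account hashes being evenly distributed over the keyspace.
func progressPercent(marker []byte) float64 {
    if len(marker) < 4 {
        return 0
    }
    return 100 * float64(binary.BigEndian.Uint32(marker[:4])) / float64(^uint32(0))
}

func main() {
    marker := []byte{0x80, 0x00, 0x00, 0x00} // hypothetical halfway marker
    fmt.Printf("generated: %.2f%%\n", progressPercent(marker))
}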

@@ -468,6 +466,10 @@ func (s *StateDB) updateStateObject(obj *stateObject) {

// If state snapshotting is active, cache the data til commit
if s.snap != nil {
// If the account is an empty resurrection, unmark the storage nil-ness
if storage, ok := s.snapStorage[obj.addrHash]; storage == nil && ok {
Contributor:
I'm not convinced this is correct. There are a couple of things that can happen:

  • Old code (before CREATE2): A contract with storage is selfdestructed in tx n. In tx n+1, someone sends a wei to the address, and the account is recreated. The desired end-state is that the storage has become nil and the account exists.

  • New code, with CREATE2: A contract is killed in tx n. In tx n+1, the contract is recreated, and the initcode sets new storage slots. So the old storage slots are all cleared, and there are now new storage slots set. We need to handle this (we don't currently; the nil-marker convention involved is sketched below).
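For reference, a toy model (assumed names, not the PR's code) of the nil-marker convention the quoted check relies on: a nil entry in snapStorage means "this account's storage was wiped", a missing entry means "nothing recorded yet", and a resurrection must replace the nil marker before new slots are written.

package main

import "fmt"

func main() {
    snapStorage := map[string]map[string][]byte{}

    // Selfdestruct in tx n: mark the whole storage as deleted.
    snapStorage["0xdead"] = nil

    // Resurrection in tx n+1: unmark the nil-ness before recording new slots,
    // mirroring the check in updateStateObject above.
    if storage, ok := snapStorage["0xdead"]; storage == nil && ok {
        snapStorage["0xdead"] = map[string][]byte{}
    }
    snapStorage["0xdead"]["slot1"] = []byte{0x01}
    fmt.Println(snapStorage["0xdead"]) // map[slot1:[1]]
}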

@gballet (Member) left a comment:
LGTM, a couple comments here and there, as well as a question about future-proofing the PR.

Comment on lines +101 to +109
Account(hash common.Hash) (*Account, error)

// AccountRLP directly retrieves the account RLP associated with a particular
// hash in the snapshot slim data format.
AccountRLP(hash common.Hash) ([]byte, error)

// Storage directly retrieves the storage data associated with a particular hash,
// within a particular account.
Storage(accountHash, storageHash common.Hash) ([]byte, error)
Member:

This isn't a showstopper for merging the PR, merely a question regarding future evolutions of Ethereum: there is an EIP (haven't found it yet) that suggests merging the account and storage tries. This kept recurring in stateless 1.x discussions. It would be a good idea to have a more generic method like GetBlobAtHash(f) interface{}, taking a function f to deserialize the blob.
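A rough sketch of what such a generic accessor could look like (hypothetical, not part of this PR): the layer hands back the raw blob at a hash and the caller supplies the deserializer, so a future merged account/storage trie would not need new interface methods.

package main

import (
    "errors"
    "fmt"
)

// hash stands in for common.Hash in this self-contained sketch.
type hash [32]byte

// blobSnapshot is a hypothetical generic layer interface.
type blobSnapshot interface {
    GetBlobAtHash(h hash, decode func([]byte) (interface{}, error)) (interface{}, error)
}

type memSnapshot struct{ blobs map[hash][]byte }

var _ blobSnapshot = (*memSnapshot)(nil)

func (s *memSnapshot) GetBlobAtHash(h hash, decode func([]byte) (interface{}, error)) (interface{}, error) {
    blob, ok := s.blobs[h]
    if !ok {
        return nil, errors.New("not found")
    }
    return decode(blob)
}

func main() {
    var h hash
    snap := &memSnapshot{blobs: map[hash][]byte{h: []byte("rlp-bytes")}}
    // The caller decides how to interpret the blob (e.g. RLP-decode an account).
    out, err := snap.GetBlobAtHash(h, func(b []byte) (interface{}, error) { return string(b), nil })
    fmt.Println(out, err)
}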

//
// Note, the method is an internal helper to avoid type switching between the
// disk and diff layers. There is no locking involved.
Parent() snapshot
Member:
if it's an internal helper, then it shouldn't be public

// The goal of a state snapshot is twofold: to allow direct access to account and
// storage data to avoid expensive multi-level trie lookups; and to allow sorted,
// cheap iteration of the account/storage tries for sync aid.
type Tree struct {
Member:
Change name to match comment, or vice-versa.

@karalabe karalabe modified the milestones: 1.9.12, 1.9.13 Mar 16, 2020
* core/state/snapshot/iterator: fix two disk iterator flaws

* core/rawdb: change SnapshotStoragePrefix to avoid prefix collision with preimagePrefix
@karalabe karalabe merged commit 613af7c into ethereum:master Mar 23, 2020
@fruor commented Apr 17, 2020

As it's now released, wouldn't it make sense to update the Command Line Options wiki? The --snapshot switch is completely missing there

@karalabe (Member Author):
@fruor No, the feature will be on by default soon-ish. We just don't yet want people to use it, as it might still change a bit. It's there if someone wants to test it (e.g. our benchmarker runs).
