
core/state/snapshot: clean up dangling storages in snapshot generation #24665

Closed
wants to merge 8 commits

Conversation

rjl493456442
Member

@rjl493456442 rjl493456442 commented Apr 8, 2022

This PR improves snapshot generation by detecting dangling storages during the generation process.

A dangling storage is storage snapshot data that exists on disk while the corresponding account snapshot
data is missing. This can happen in the following scenario (text copied from Peter):


> During snap sync, we pull in pieces of storage slots, but between two cycles, the contract might get deleted. In that case if the deletion happens in between an interrupt, I think, the old slots will end up on disk, but the old account not... neither the new one which got deleted. So we have storage slots, but no accounts on top.
>
> After snap sync finishes, we need to regenerate the state snapshot. So we use our batched account prover to fix any issues in the accounts. The account range may or may not be correct - and we will fix it - however, we will not care about the deleted account since it's not part of the snapshot account range.
>
> Then after fixing the account range, we verify each of the account's storage slots - each existing one that is, or freshly deleted ones. But we don't ever consider dangling storage data without an account.


This PR fixes the issue by introducing a storage scanner. One scanner is constructed per account range; it detects all the dangling storages in that range and caches them in memory (the size is limited, so there is no OOM risk). Constructing the scanner is not trivial since it can involve lots of disk reads, but because it happens once per account range the cost is still acceptable. The detected dangling storages are wiped at the correct time during generation.
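The core detection idea can be sketched as follows. This is a hypothetical, simplified model in Go, not the PR's actual code: storage-snapshot keys are reduced to their owner-account hashes, and any owner with no entry in the account snapshot has dangling storage.

```go
package main

import "bytes"

// danglingAccounts is an illustrative sketch: storageOwners holds the
// owner-account hashes of storage snapshot entries in iteration (sorted)
// order, and accounts is the set of hashes present in the account snapshot.
// Any owner with no account entry owns dangling storage.
func danglingAccounts(storageOwners [][]byte, accounts map[string]bool) [][]byte {
	var dangling [][]byte
	var last []byte
	for _, owner := range storageOwners {
		if bytes.Equal(owner, last) {
			continue // many slots share one owner; decide per owner, not per slot
		}
		last = owner
		if !accounts[string(owner)] {
			dangling = append(dangling, owner)
		}
	}
	return dangling
}
```

In the real generator the owner hashes come from a database iterator over the storage-snapshot key prefix rather than a slice, but the membership check is the same shape.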

@rjl493456442
Member Author

Note: this PR has not been tested on a live network yet and should be verified first.

@rjl493456442
Member Author

rjl493456442 commented Apr 11, 2022

It has been verified on Mainnet and Goerli, though no dangling storage was detected.

@holiman
Contributor

holiman commented Apr 25, 2022

> Constructing scanner is not trivial which might involve lots of disk reads, but it's per account range so it's still acceptable. The detected dangling storages will be wiped at the correct time during the generation.

I've read the code a bit, but it's still not quite clear to me what we actually scan. Will this (in total) scan the entire slot key-range, or will it only scan the portions of the keyspace that reside between account ranges?

I suspect it's the former.

@rjl493456442
Member Author

It will only scan the portions of the keyspace that reside between account ranges.

During the snapshot generation process, we take a batch of accounts as the unit of work, and the dangling storage detector does the same. A single range detector is constructed per batch of accounts (128 by default), and all the dangling storages in that range are detected in a single database iteration.

Why do I say constructing the scanner is not trivial? Because whenever we construct a leveldb iterator and shift the iteration cursor to the start position, that procedure already involves a bunch of db reads.
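The per-range, single-iteration idea can be sketched roughly like this. Names and data shapes here are illustrative only, not the PR's API; the sorted slice stands in for a leveldb iterator over the storage-snapshot keyspace.

```go
package main

import "bytes"

// detectInRange is a hedged sketch: for one batch of accounts the generator
// processes (covering origin..limit of the account keyspace), a single
// ordered scan over storage-snapshot owner hashes finds owners in that
// range that have no entry in the account snapshot.
func detectInRange(storageOwners [][]byte, origin, limit []byte, accounts map[string]bool) [][]byte {
	var dangling [][]byte
	var seen []byte
	for _, owner := range storageOwners { // sorted, like a leveldb iterator
		if bytes.Compare(owner, origin) < 0 {
			continue // before the range; a real iterator would Seek past this
		}
		if len(limit) > 0 && bytes.Compare(owner, limit) > 0 {
			break // past the range; the scan stops early
		}
		if bytes.Equal(owner, seen) {
			continue // one decision per owner, not per slot
		}
		seen = owner
		if !accounts[string(owner)] {
			dangling = append(dangling, owner)
		}
	}
	return dangling
}
```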

@holiman
Contributor

holiman commented Apr 26, 2022

Sorry, it seems I made this require a rebase?

Contributor

@holiman holiman left a comment


I think we should have a review-call about this, I find it hard to grok fully :)

```go
)

// DanglingRange describes the range for detecting dangling storages.
type DanglingRange struct {
```
Contributor


Suggested change:

```diff
-type DanglingRange struct {
+type DanglingRangeScanner struct {
```

perhaps?

```go
defer iter.Release()

for iter.Next() {
	account := iter.Key()[len(rawdb.SnapshotStoragePrefix) : len(rawdb.SnapshotStoragePrefix)+common.HashLength]
```
Contributor


I know that the way you have written it is "more correct", but IMO it's easier to read in the form

```go
account := iter.Key()[1:33] // trim 1-byte prefix, read 32-byte hash
```

Might be a minority opinion though ...
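For reference, the two slicing forms pick out the same bytes as long as the storage-snapshot prefix is one byte and hashes are 32 bytes. A minimal stand-alone check (the constants below are stand-ins, since rawdb and common are not imported here):

```go
package main

// Stand-ins for rawdb.SnapshotStoragePrefix and common.HashLength,
// assumed here to be a 1-byte prefix and 32-byte hashes.
var snapshotStoragePrefix = []byte{0x6f}

const hashLength = 32

// ownerFromKey extracts the owner-account hash from a storage snapshot key,
// mirroring the "more correct" prefix-length form from the review hunk.
func ownerFromKey(key []byte) []byte {
	return key[len(snapshotStoragePrefix) : len(snapshotStoragePrefix)+hashLength]
}
```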

Comment on lines +550 to +552
```go
if err := drange.cleanup(nil); err != nil {
	return false, nil, err
}
```
Contributor


I see that you do a cleanup, but I don't see where you actually ever invoke detect?

Member Author


detect is invoked when the range is constructed.

@rjl493456442
Member Author

> I think we should have a review-call about this, I find it hard to grok fully :)

Let's do it

```go
if err := result.forEach(func(key []byte, val []byte) error { return onState(key, val, false, false) }); err != nil {
var drange *DanglingRange
if kind != "storage" {
	drange = NewDanglingRange(dl.diskdb, origin, result.last(), false)
```
Contributor

@holiman holiman Apr 27, 2022


I wonder if result.last() can be wrong. Example:

The first range is 0x000 - 0x00F. It has the following:

- 0x001: an account + storage
- 0x002: an account + storage
- 0x003: an account + storage
- 0x00a: an account + storage

Then most of them go away, and we're left with only one account in this range:

- 0x001: account + storage

Won't result.last be 0x001, making it so we don't scan 0x001 - 0x00F?

Member Author


Why do you say "Then most of them go away"? If we iterate out four accounts 0x001, 0x002, 0x003 and 0x00a, then result.last will be 0x00a. It means the account snapshot in the range 0x001 - 0x00a has been checked by the generator, and the next range should start from 0x00b.

Member Author

@rjl493456442 rjl493456442 Apr 28, 2022


Even if three of these four accounts are not in the trie (dangling account snapshot data) and get removed from disk, result.last should still be 0x00a.
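The boundary rule being discussed here — the checked range ends at result.last, so the next range starts at last+1 — can be sketched with a key-increment helper. The sketch below is modeled on, but not copied from, the increaseKey helper in geth's snapshot generator:

```go
package main

// increaseKey increments a big-endian key in place and returns it, or
// returns nil if the key overflows (all 0xff). With result.last == 0x...0a,
// the next account range would start at increaseKey(last) == 0x...0b.
func increaseKey(key []byte) []byte {
	for i := len(key) - 1; i >= 0; i-- {
		key[i]++
		if key[i] != 0 {
			return key
		}
	}
	return nil
}
```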

@rjl493456442
Member Author

Closed because of #24811
