les: historical data garbage collection #19570
Conversation
Not sure I know the codebase well enough to review; generally looks good to me.
les/pruner.go (outdated):
```go
	defer p.wg.Done()

	var (
		last = p.checkpoint
```
Does this mean that every time the client is started you begin deleting old chain data starting from the checkpoint? This feels a bit wasteful.
Just an idea: you could change Prune to use a db iterator to find and delete everything in the db from the beginning up to the given section, instead of looping over every block number of the section (see the sketch after this list). This would have multiple advantages:
- you do not need to remember what was already pruned
- it is efficient, because chain-data db keys are prefixed with the block number (this was one of the reasons I did it like this)
- this way you can also prune cached chain data that was ODRed after that section was pruned. For example, log searching can download many old receipts; it is nice to cache them for a while, but it would also be good to throw them away eventually. If you just clean everything before the new section each time a new section is processed, I think that is good enough.
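A minimal sketch of that suggestion, assuming the v1.9-era ethdb API (`NewIteratorWithStart`) and the core/rawdb convention that chain-data keys are a one-byte prefix followed by the big-endian block number; the function `pruneBefore` is hypothetical, not code from this PR:

```go
package pruner

import (
	"bytes"
	"encoding/binary"

	"github.com/ethereum/go-ethereum/ethdb"
)

// pruneBefore deletes every entry under the given one-byte key prefix whose
// big-endian block number is below limit, without tracking what was pruned
// on previous runs.
func pruneBefore(db ethdb.Database, prefix byte, limit uint64) error {
	// The first key we want to keep: prefix + big-endian block number.
	end := make([]byte, 9)
	end[0] = prefix
	binary.BigEndian.PutUint64(end[1:], limit)

	it := db.NewIteratorWithStart([]byte{prefix})
	defer it.Release()

	batch := db.NewBatch()
	for it.Next() {
		if bytes.Compare(it.Key(), end) >= 0 {
			break // reached the retained range, stop deleting
		}
		if err := batch.Delete(it.Key()); err != nil {
			return err
		}
	}
	if err := it.Error(); err != nil {
		return err
	}
	return batch.Write()
}
```

Because the iterator always starts from the lowest remaining key, re-running this after a restart simply finds nothing left to delete, which addresses the "remember what was pruned" bookkeeping.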
les/pruner.go (outdated):
```go
		return
	}
	// Always keep the latest section data in database.
	for i := last + 1; i < min-1; i++ {
```
This logic does not handle indexer rollbacks. While this is not a frequent case, we do handle it in the indexers (it might happen on testnets/private nets). Rolling back while pruning is a bit tricky though (you would have to restore the headers of the current unfinished section), so I think it is fine not to be able to roll back properly while pruning is enabled. In that case an error message would be nice at least, and/or automatically reverting to the last stable checkpoint and resyncing from there.
Now I'll keep the latest section in the db. It means at least 32768+2048 headers are kept (one CHT section plus the confirmation window).
Do we really have reorgs that deep, even on a testnet? Maybe it's possible on Ropsten...
We don't have a very good solution for restoring all pruned chain data right now. Rewinding HEAD to the checkpoint (or genesis) seems feasible; I may try that approach.
Found a critical issue.
@zsfelfoldi I changed the code a bit. One important thing: I preserve all hash->number mappings in the database, since they are necessary for the hash-based APIs. In terms of storage size, the mappings of one section take about 1.4MB, so I think it's totally acceptable. Please take another look :).
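For reference, that figure is consistent with the core/rawdb key layout: each canonical mapping is a 10-byte key ('h' + 8-byte big-endian number + 'n') plus a 32-byte hash value, so one 32768-block section costs roughly 32768 × 42 B ≈ 1.4 MB before database overhead.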
See the comments about the iterators (the rest is fine now).
core/rawdb/accessors_chain.go (outdated):
```go
// ReadAllCanonicalHashes retrieves all canonical number and hash mappings in the
// given chain range. Note, this method should only be used in the pruned light
// client, otherwise the cost can be very expensive.
func ReadAllCanonicalHashes(db ethdb.Iteratee, from uint64, to uint64) ([]uint64, []common.Hash) {
```
This iterator implementation seems a bit suboptimal to me. If I call this for mainnet with all blocks present, for `from = 8000000` and `to = from + 1`, you will iterate over 8M entries until you find the starting item. I guess if the use case is light client pruning you can expect all initial keys to be missing, but it still seems suboptimal.

Wouldn't a better solution be to use `start = headerHashKey(from)`; `end = headerHashKey(to)` and then simply iterate with `NewIteratorWithStart(start)` and terminate when `bytes.Compare(key, end) >= 0`? We could also swap out `NewIteratorWithStart` for `NewIteratorWithRange` to make this code even simpler and the iterator even more flexible.
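A possible shape of that suggestion, reusing the rawdb key helpers (`headerHashKey`, `headerPrefix`, `headerHashSuffix`) and the era's `NewIteratorWithStart` API; a sketch, not necessarily the merged implementation:

```go
package rawdb

import (
	"bytes"
	"encoding/binary"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethdb"
)

// ReadAllCanonicalHashes returns the canonical number->hash mappings in the
// range [from, to), seeking directly to the first candidate key instead of
// scanning the whole keyspace.
func ReadAllCanonicalHashes(db ethdb.Iteratee, from uint64, to uint64) ([]uint64, []common.Hash) {
	var (
		numbers []uint64
		hashes  []common.Hash
		end     = headerHashKey(to)
	)
	it := db.NewIteratorWithStart(headerHashKey(from))
	defer it.Release()

	for it.Next() {
		key := it.Key()
		if bytes.Compare(key, end) >= 0 {
			break // past the requested range
		}
		// Skip interleaved header keys ('h' + num + hash, 41 bytes); canonical
		// mappings are exactly 'h' + num + 'n' (10 bytes).
		if len(key) == len(headerPrefix)+8+1 && bytes.HasSuffix(key, headerHashSuffix) {
			numbers = append(numbers, binary.BigEndian.Uint64(key[1:9]))
			hashes = append(hashes, common.BytesToHash(it.Value()))
		}
	}
	return numbers, hashes
}
```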
```diff
@@ -697,7 +697,7 @@ func (db *Database) Cap(limit common.StorageSize) error {
 //
 // Note, this method is a non-synchronized mutator. It is unsafe to call this
 // concurrently with other mutators.
-func (db *Database) Commit(node common.Hash, report bool) error {
+func (db *Database) Commit(node common.Hash, report bool, callback func(common.Hash)) error {
```
What's the purpose of this callback?
Because we want to prune the CHT and the Bloom Trie. The trie nodes of the latest section are enough for generating the next Merkle root, so I use the callback to collect all trie nodes of the current section and delete all the others.
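For illustration, the caller side could look roughly like this; only the `Commit` signature comes from this PR, the surrounding names are a hypothetical sketch:

```go
package les

import (
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/trie"
)

// collectSectionNodes commits the helper trie for the latest section and
// records every node hash written, so callers can later delete any stored
// trie node that is not in the returned set.
func collectSectionNodes(triedb *trie.Database, sectionRoot common.Hash) (map[common.Hash]struct{}, error) {
	keep := make(map[common.Hash]struct{})
	err := triedb.Commit(sectionRoot, false, func(hash common.Hash) {
		keep[hash] = struct{}{} // node belongs to the latest section: retain it
	})
	return keep, err
}
```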
@karalabe Fixed, PTAL.
Please remove the flag. This feature can just be on by default.
This PR introduces a garbage collection feature for the light client: historical chain data is deleted periodically. If you want to disable the GC, use the --light.nopruning flag.
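For example (with a light-mode geth node):

```
geth --syncmode light                    # pruning enabled by default
geth --syncmode light --light.nopruning  # keep all historical chain data
```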
We now have several types of data in the light client which can be pruned: historical block headers (and associated chain data), CHT trie nodes, and bloom trie nodes.
Since the light client can regenerate the CHT root and bloom trie root at runtime, all historical chain data is unnecessary for the light client.
If we need to prove something against a pruned header, the light client will fetch it again using the latest CHT root, which covers all historical headers.
Here are some GC results comparing the original client and the GCed client: [storage-size screenshots]
With this feature, we can keep the storage size of the light client bounded at a roughly fixed value.