Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

bulk child trie deletion #5280

Closed
wants to merge 171 commits into from
Closed

Conversation

cheme
Copy link
Contributor

@cheme cheme commented Mar 17, 2020

This Pr is built upon #4827.

It implements bulk child deletion: instead of deleting every element of a child trie one by one with kill_child_storage, the deletion is done as a whole.
This means:

  • only a single item is stored when deleting a child trie (the identifier of the child trie).
  • A single change trie item is produce for the whole child trie deletion, this is done in a breaking way but I do not think change trie is currently use. I do not know if the item should appear in the digest too. cc @svyatonik
  • on pruning deletion is done by using the key prefix used by rocksdb to avoid key collision. This Pr is needed for this : Delete by prefix operator in kvdb parity-common#360
  • pruning journal an noncanonical journal need to store this change, and a different format is use. This change of format means that it may be possible to use the first approach of Manage child trie content independently #4827 to gain a bit of space (pushing the rocksdb prefix to key value later in order to reduce journal key redundancy by managing changes per child trie). cc @arkpar

This PR depending on #4827 first, I put its status in progress, but it should be usable for contract testing cc @Robbepop @pepyakin .

cheme added 30 commits January 28, 2020 09:45
Note that KeyspacedDB are still use.
KeyspacedDBMut only for test.
allocation whise unless we break all api for a close to nothing perf
change: switching to simple single child info struct.
@cheme
Copy link
Contributor Author

cheme commented May 7, 2020

I did start implementing the possibility to remove deleted nodes from proof like describe in previous comment master...cheme:proof_backends .

The code is incomplete (no handle of child trie prefix deletion), and need many tests, but enough to formulate a few interesting observation.

I end up with three kind of proof construction and two kind of proof verification.
With the bulk trie deletion approach proof did not need those different execution code, because not registering the bulk deletion was specific operation.

  • proof that do not need to trace deletion that are bellow a parent deletion, and therefore can produce shorter proof:
    • A first variant that do not process the transaction. (WithoutTransaction)

This should be the default choice unless we need to use the transactions (performance).
The implementation uses a different logic at Ext level and additional filtering in the overlay change (deleted child trie and deleted prefix).
On root calculation we simply calculate child trie root on a empty backend.
Note that for the prefix case, we are using a hack to insert the unitary changes lazilly. That does not achieve the real goal (on proof calculation we will register the access to the unitary element), and new trie primitive would be needed to only change the prefix node. Still the lazy nature of this hack benefits proofs that do not call storage root.

    • A second variant that process the transaction but do not record it. (WithTransaction)

The implementation simply switch of the proof node recording during prefixed key access and during child trie key access on delete prefix and delete child trie operation.
But internally for storage root calculation it needs to work the same way as the previous variant and suffer from the same issue on prefix case.

Both variant needs to be verified with a verifier that do not query child trie nodes on deletion, that is the same change in processsing than the first variant.
The drawback here is that an host function could not be use in the proof (except if we do not prove its value as it is currently doable with Core_execute_block behavior), changes_storage_root that compute the change trie root: getting those deletion out of the proof we cannot check change trie, that can be acceptable (note that with bulk deletion approach all will be fine because change trie do not expect those deletion entries).

  • proof that need to all deletion access (the existing code where we register all trie node touched):

Interesting for testing.
Interesting for proving a change trie. Should parachain verify their change trie in polkadot, that does not seem very obvious to me (at first I was thinking 'no use', but we also probably want to trust change trie).

So my gist out of this branch would be to say that it is doable to skip unneeded deleted content in proof, but rely on the assumption that change trie root will never be read during block processing.
Generally I find it better to have a special change for trie deletion (and by extension prefix deletion), and do not see disadvantage from having it (except complexity but if we got some specific proof generation process that is not true at all).
Benefit being lighter change trie and smaller memory footprint.

In both scenario (the linked branch and this PR approach), delete by prefix optimization would require additional tools (radix trie map and specific trie update operation, plus some reverse deathrow logic on non archive node (this would be needed for bulk child trie deletion too if we would allow using a child trie again after deletion)).

@cheme cheme added B3-apinoteworthy C1-low PR touches the given topic and has a low impact on builders. labels Jun 11, 2020
@athei athei mentioned this pull request Jun 23, 2020
7 tasks
@athei
Copy link
Member

athei commented Jul 3, 2020

What is the status here?
@cheme You need feedback (if yes by whom)?

I am asking because we need this for #6357.

@cheme
Copy link
Contributor Author

cheme commented Jul 3, 2020

It is frozen, I need a feedback on wether we want to make child trie deletion (and by extension possibly prefix deletion) a first class citizen operation (currently we got insert key value and delete key in changesets).

I mean @rphmeier did reply that we would not include child trie key in proofs, so I started implemented that, but end up having complicated redundant code (state machine need to behave differently depending on context), and I still think that registering a specific child trie deletion operation is the easier and cleaner thing to do.

From a quick look at the changes, things that could be trimmed:

  • remove the InnerHash code, it is an unrelated optimisation
  • revert Manage child trie content independently #4827 change (but we still need to add an additional child trie related info in client and organising by child trie makes thing a bit more evolutive (bringing child_info into the client db layer to allow different backend depending on child type)). This relates to discussion with @arkpar and what is referred too as a 'totally different architecture'.

Only way I see to avoid child info in client would be to create a new storage change that contains the bulk child deletion with either keyspace for rocksdb delete or fetched child trie node hash for parity db (but it would means we fetch child trie content at the time of block processing which makes things awkward with proof, and make use store useless journal of key when we only need root).
In this case that would be the db type that will be leaking in state machine, and we need to avoid fetching key when running a child deletion in statemachine for proof, so I think this idea is a bad move.

@athei
Copy link
Member

athei commented Jul 3, 2020

You outlined multiple possible solutions. You have problems deciding which to choose or why is it frozen? @rphmeier @arkpar can you comment which one you would prefer? I have to little knowledge on this topic to fully comprehend the answer.

@arkpar
Copy link
Member

arkpar commented Jul 6, 2020

I still think subtree deletion should not leak to state-db abstraction. It should really be backend agnostic, and not rely on prefixes or backend-provided identifiers. ParityDb uses reference-counting, so that same node could be shared by multiple tries. Saving a lot of space potentially. This does not play well with prefixed keys. The most generic way to implement it is to walk the tree and delete each individual key. Avoiding any special handling of this operation. I'd suggest we don't do any DB-optimized deletion until it is shown to be a problem in practice.

Regarding proofs: Pausing recording during the deletion operation looks like the right thing to do. It is an optimization though. The validator would not access nodes when it validates deletion, so there's no need to provide them. I'm not sure I understand the issue with the change trie. How does skipping collecting proof for deletion affects the change trie root? As far as I know, change tries are only implemented for the main tree.

The long-term plan for ParityDB is to evolve to support trie-related operations. That would involve not just deletion, but also reference-counted insertion and iteration at the DB level. Until then we'd better keep it simple.

@rphmeier
Copy link
Contributor

rphmeier commented Jul 6, 2020

I feel too disconnected from this issue and the underlying DB roadmap to meaningfully comment. I believe the only place I have weighed in is here upon this request from @arkpar to clarify needs of some higher-level protocol. I defer to @arkpar for comments on the general approach.

@athei
Copy link
Member

athei commented Jul 6, 2020

The most generic way to implement it is to walk the tree and delete each individual key. Avoiding any special handling of this operation. I'd suggest we don't do any DB-optimized deletion until it is shown to be a problem in practice.

This is basically what we have now, right? You argue for keeping child trie deletion a O(n) operation?

@cheme
Copy link
Contributor Author

cheme commented Jul 7, 2020

ParityDb uses reference-counting, so that same node could be shared by multiple tries. Saving a lot of space potentially. This does not play well with prefixed keys.

The PR do not use prefixed keys in this case: for rocksdb the drop tree operation contains both child trie root and keyspace, then only in the case of rocksdb we try to apply a delete by prefix, parity-db still iterate on trie but only when pruning is called.
At first I thought it would be useless with parity-db, but the good things are:

  • only keeping trace of child root in the pruning journal.
  • in case of an archive node it means that we never iterate the child trie at all
  • smaller change trie (but a bit less info which in this case looks good to me)
    Still those point are either optimization (I don't expect the size of the journal to be problematic), or breaking (change trie).

I'd suggest we don't do any DB-optimized deletion until it is shown to be a problem in practice

Using trie iteration for rocksdb (the same way as parity-db) seems doable, it will then only helps with the previous point.

Pausing recording during the deletion operation looks like the right thing to do.

That is what I did in master...cheme:proof_backends , it means that there is different mode of execution for state-machine, one that produce the transaction payload, and one that don't (IIRC I did add a third one for doing both (tx payload and proof not recording ), I believe it is what is needed for cumulus).
I don't remember too well why I was really not happy with this branch, maybe bringing the same logic for delete_with_prefix was a bit too much, or I just end up seeing the new assumption with change trie and other things as more complicated than having a delete trie change set operation.

I'm not sure I understand the issue with the change trie.

Change trie root can be calculated from a storage host function so if a parachain uses it in its runtime, it would need to include all deletion in its proof (so running the proof add all the child trie deletion in change trie).
In practice the value of the change trie root would probably be directly send to the header without being use internally by the state machine and not stored in the state trie, but it is still possible to do.
Similarily including individual child trie deletion in change trie means that if we want to proof child trie the proof will allways be way bigger (as is the child trie)

As far as I know, change tries are only implemented for the main tree.

I reply to quickly to previous point, change trie do also include child trie changes.

The long-term plan for ParityDB is to evolve to support trie-related operations. That would involve not just deletion, but also reference-counted insertion and iteration at the DB level. Until then we'd better keep it simple.

Sounds interesting, I will try to keep looking at parity-db branches and issues if there is.

@cheme
Copy link
Contributor Author

cheme commented Jul 7, 2020

You argue for keeping child trie deletion a O(n) operation?

Note that in the parity-db case we still got a O(n) trie parsing removal operation. Regarding the perf gains the trade off seems fine.
The issue I see, and I think is your concern regarding rent system, is that we iterate during block evaluation when with this PR we only need to iterate on pruning (which makes sure this O(n) cost only happen for canonical and is easier to separate or reschedule).

@athei
Copy link
Member

athei commented Jul 7, 2020

The issue I have with O(n) child trie removal is when contracts are evicted due to rent. If the contract has a huge amount of accumulated storage this eviction might be even too large for one block. We then need to implement something like lazy removal that can distribute this across multiple blocks.

@arkpar
Copy link
Member

arkpar commented Jul 7, 2020

The most generic way to implement it is to walk the tree and delete each individual key. Avoiding any special handling of this operation. I'd suggest we don't do any DB-optimized deletion until it is shown to be a problem in practice.

This is basically what we have now, right? You argue for keeping child trie deletion a O(n) operation?

It is O(n) best case anyway. The only difference is with rocksdb you get O(n) on commit, and not on the actual delete operation. Unless you keep each subtree in a separate database. We can indeed optimize to perform it in the background on the database level but I'd argue it is too early for that.

@arkpar
Copy link
Member

arkpar commented Jul 7, 2020

I think it does not make much sense for change tries to collect all removed keys for subtries. If it does, there's your O(n) right there and there's little sense to optimize it in the database, since you have to do iteration anyway.
There needs to be a single event recorded in the parent tree that would allow to track that all subtree data has been removed.

@athei
Copy link
Member

athei commented Jul 7, 2020

I see. Thanks for your answers. We then need cope with the fact that eviction is linear time operation.

@Robbepop
Copy link
Contributor

Robbepop commented Jul 7, 2020

From my current understanding if we cannot get around the fact that contract eviction is linear time we should at least make it possible to perform the clean-up lazily or defer it when possible or break it up into chunks of clean-up. Otherwise we end up having no way to clean-up contract instances that are especially big. Might this even be an attack vector in certain cases? Maybe there are other operations that need to be deferred or chunked up into pieces to be distributed among several blocks so that we could find a generalized solution to this problem?

@gnunicorn
Copy link
Contributor

closing because of inactivity.

@gnunicorn gnunicorn closed this Sep 9, 2020
@cheme
Copy link
Contributor Author

cheme commented Sep 9, 2020

There needs to be a single event recorded in the parent tree that would allow to track that all subtree data has been removed.

Using a specific child trie delete operation does inherently cover this point.

Closing note:
bulk child trie deletion seems still very relevant to me.
Though it is more in line with an architecture where each child trie are managed independently (which is needed for multiple other things and open perspective for other kind of state).
So in line with #4827 I feel like this will be needed in the future (I feel like the switching off deletion when registering proof is more complex than this approach).

@athei athei mentioned this pull request Nov 6, 2020
@cheme
Copy link
Contributor Author

cheme commented Nov 23, 2020

Also for #6594 having a single child trie delete info send back to the client makes the non O(1) cost only happen at pruning (or never for archive nodes), and it could be easier to implement this way.

@cheme
Copy link
Contributor Author

cheme commented Apr 8, 2021

Note, this PR lacks a correct trie root calculation in state-machine, code such as https://github.com/cheme/trie/blob/379bf92720e23b2465badd4739c7d2ba75115c06/trie-db/src/traverse.rs#L658 (old code please don't use), mainly have a detach branch from trie action (can be implemented in triedbmut too, but more involve).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A3-in_progress Pull request is in progress. No review needed at this stage. C1-low PR touches the given topic and has a low impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.