-
Notifications
You must be signed in to change notification settings - Fork 53
Implement removal of orphan nodes #37
Implement removal of orphan nodes #37
Conversation
sha256(testKey) = 0001... sha256(testKey2) = 1000... sha256(foo) = 0010...
no case produces nil sideNodes; nothing modifies the stored hashes
Store values in a separate mapstore indexed by (hash(value)+key); allows pruning values Simplify pruning loops slightly as well
deepsubtree.go
Outdated
if !result { | ||
return ErrBadProof | ||
} | ||
|
||
if valueHash != nil { | ||
if err := dsmst.values.Set(valueKey(key, valueHash), value); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the first use, but conceptually is there a reason to map hash(key, hash(value)) -> value
? Why not just hash(key) -> value
(i.e. path -> value
)? Mapping path -> value
allows retrieval to the value knowing only the key, which is necessary for e.g. constant-time reads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With just key/path -> value
, there is no way to retrieve values for past versions. I considered using root+key
as @tzdybal suggested, but then lookup requires knowing the root at the value's last update. Seems we would need some sort of copy-on-write versioned backing store to make this efficient.
note - this is just key+hash(value) -> value
, no extra hash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that comment is outdated. Snapshotting and retrieving past version is outside the scope of this library. It can be accomplished at the db level. See https://github.com/cosmos/cosmos-sdk/blob/6a5a2de798d4e61302783ae84ecfe2958c7088d5/docs/architecture/adr-040-storage-and-smt-state-commitments.md.
Seems we would need some sort of copy-on-write versioned backing store to make this efficient.
Potentially yes, and that's an interesting direction, but as above it's outside the scope of this library. The caller can accomplish this by versioning the mapstores they pass in on SMT initialization.
TL;DR: use path -> value
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The caller can accomplish this by versioning the mapstores they pass in on SMT initialization.
This work is actually in support of that ADR - we'll be doing exactly that, so I'm basically indifferent to the indexing used here. I just didn't want to introduce breakage.
If just path
is used, supporting past value access will be dropped, but I assume proofs for non-pruned roots should still be supported? Otherwise pruning effectively becomes the only behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If just path is used, supporting past value access will be dropped, but I assume proofs for non-pruned roots should still be supported? Otherwise pruning effectively becomes the only behavior.
Good point. Proofs against past versions of the tree isn't needed in this library, since that can be accomplished caller-side. So, pruning should be the only behavior (making the prune
flag unnecessary).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proofs against old roots is a test case specified in the current version of this library. If this feature is to be removed because it's considered out of scope, we should be sure that this can easily be replicated caller-side with a KV store with snapshot support. What KV store supports this exactly? I checked RocksDB, but its snapshot feature is basically copying the entire DB directory, which clearly isn't efficient.
Or would the idea be that the caller can implement versioning themselves on top of any efficient KV store like this or this? If so, this doesn't seem to be suggested in the ADR, which seems to assume that the underlying KV store natively supports snapshotting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ADR states that the state commitment (i.e. the SMT, i.e. this library) isn't responsible for snapshots and historical queries, but rather the state storage is. Note that the state storage won't be a raw KV store, but will be a wrapper over one or more raw KV stores with additional logic.
State Storage requirements:
- range queries
- quick (key, value) access
- creating a snapshot
- historical versioning
- pruning (garbage collection)
State Commitment requirements:
- fast updates
- tree path should be short
- pruning (garbage collection)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While there has been some benchmarking done on Rocks DB etc snapshots for the ADR, this isn't a widely supported feature by KV stores, and it's yet to be seen how well this will work in practice. RocksDB snapshots seem to be designed with backups use case in mind; creating a new directory for every block could exhaust the inode limit on filesystems, given a chain with millions of blocks. BadgerDB seems to be more suitable as it's a key-value-version store, but this seems to be the only KV that supports this.
This library should act a general-purpose library. If someone wants to use this library for testing or research for example, they shouldn't have to setup complex extra scaffolding - or be tied to BadgerDB (the only KV store that seems to support efficient versioning) - to get the basic feature of accessing previous roots.
For flexibility, this library should ideally support the following two extremes:
- Prune everything.
- Prune nothing.
Anything in between could be implemented by the caller-supplied KV store; in the simple case, this could be delaying Del operations until e.g. 100 blocks. This would allow us to keep the ProveForRoot etc operations.
This could be implement by using a path -> value KV for this PR, and making pruning optional. The path -> value store (for efficient read ops), would only be for the latest version of the tree. However, the SMT would still need to store valueHash -> value in the nodes MapStore, to allow accessing value for old roots when pruning is disabled. When GetForRoot is called, it would get the leaf by traversing the tree directly, rather than using the path -> value store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively: this library could support "prune everything" only, but keep storing valueHash -> value in the nodes MapStore (until pruned).
It would then be trivial for the caller to implement a "prune nothing" with MapStore scaffolding that simply does nothing when Del is called. GetForRoot and ProveForRoot can then still work. However, those functions would only work for callers that have used a MapStore with this hack.
This is probably the better / easiest option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, actually I just realised that valueHash -> value can't be stored in either proposals, due to the ambiguity issue when pruning. So we need to think of another way to implement pruning while ideally allow for faster Get operations, without forcing people to use BadgerDB or implement their own versioning scheme, if they want to disable pruning.
In light of the above discussion, I'm favouring using the original PR #36 and changing the specification for the following reasons:
|
Adding a summary of some nuance discussed on Discord. More important than breaking the specs (which is unimportant in the grand scheme of things) the real issue is that the current In other words, regardless of pruning orphaned nodes, this abstraction must be reworked from the current assumption that a single KV exists for both internal nodes and leaves to the assumption that two KVs exist, one for internal nodes and one for leaves. This second abstract coincidentally aligns exactly with the requirements for constant-time reads, and thus does not require any redundancy. A suggested alternative would be to store Unfortunately, removing redundancy/this mandatory overhead means it becomes impossible to support historical queries, and pruning orphaned nodes must be enabled at all times. |
This may be a silly question, but what about just removing |
This will completely change how this library works. Right now it's a self-contained, general purpose Merkle Tree. Removing
This will only push the problem up to the caller ;) I agree with @adlerjohn that two KV stores should be used in library (which of course can be implemented as just prefix-buckets in single database). |
Well, that's the idea. The library user can optimize for whatever they need, and the library doesn't have to try to solve for all cases. You could still provide a self-contained tree map that includes value access via one of the methods discussed. Anyway, just a thought. |
I agree that given a trade-off between duplicating data storage, remove the Get functionality, and removing the ability to access old state roots, the latter is the least undesirable. If there's no other options, I'm in favour of having two KV stores, where one of the KV stores is a path -> value store, and dropping the functionality of accessing old state roots. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have consensus to change smv to a path -> value store, and drop the functionality of accessing old state roots within this library.
What is the timeline for the proposed changes. I would like to coordinate an audit of the library soonish. |
can short circuit when the same value is already set
@marbar3778 changes are pushed :) I will tag the owners for re-review. |
Co-authored-by: John Adler <adlerjohn@users.noreply.github.com>
The core PR LGMT, however there is some other things that should be do as a result of this PR:
|
I'll handle these in follow-up PRs so as to unblock @roysc. Good work on this PR! |
* Orphans removal - simple test * Add Option type and AutoRemoveOrphans Using variadic arguments is very non-intrusive way to introduce options. This is very generic, and will allow to implement other options in the future. * Add missing comment * Implement removal of orphan nodes (#37) * Implement orphan pruning * Correct tests comments sha256(testKey) = 0001... sha256(testKey2) = 1000... sha256(foo) = 0010... * Remove redundant checks and copies no case produces nil sideNodes; nothing modifies the stored hashes * Refactor value storage Store values in a separate mapstore indexed by (hash(value)+key); allows pruning values Simplify pruning loops slightly as well * Fix array copy * Avoid extra key param * Fix idempotent Set can short circuit when the same value is already set * Deprecate past root access, remove prune as option * Index values by just path * Cleanup * Update deepsubtree.go Co-authored-by: John Adler <adlerjohn@users.noreply.github.com> * verifyProofWithUpdates - Fix unused return value Co-authored-by: John Adler <adlerjohn@users.noreply.github.com> Co-authored-by: Roy Crihfield <roy@manteia.ltd> Co-authored-by: John Adler <adlerjohn@users.noreply.github.com> Co-authored-by: Ismail Khoffi <Ismail.Khoffi@gmail.com>
Refactors the storage of values into a separate KV store, indexed by
hash(value)+key
. As noted in #14 loses the benefit of deduplication, while enabling pruning of records.Completes work from #24.
Revises work from #36.