Implement removal of orphan nodes #37

roysc · 2021-06-20T11:24:25Z

Refactors the storage of values into a separate KV store, indexed by hash(value)+key. As noted in #14 loses the benefit of deduplication, while enabling pruning of records.

Completes work from #24.
Revises work from #36.

sha256(testKey) = 0001... sha256(testKey2) = 1000... sha256(foo) = 0010...

no case produces nil sideNodes; nothing modifies the stored hashes

Store values in a separate mapstore indexed by (hash(value)+key); allows pruning values Simplify pruning loops slightly as well

adlerjohn · 2021-06-20T16:53:19Z

deepsubtree.go

 	if !result {
 		return ErrBadProof
 	}

+	if valueHash != nil {
+		if err := dsmst.values.Set(valueKey(key, valueHash), value); err != nil {


Here's the first use, but conceptually is there a reason to map hash(key, hash(value)) -> value? Why not just hash(key) -> value (i.e. path -> value)? Mapping path -> value allows retrieval to the value knowing only the key, which is necessary for e.g. constant-time reads.

With just key/path -> value, there is no way to retrieve values for past versions. I considered using root+key as @tzdybal suggested, but then lookup requires knowing the root at the value's last update. Seems we would need some sort of copy-on-write versioned backing store to make this efficient.

note - this is just key+hash(value) -> value, no extra hash.

Ah, that comment is outdated. Snapshotting and retrieving past version is outside the scope of this library. It can be accomplished at the db level. See https://github.com/cosmos/cosmos-sdk/blob/6a5a2de798d4e61302783ae84ecfe2958c7088d5/docs/architecture/adr-040-storage-and-smt-state-commitments.md.

Seems we would need some sort of copy-on-write versioned backing store to make this efficient.

Potentially yes, and that's an interesting direction, but as above it's outside the scope of this library. The caller can accomplish this by versioning the mapstores they pass in on SMT initialization.

TL;DR: use path -> value.

The caller can accomplish this by versioning the mapstores they pass in on SMT initialization.

This work is actually in support of that ADR - we'll be doing exactly that, so I'm basically indifferent to the indexing used here. I just didn't want to introduce breakage.

If just path is used, supporting past value access will be dropped, but I assume proofs for non-pruned roots should still be supported? Otherwise pruning effectively becomes the only behavior.

If just path is used, supporting past value access will be dropped, but I assume proofs for non-pruned roots should still be supported? Otherwise pruning effectively becomes the only behavior.

Good point. Proofs against past versions of the tree isn't needed in this library, since that can be accomplished caller-side. So, pruning should be the only behavior (making the prune flag unnecessary).

Proofs against old roots is a test case specified in the current version of this library. If this feature is to be removed because it's considered out of scope, we should be sure that this can easily be replicated caller-side with a KV store with snapshot support. What KV store supports this exactly? I checked RocksDB, but its snapshot feature is basically copying the entire DB directory, which clearly isn't efficient.

Or would the idea be that the caller can implement versioning themselves on top of any efficient KV store like this or this? If so, this doesn't seem to be suggested in the ADR, which seems to assume that the underlying KV store natively supports snapshotting.

The ADR states that the state commitment (i.e. the SMT, i.e. this library) isn't responsible for snapshots and historical queries, but rather the state storage is. Note that the state storage won't be a raw KV store, but will be a wrapper over one or more raw KV stores with additional logic.

State Storage requirements:

range queries

quick (key, value) access

creating a snapshot

historical versioning

pruning (garbage collection)

State Commitment requirements:

fast updates

tree path should be short

pruning (garbage collection)

While there has been some benchmarking done on Rocks DB etc snapshots for the ADR, this isn't a widely supported feature by KV stores, and it's yet to be seen how well this will work in practice. RocksDB snapshots seem to be designed with backups use case in mind; creating a new directory for every block could exhaust the inode limit on filesystems, given a chain with millions of blocks. BadgerDB seems to be more suitable as it's a key-value-version store, but this seems to be the only KV that supports this.

This library should act a general-purpose library. If someone wants to use this library for testing or research for example, they shouldn't have to setup complex extra scaffolding - or be tied to BadgerDB (the only KV store that seems to support efficient versioning) - to get the basic feature of accessing previous roots.

For flexibility, this library should ideally support the following two extremes:

Prune everything.

Prune nothing.

Anything in between could be implemented by the caller-supplied KV store; in the simple case, this could be delaying Del operations until e.g. 100 blocks. This would allow us to keep the ProveForRoot etc operations.

This could be implement by using a path -> value KV for this PR, and making pruning optional. The path -> value store (for efficient read ops), would only be for the latest version of the tree. However, the SMT would still need to store valueHash -> value in the nodes MapStore, to allow accessing value for old roots when pruning is disabled. When GetForRoot is called, it would get the leaf by traversing the tree directly, rather than using the path -> value store.

Alternatively: this library could support "prune everything" only, but keep storing valueHash -> value in the nodes MapStore (until pruned).

It would then be trivial for the caller to implement a "prune nothing" with MapStore scaffolding that simply does nothing when Del is called. GetForRoot and ProveForRoot can then still work. However, those functions would only work for callers that have used a MapStore with this hack.

This is probably the better / easiest option.

Hmm, actually I just realised that valueHash -> value can't be stored in either proposals, due to the ambiguity issue when pruning. So we need to think of another way to implement pruning while ideally allow for faster Get operations, without forcing people to use BadgerDB or implement their own versioning scheme, if they want to disable pruning.

smt.go

musalbas · 2021-06-21T18:34:01Z

In light of the above discussion, I'm favouring using the original PR #36 and changing the specification for the following reasons:

Using a path -> value store for faster gets, should actually be the responsibility of the caller, not the SMT, which should only be responsible for the tree itself. Hence, the idea of separating the state commitment library from the storage library. Adding this KV store, would cause them to be coupled, and increase complexity.
Accepting the above, adding a new path -> value store just to implement pruning is wasteful, when we can just make a breaking change to the spec.
(See discussion above.) Without introducing a breaking change or a lot of extra complexity, it doesn't seem possible to allow people to optionally disable pruning, without forcing them to use BadgerDB (the only well-known key-value-version store), or implement their own versioning, which also requires a key-value store implementation that supports efficient range queries. This is not desirable.
Structuring the tree so that values don't have ambiguity to which leaf they belong to for implementation purposes, seems like something that should be in the spec anyway.

adlerjohn · 2021-06-22T01:10:15Z

Adding a summary of some nuance discussed on Discord.

More important than breaking the specs (which is unimportant in the grand scheme of things) the real issue is that the current hash(value) -> value map, along with the key || hash(value) -> value map that replaces it in this PR (and the equivalent in #36), are redundant to the key -> value map that will be used in ADR-40 to allow for constant-time leaf reads. Redundant = space overhead that must be paid by all users of the library, not just users that want historical queries.

In other words, regardless of pruning orphaned nodes, this abstraction must be reworked from the current assumption that a single KV exists for both internal nodes and leaves to the assumption that two KVs exist, one for internal nodes and one for leaves. This second abstract coincidentally aligns exactly with the requirements for constant-time reads, and thus does not require any redundancy.

A suggested alternative would be to store key -> hash(value), then hash(value) -> value, adding a level of indirection. But this would then require reads to do two sequential disk accesses instead of one to get a leaf in constant time, which would be an enormous overhead.

Unfortunately, removing redundancy/this mandatory overhead means it becomes impossible to support historical queries, and pruning orphaned nodes must be enabled at all times.

roysc · 2021-06-22T06:07:35Z

This may be a silly question, but what about just removing Get value access entirely? The tree could just store its internal structure and value hashes, assuming only the responsibility of proof generation. That would completely avoid redundancy and leave value storage and versioning up to the caller.

tzdybal · 2021-06-22T08:59:20Z

This may be a silly question, but what about just removing Get value access entirely? The tree could just store its internal structure and value hashes, assuming only the responsibility of proof generation.

This will completely change how this library works. Right now it's a self-contained, general purpose Merkle Tree. Removing Get is a fundamental change, meaning that library can't be used "alone", without external store for values.

That would completely avoid redundancy and leave value storage and versioning up to the caller.

This will only push the problem up to the caller ;)

I agree with @adlerjohn that two KV stores should be used in library (which of course can be implemented as just prefix-buckets in single database).

roysc · 2021-06-23T03:58:24Z

This will only push the problem up to the caller ;)

Well, that's the idea. The library user can optimize for whatever they need, and the library doesn't have to try to solve for all cases. You could still provide a self-contained tree map that includes value access via one of the methods discussed. Anyway, just a thought.

musalbas · 2021-06-25T13:30:53Z

I agree that given a trade-off between duplicating data storage, remove the Get functionality, and removing the ability to access old state roots, the latter is the least undesirable.

If there's no other options, I'm in favour of having two KV stores, where one of the KV stores is a path -> value store, and dropping the functionality of accessing old state roots.

musalbas

I think we have consensus to change smv to a path -> value store, and drop the functionality of accessing old state roots within this library.

tac0turtle · 2021-07-02T09:52:21Z

What is the timeline for the proposed changes. I would like to coordinate an audit of the library soonish.

can short circuit when the same value is already set

roysc · 2021-07-02T11:21:09Z

@marbar3778 changes are pushed :) I will tag the owners for re-review.

deepsubtree.go

Co-authored-by: John Adler <adlerjohn@users.noreply.github.com>

proofs.go

stale

musalbas · 2021-07-08T12:57:38Z

The core PR LGMT, however there is some other things that should be do as a result of this PR:

Update the example in the readme.
Remove the API to get old roots.
Update Get so that it reads from the path->value store directly, instead of traversing the tree.

adlerjohn · 2021-07-08T13:05:14Z

I'll handle these in follow-up PRs so as to unblock @roysc. Good work on this PR!

* Orphans removal - simple test * Add Option type and AutoRemoveOrphans Using variadic arguments is very non-intrusive way to introduce options. This is very generic, and will allow to implement other options in the future. * Add missing comment * Implement removal of orphan nodes (#37) * Implement orphan pruning * Correct tests comments sha256(testKey) = 0001... sha256(testKey2) = 1000... sha256(foo) = 0010... * Remove redundant checks and copies no case produces nil sideNodes; nothing modifies the stored hashes * Refactor value storage Store values in a separate mapstore indexed by (hash(value)+key); allows pruning values Simplify pruning loops slightly as well * Fix array copy * Avoid extra key param * Fix idempotent Set can short circuit when the same value is already set * Deprecate past root access, remove prune as option * Index values by just path * Cleanup * Update deepsubtree.go Co-authored-by: John Adler <adlerjohn@users.noreply.github.com> * verifyProofWithUpdates - Fix unused return value Co-authored-by: John Adler <adlerjohn@users.noreply.github.com> Co-authored-by: Roy Crihfield <roy@manteia.ltd> Co-authored-by: John Adler <adlerjohn@users.noreply.github.com> Co-authored-by: Ismail Khoffi <Ismail.Khoffi@gmail.com>

roysc added 5 commits June 17, 2021 14:19

Implement orphan pruning

3b026d7

Correct tests comments

c355649

sha256(testKey) = 0001... sha256(testKey2) = 1000... sha256(foo) = 0010...

Remove redundant checks and copies

bb7c2b6

no case produces nil sideNodes; nothing modifies the stored hashes

Refactor value storage

a51be73

Store values in a separate mapstore indexed by (hash(value)+key); allows pruning values Simplify pruning loops slightly as well

Fix array copy

48e4ab9

This was referenced Jun 20, 2021

Automatic removal of orphan nodes #36

Closed

Automatic removal of orphan nodes vulcanize/smt#2

Closed

adlerjohn reviewed Jun 20, 2021

View reviewed changes

Avoid extra key param

a207801

musalbas mentioned this pull request Jun 21, 2021

Automatic removal of orphan nodes #24

Merged

5 tasks

musalbas previously requested changes Jun 25, 2021

View reviewed changes

roysc added 4 commits July 2, 2021 19:13

Fix idempotent Set

989c5e7

can short circuit when the same value is already set

Deprecate past root access, remove prune as option

37010f3

Index values by just path

92ecfa6

Cleanup

afcb9fe

roysc requested review from musalbas and adlerjohn July 2, 2021 11:21

adlerjohn reviewed Jul 4, 2021

View reviewed changes

deepsubtree.go Outdated Show resolved Hide resolved

Update deepsubtree.go

cef0b26

Co-authored-by: John Adler <adlerjohn@users.noreply.github.com>

adlerjohn approved these changes Jul 5, 2021

View reviewed changes

musalbas reviewed Jul 6, 2021

View reviewed changes

proofs.go Outdated Show resolved Hide resolved

verifyProofWithUpdates - Fix unused return value

c73c878

adlerjohn requested a review from musalbas July 6, 2021 21:46

musalbas approved these changes Jul 8, 2021

View reviewed changes

adlerjohn merged commit 763105b into celestiaorg:tzdybal/remove_orhpans Jul 8, 2021

This was referenced Jul 11, 2021

Fetch values immediately, updates #39

Merged

Use raw key as path instead of hash(key) #40

Open

i-norden mentioned this pull request Sep 11, 2021

ADR-040 Implementation: Hook state store and SMT into the SDK cosmos/cosmos-sdk#10117

Closed

7 tasks

liamsi mentioned this pull request Sep 16, 2021

Minor improvements #60

Merged

roysc deleted the tzdybal/remove_orhpans branch April 11, 2022 10:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement removal of orphan nodes #37

Implement removal of orphan nodes #37

roysc commented Jun 20, 2021

adlerjohn Jun 20, 2021

roysc Jun 20, 2021

adlerjohn Jun 20, 2021

roysc Jun 20, 2021

adlerjohn Jun 20, 2021

musalbas Jun 21, 2021 •

edited

Loading

adlerjohn Jun 21, 2021

musalbas Jun 21, 2021 •

edited

Loading

musalbas Jun 21, 2021 •

edited

Loading

musalbas Jun 21, 2021

musalbas commented Jun 21, 2021 •

edited

Loading

adlerjohn commented Jun 22, 2021 •

edited by musalbas

Loading

roysc commented Jun 22, 2021

tzdybal commented Jun 22, 2021

roysc commented Jun 23, 2021

musalbas commented Jun 25, 2021

musalbas left a comment

tac0turtle commented Jul 2, 2021

roysc commented Jul 2, 2021

musalbas commented Jul 8, 2021

adlerjohn commented Jul 8, 2021

Implement removal of orphan nodes #37

Implement removal of orphan nodes #37

Conversation

roysc commented Jun 20, 2021

adlerjohn Jun 20, 2021

Choose a reason for hiding this comment

roysc Jun 20, 2021

Choose a reason for hiding this comment

adlerjohn Jun 20, 2021

Choose a reason for hiding this comment

roysc Jun 20, 2021

Choose a reason for hiding this comment

adlerjohn Jun 20, 2021

Choose a reason for hiding this comment

musalbas Jun 21, 2021 • edited Loading

Choose a reason for hiding this comment

adlerjohn Jun 21, 2021

Choose a reason for hiding this comment

musalbas Jun 21, 2021 • edited Loading

Choose a reason for hiding this comment

musalbas Jun 21, 2021 • edited Loading

Choose a reason for hiding this comment

musalbas Jun 21, 2021

Choose a reason for hiding this comment

musalbas commented Jun 21, 2021 • edited Loading

adlerjohn commented Jun 22, 2021 • edited by musalbas Loading

roysc commented Jun 22, 2021

tzdybal commented Jun 22, 2021

roysc commented Jun 23, 2021

musalbas commented Jun 25, 2021

musalbas left a comment

Choose a reason for hiding this comment

tac0turtle commented Jul 2, 2021

roysc commented Jul 2, 2021

musalbas commented Jul 8, 2021

adlerjohn commented Jul 8, 2021

musalbas Jun 21, 2021 •

edited

Loading

musalbas Jun 21, 2021 •

edited

Loading

musalbas Jun 21, 2021 •

edited

Loading

musalbas commented Jun 21, 2021 •

edited

Loading

adlerjohn commented Jun 22, 2021 •

edited by musalbas

Loading