Skip to content
This repository has been archived by the owner on Feb 27, 2023. It is now read-only.

Implement removal of orphan nodes #37

Merged

Conversation

roysc
Copy link
Contributor

@roysc roysc commented Jun 20, 2021

Refactors the storage of values into a separate KV store, indexed by hash(value)+key. As noted in #14 loses the benefit of deduplication, while enabling pruning of records.

Completes work from #24.
Revises work from #36.

sha256(testKey)  = 0001...
sha256(testKey2) = 1000...
sha256(foo)      = 0010...
no case produces nil sideNodes; nothing modifies the stored hashes
Store values in a separate mapstore indexed by (hash(value)+key);
allows pruning values

Simplify pruning loops slightly as well
deepsubtree.go Outdated
if !result {
return ErrBadProof
}

if valueHash != nil {
if err := dsmst.values.Set(valueKey(key, valueHash), value); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the first use, but conceptually is there a reason to map hash(key, hash(value)) -> value? Why not just hash(key) -> value (i.e. path -> value)? Mapping path -> value allows retrieval to the value knowing only the key, which is necessary for e.g. constant-time reads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With just key/path -> value, there is no way to retrieve values for past versions. I considered using root+key as @tzdybal suggested, but then lookup requires knowing the root at the value's last update. Seems we would need some sort of copy-on-write versioned backing store to make this efficient.

note - this is just key+hash(value) -> value, no extra hash.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that comment is outdated. Snapshotting and retrieving past version is outside the scope of this library. It can be accomplished at the db level. See https://github.com/cosmos/cosmos-sdk/blob/6a5a2de798d4e61302783ae84ecfe2958c7088d5/docs/architecture/adr-040-storage-and-smt-state-commitments.md.

Seems we would need some sort of copy-on-write versioned backing store to make this efficient.

Potentially yes, and that's an interesting direction, but as above it's outside the scope of this library. The caller can accomplish this by versioning the mapstores they pass in on SMT initialization.

TL;DR: use path -> value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caller can accomplish this by versioning the mapstores they pass in on SMT initialization.

This work is actually in support of that ADR - we'll be doing exactly that, so I'm basically indifferent to the indexing used here. I just didn't want to introduce breakage.

If just path is used, supporting past value access will be dropped, but I assume proofs for non-pruned roots should still be supported? Otherwise pruning effectively becomes the only behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If just path is used, supporting past value access will be dropped, but I assume proofs for non-pruned roots should still be supported? Otherwise pruning effectively becomes the only behavior.

Good point. Proofs against past versions of the tree isn't needed in this library, since that can be accomplished caller-side. So, pruning should be the only behavior (making the prune flag unnecessary).

Copy link
Member

@musalbas musalbas Jun 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proofs against old roots is a test case specified in the current version of this library. If this feature is to be removed because it's considered out of scope, we should be sure that this can easily be replicated caller-side with a KV store with snapshot support. What KV store supports this exactly? I checked RocksDB, but its snapshot feature is basically copying the entire DB directory, which clearly isn't efficient.

Or would the idea be that the caller can implement versioning themselves on top of any efficient KV store like this or this? If so, this doesn't seem to be suggested in the ADR, which seems to assume that the underlying KV store natively supports snapshotting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ADR states that the state commitment (i.e. the SMT, i.e. this library) isn't responsible for snapshots and historical queries, but rather the state storage is. Note that the state storage won't be a raw KV store, but will be a wrapper over one or more raw KV stores with additional logic.

State Storage requirements:

  • range queries
  • quick (key, value) access
  • creating a snapshot
  • historical versioning
  • pruning (garbage collection)

State Commitment requirements:

  • fast updates
  • tree path should be short
  • pruning (garbage collection)

Copy link
Member

@musalbas musalbas Jun 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While there has been some benchmarking done on Rocks DB etc snapshots for the ADR, this isn't a widely supported feature by KV stores, and it's yet to be seen how well this will work in practice. RocksDB snapshots seem to be designed with backups use case in mind; creating a new directory for every block could exhaust the inode limit on filesystems, given a chain with millions of blocks. BadgerDB seems to be more suitable as it's a key-value-version store, but this seems to be the only KV that supports this.

This library should act a general-purpose library. If someone wants to use this library for testing or research for example, they shouldn't have to setup complex extra scaffolding - or be tied to BadgerDB (the only KV store that seems to support efficient versioning) - to get the basic feature of accessing previous roots.

For flexibility, this library should ideally support the following two extremes:

  • Prune everything.
  • Prune nothing.

Anything in between could be implemented by the caller-supplied KV store; in the simple case, this could be delaying Del operations until e.g. 100 blocks. This would allow us to keep the ProveForRoot etc operations.

This could be implement by using a path -> value KV for this PR, and making pruning optional. The path -> value store (for efficient read ops), would only be for the latest version of the tree. However, the SMT would still need to store valueHash -> value in the nodes MapStore, to allow accessing value for old roots when pruning is disabled. When GetForRoot is called, it would get the leaf by traversing the tree directly, rather than using the path -> value store.

Copy link
Member

@musalbas musalbas Jun 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively: this library could support "prune everything" only, but keep storing valueHash -> value in the nodes MapStore (until pruned).

It would then be trivial for the caller to implement a "prune nothing" with MapStore scaffolding that simply does nothing when Del is called. GetForRoot and ProveForRoot can then still work. However, those functions would only work for callers that have used a MapStore with this hack.

This is probably the better / easiest option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually I just realised that valueHash -> value can't be stored in either proposals, due to the ambiguity issue when pruning. So we need to think of another way to implement pruning while ideally allow for faster Get operations, without forcing people to use BadgerDB or implement their own versioning scheme, if they want to disable pruning.

smt.go Outdated Show resolved Hide resolved
smt.go Outdated Show resolved Hide resolved
@musalbas musalbas mentioned this pull request Jun 21, 2021
5 tasks
@musalbas
Copy link
Member

musalbas commented Jun 21, 2021

In light of the above discussion, I'm favouring using the original PR #36 and changing the specification for the following reasons:

  • Using a path -> value store for faster gets, should actually be the responsibility of the caller, not the SMT, which should only be responsible for the tree itself. Hence, the idea of separating the state commitment library from the storage library. Adding this KV store, would cause them to be coupled, and increase complexity.
  • Accepting the above, adding a new path -> value store just to implement pruning is wasteful, when we can just make a breaking change to the spec.
  • (See discussion above.) Without introducing a breaking change or a lot of extra complexity, it doesn't seem possible to allow people to optionally disable pruning, without forcing them to use BadgerDB (the only well-known key-value-version store), or implement their own versioning, which also requires a key-value store implementation that supports efficient range queries. This is not desirable.
  • Structuring the tree so that values don't have ambiguity to which leaf they belong to for implementation purposes, seems like something that should be in the spec anyway.

@adlerjohn
Copy link
Member

adlerjohn commented Jun 22, 2021

Adding a summary of some nuance discussed on Discord.

More important than breaking the specs (which is unimportant in the grand scheme of things) the real issue is that the current hash(value) -> value map, along with the key || hash(value) -> value map that replaces it in this PR (and the equivalent in #36), are redundant to the key -> value map that will be used in ADR-40 to allow for constant-time leaf reads. Redundant = space overhead that must be paid by all users of the library, not just users that want historical queries.

In other words, regardless of pruning orphaned nodes, this abstraction must be reworked from the current assumption that a single KV exists for both internal nodes and leaves to the assumption that two KVs exist, one for internal nodes and one for leaves. This second abstract coincidentally aligns exactly with the requirements for constant-time reads, and thus does not require any redundancy.

A suggested alternative would be to store key -> hash(value), then hash(value) -> value, adding a level of indirection. But this would then require reads to do two sequential disk accesses instead of one to get a leaf in constant time, which would be an enormous overhead.

Unfortunately, removing redundancy/this mandatory overhead means it becomes impossible to support historical queries, and pruning orphaned nodes must be enabled at all times.

@roysc
Copy link
Contributor Author

roysc commented Jun 22, 2021

This may be a silly question, but what about just removing Get value access entirely? The tree could just store its internal structure and value hashes, assuming only the responsibility of proof generation. That would completely avoid redundancy and leave value storage and versioning up to the caller.

@tzdybal
Copy link
Member

tzdybal commented Jun 22, 2021

This may be a silly question, but what about just removing Get value access entirely? The tree could just store its internal structure and value hashes, assuming only the responsibility of proof generation.

This will completely change how this library works. Right now it's a self-contained, general purpose Merkle Tree. Removing Get is a fundamental change, meaning that library can't be used "alone", without external store for values.

That would completely avoid redundancy and leave value storage and versioning up to the caller.

This will only push the problem up to the caller ;)

I agree with @adlerjohn that two KV stores should be used in library (which of course can be implemented as just prefix-buckets in single database).

@roysc
Copy link
Contributor Author

roysc commented Jun 23, 2021

This will only push the problem up to the caller ;)

Well, that's the idea. The library user can optimize for whatever they need, and the library doesn't have to try to solve for all cases. You could still provide a self-contained tree map that includes value access via one of the methods discussed. Anyway, just a thought.

@musalbas
Copy link
Member

I agree that given a trade-off between duplicating data storage, remove the Get functionality, and removing the ability to access old state roots, the latter is the least undesirable.

If there's no other options, I'm in favour of having two KV stores, where one of the KV stores is a path -> value store, and dropping the functionality of accessing old state roots.

Copy link
Member

@musalbas musalbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have consensus to change smv to a path -> value store, and drop the functionality of accessing old state roots within this library.

@tac0turtle
Copy link
Contributor

What is the timeline for the proposed changes. I would like to coordinate an audit of the library soonish.

@roysc
Copy link
Contributor Author

roysc commented Jul 2, 2021

@marbar3778 changes are pushed :) I will tag the owners for re-review.

@roysc roysc requested review from musalbas and adlerjohn July 2, 2021 11:21
deepsubtree.go Outdated Show resolved Hide resolved
Co-authored-by: John Adler <adlerjohn@users.noreply.github.com>
proofs.go Outdated Show resolved Hide resolved
@musalbas
Copy link
Member

musalbas commented Jul 8, 2021

The core PR LGMT, however there is some other things that should be do as a result of this PR:

  1. Update the example in the readme.
  2. Remove the API to get old roots.
  3. Update Get so that it reads from the path->value store directly, instead of traversing the tree.

@adlerjohn
Copy link
Member

I'll handle these in follow-up PRs so as to unblock @roysc. Good work on this PR!

@adlerjohn adlerjohn merged commit 763105b into celestiaorg:tzdybal/remove_orhpans Jul 8, 2021
adlerjohn added a commit that referenced this pull request Jul 9, 2021
* Orphans removal - simple test

* Add Option type and AutoRemoveOrphans

Using variadic arguments is very non-intrusive way to introduce options.
This is very generic, and will allow to implement other options in the
future.

* Add missing comment

* Implement removal of orphan nodes (#37)

* Implement orphan pruning

* Correct tests comments

sha256(testKey)  = 0001...
sha256(testKey2) = 1000...
sha256(foo)      = 0010...

* Remove redundant checks and copies

no case produces nil sideNodes; nothing modifies the stored hashes

* Refactor value storage

Store values in a separate mapstore indexed by (hash(value)+key);
allows pruning values

Simplify pruning loops slightly as well

* Fix array copy

* Avoid extra key param

* Fix idempotent Set

can short circuit when the same value is already set

* Deprecate past root access, remove prune as option

* Index values by just path

* Cleanup

* Update deepsubtree.go

Co-authored-by: John Adler <adlerjohn@users.noreply.github.com>

* verifyProofWithUpdates - Fix unused return value

Co-authored-by: John Adler <adlerjohn@users.noreply.github.com>

Co-authored-by: Roy Crihfield <roy@manteia.ltd>
Co-authored-by: John Adler <adlerjohn@users.noreply.github.com>
Co-authored-by: Ismail Khoffi <Ismail.Khoffi@gmail.com>
@liamsi liamsi mentioned this pull request Sep 16, 2021
@roysc roysc deleted the tzdybal/remove_orhpans branch April 11, 2022 10:05
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants