Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use reference counting to avoid over-pruning trie #93

Merged
merged 2 commits into from
Aug 23, 2019

Conversation

carver
Copy link
Contributor

@carver carver commented Aug 16, 2019

What was wrong?

Fixes #92

There is an example of where the current pruner incorrectly removes a
necessary node hash, because it got duplicated. See the new test:
test_hexary_trie_avoid_over_pruning()

How was it fixed?

Added an incremental reference counter, which only prunes nodes when the
number of usages drops to zero. It only works on fresh databases. With
existing databases, it would still delete required nodes.

Also, it skips an unnecessary database write, which would incorrectly
increment the reference counter.

Cute Animal Picture

Cute animal picture

Copy link
Member

@pipermerriam pipermerriam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct that this will not work unless the database is empty when given to the Trie class? I.E. we cannot use this in trinity with our level db database?

@carver
Copy link
Contributor Author

carver commented Aug 22, 2019

Am I correct that this will not work unless the database is empty when given to the Trie class? I.E. we cannot use this in trinity with our level db database?

Exactly, that's what I was trying to get at with the docstring:

Important note about Pruning:

Pruning against an existing database, with duplicate values, will
delete important data. Only turn on pruning with a fresh database.
If working with an existing database, use the :meth:squash_changes
context manager instead of turning pruning on directly.

Any idea how to say that more clearly?


FWIW, not being able to prune an on-disk DB was already an issue. It's not made worse by the current PR (hopefully made a little better by at least mentioning it in a docstring). In practice, we have only ever used pruning for a fresh in-memory database, or in a context like squash_changes() which will handle it correctly.

@pipermerriam
Copy link
Member

I'm inclined to just deprecate and remove the pruning feature since:

  1. databases that need pruning are the big ones.
  2. I don't see a compelling use case for pruning in-memory databases

ergo... our pruning feature isn't useful and adding complexity to support it doesn't seem worth it.

@carver
Copy link
Contributor Author

carver commented Aug 22, 2019

2. I don't see a compelling use case for pruning in-memory databases

Two come to mind:

  1. It's used as part of squash_changes(). Since we don't have a batch-add, every time you add N values, you create N-1 state roots that you don't need anymore (plus all the downstream nodes that change). If we don't prune those out in memory, we would persist all of them to the database.
  2. The tiny transaction and receipt tries are built in memory, using a pruning trie. For basically the same reason as 1, they would write a bunch of unnecessary trie nodes to disk if we didn't prune along the way.

It's possible that we might be able to hide away the prune keyword somehow and force squash_changes as the only approach to do that. I think we could cover both use cases that way, but I'm not sure how it would look to launch a pruning trie from inside squash_changes() (which uses the prune keyword internally right now).

@pipermerriam
Copy link
Member

It's possible that we might be able to hide away the prune keyword somehow and force squash_changes as the only approach to do that.

This direction seems preferable.

There is an example of where the current pruner incorrectly removes a
necessary node hash, because it got duplicated. See the new test:
test_hexary_trie_avoid_over_pruning()

Added an incremental reference counter, which only prunes nodes when the
number of usages drops to zero. It only works on fresh databases. With
existing databases, it would still delete required nodes.

Also, skip an unnecessary database write, which now incorrectly
increments the reference counter, and causes a failure in the reference
counter.
@carver
Copy link
Contributor Author

carver commented Aug 23, 2019

It's possible that we might be able to hide away the prune keyword somehow and force squash_changes as the only approach to do that.

This direction seems preferable.

Cool, I noted it as not for external usage, and that it is likely to be deprecated. I also added an issue to change how pruning is enabled, to make it an internal API.

trie/hexary.py Outdated
self._prune_key(prune_key)

def _prune_key(self, key):
self.ref_count[key] -= 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should have an existence check since if the key were to not be in the database this would end up with the refcount being negative.

@carver carver force-pushed the pruning-with-ref-counter branch from 8f7a10b to 22e21fd Compare August 23, 2019 20:56
@carver carver merged commit 3be5646 into ethereum:master Aug 23, 2019
@carver carver deleted the pruning-with-ref-counter branch August 23, 2019 21:09
pacrob added a commit to pacrob/py-trie that referenced this pull request May 12, 2023
…sion

bump sphinx version and set py version rtd uses to 3.8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hexary trie prunes when it shouldn't
2 participants