Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesign the IAVL tree with pruning in mind #144

Closed
zmanian opened this issue Jun 15, 2019 · 10 comments
Closed

Redesign the IAVL tree with pruning in mind #144

zmanian opened this issue Jun 15, 2019 · 10 comments
Milestone

Comments

@zmanian
Copy link
Member

zmanian commented Jun 15, 2019

When we original designed the IAVL tree, the application state did not prune old state and all versions were kept in memory.

The current default behavior is that we prune we keep the last 100 versions and prune the 101st version. If the application has a the behavior of having "hot" keys that updated on every block, this effective doubles the disk i/o.

At the moment, I think the biggest improvement in IAVL i/o performance is that we introduce awareness of the pruning strategy into the IAVL tree so that generally we don't write values to the database except at snapshot heights.

If pruning is disabled, all values are persisted to the database.

@yutianwu
Copy link

yutianwu commented Jul 3, 2019

If we do not write values to database every block, that will improve throughput dramatically. But we need to replay all blocks from latest saved state when we restart node.

do we have a plan to do this

@zmanian
Copy link
Member Author

zmanian commented Jul 3, 2019

yeah this would be need.

IF there is a graceful shutdown, we just flush to disk before we shutdown but block replay would be great for recovery in a panic

@zmanian
Copy link
Member Author

zmanian commented Jul 3, 2019

See #150

@yutianwu
Copy link

yutianwu commented Jul 4, 2019

Actually, if we do save LastCommit when we do not save IAVL state, then blocks will be replayed automatically for the difference between state height and block height. A graceful shutdown will surely help to save the replay work.

So besides the IAVL change you mentioned, we also need to do some changes on cosmos Commit stage.

@AdityaSripal
Copy link
Member

AdityaSripal commented Jul 9, 2019

Currently working on this by building on top of loom PR (#150).

Current design

MutableTree has extra fields:

// Pruning fields
keepEvery  int64n // Saves version to disk periodically
keepRecent int64  // Saves recent versions in memory

NodeDB has extra fields:

memDb    dbm.DB     // Memory node storage.
memBatch dbm.Batch  // Batched writing buffer for memDB.

On SaveRoot, the IAVL checks if the version should be persisted to disk (version % keepEvery == 0)

If version is not going to be persisted to disk, the version is simply saved in memDB
If version is persisted to disk, the version is written to memDB and levelDB

When version n is saved, version n - keepRecent is deleted from memDB. Thus, memDB always contains keepRecent versions of the tree.

Orphans:

Save orphan to memDB under o|toVersion|fromVersion.

If there exists snapshot version snapVersion s.t. fromVersion < snapVersion < toVersion, save orphan to disk as well under o|snapVersion|fromVersion.
NOTE: in unlikely event, that two snapshot versions exist between fromVersion and toVersion, we use closest snapshot version that is less than toVersion

Can then simply use the old delete algorithm with some minor simplifications/optimizations

Open Questions:

  1. Currently recently persisted versions exist both in memDB and levelDB. This is so that retreiving a recently persisted version is fast. However, it introduces minor duplication (not a problem in any sane pruning strategy).

  2. Currently recent versions are saved to memDB. Is this better than simply storing in a map key => *Node like loom currently does? I'm not sure what the tradeoffs are.
    Decision: Decided to go with using memDB since it already implements DB interface (iterating, etc). Also, could switch out memDB for something else later so long as it respects DB interface.

  3. Now that memDB acts as a recent version "cache" for levelDB, need to specify the use (if any) for LRU cache. My thinking is that this will be used to cache old nodes (version < latest - keepRecent) that are frequently called by GetNode. But have to make sure that LRU cache's purpose is strictly defined (when does a node get added to cache?) and enforced.

  4. Currently all traverse functions traverse over single levelDB. This will have to be refactored to allow traversing over levelDB, memDB, or both. Should replace all current traversal calls with the appropriate new traversal function.
    Currently implementing:

  5. We can flush any versions in memDB to disk in event of graceful shutdown, how do we restart node correctly?
    Current thinking: Regardless of whether there is a graceful shutdown or not, on recovery, we reverse-iterate for the latestVersion stored on disk (and refill memDB if necessary).

from conversation with @jackzampolin

@zmanian
Copy link
Member Author

zmanian commented Jul 9, 2019

For recovery, Tendermint store metadata on the LastCommit and then replays all past block. Tendermint will need to know how to back up to last save commit and then replay.

Current thinking is that isn't necessarily hard.

@jackzampolin
Copy link
Member

Sounds like you should write up an approach for points 3 and 4 above and we can get some feedback on those.

@tac0turtle
Copy link
Member

This is closed with PR correct @AdityaSripal

@tac0turtle
Copy link
Member

Reopening this as a potential work scope for future iavl work.

@tac0turtle
Copy link
Member

closing this as the goal is referenced in #140

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants