Poor SaveVersion() performance when using PruningOptions #256

Closed
Lbird opened this issue Jun 4, 2020 · 33 comments
Assignees: erikgrinaker
Labels: T:bug Type: Bug (Confirmed), T:performance

Comments

@Lbird

Lbird commented Jun 4, 2020

The pruning option introduced to IAVL in #144 is great. It can save a huge portion of the disk I/O burden. Nice approach. But there is a problem when dealing with a large database. Suppose the following case:

  • KeepEvery = something around 100 or 1000
  • KeepRecent = 1, which I think is the recommended pruning option
  • MutableTree size = 1,000,000
  • updates included in each version = 5
  • resulting number of nodes to save ~= 100 (depending on the tree height)
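
(For scale: with ~1,000,000 leaves the tree height is about log2(1,000,000) ≈ 20, so 5 updates per version touch at most roughly 5 × 20 = 100 distinct nodes, which is where the ~100 figure above comes from.)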

On a fresh start, recentDB is empty and grows over time. Between snapshot versions (or checkpoints), each call to MutableTree.Set() saves a node in recentDB; depending on the tree structure, several internal nodes are touched as well. When we reach a snapshot or checkpoint version, MutableTree.SaveVersion() needs to save those nodes from recentDB to snapshotDB. The current code "clears" root.leftNode and root.rightNode. After some updates, some of the nodes are retrieved from recentDB or snapshotDB and mutated, while untouched branches remain nil. And when we reach the next snapshot version, a problem arises.

An untouched branch is marked by nil in the in-memory tree. The current SaveBranch() traversal calls ndb.GetNode() when a node is marked as nil. If this node happens to have been saved at the previous snapshot version, it resides in both recentDB and snapshotDB, but it is loaded with node.saved = true and node.persisted = false, even though it is already on disk and has been untouched since that snapshot. Since node.persisted is false, SaveBranch() is called on this node. This node also has nil children, so ndb.GetNode() is called again, and the process goes on. Eventually almost (all?) nodes in recentDB are saved to snapshotDB again. And there is a bigger problem: there is no cleaning procedure for recentDB. As the tree mutates over time, recentDB grows indefinitely until it stores almost all the nodes in the tree in MEMORY. Suppose you have a 100GB snapshotDB; that should not happen. And as recentDB grows, the SaveVersion() duration increases almost linearly. That should not happen either.
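
To make the traversal issue concrete, here is a minimal sketch with simplified, hypothetical types (not the actual iavl code) of how nil branches get reloaded and re-saved:

package sketch

// Hypothetical, simplified types for illustration only; not the real iavl types.
type Node struct {
	hash, leftHash, rightHash []byte
	leftNode, rightNode       *Node
	saved, persisted          bool
}

type nodeDB struct{} // recentDB, snapshotDB and batches elided

// GetNode stands in for the real loader: it returns nodes with saved = true
// but persisted = false whenever the node is found in recentDB, even if the
// same node also exists in snapshotDB.
func (ndb *nodeDB) GetNode(hash []byte) *Node { return &Node{hash: hash, saved: true} }

// SaveNode stands in for the real writer (recentBatch / snapshotBatch elided).
func (ndb *nodeDB) SaveNode(n *Node, flushToDisk bool) { n.saved = true; n.persisted = flushToDisk }

// saveBranch mirrors the traversal described above: untouched children are nil
// in memory and come back from GetNode with persisted == false, so the
// recursion keeps descending and rewriting subtrees that are already on disk.
func saveBranch(ndb *nodeDB, node *Node) []byte {
	if node == nil || node.persisted {
		return nil // already on disk (or past a leaf), stop here
	}
	if node.leftNode == nil && node.leftHash != nil {
		node.leftNode = ndb.GetNode(node.leftHash)
	}
	if node.rightNode == nil && node.rightHash != nil {
		node.rightNode = ndb.GetNode(node.rightHash)
	}
	saveBranch(ndb, node.leftNode)
	saveBranch(ndb, node.rightNode)
	ndb.SaveNode(node, true) // flushToDisk at a snapshot version
	return node.hash
}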

There could be a design change, I suppose: redefining recentDB as a write-through cache, introducing a "dirty" field in the nodes stored in recentDB, or whatever. However, we can handle this with a simple remedy for now:

diff --git a/nodedb.go b/nodedb.go
index afa2e29..724f952 100644
--- a/nodedb.go
+++ b/nodedb.go
@@ -146,14 +146,14 @@ func (ndb *nodeDB) SaveNode(node *Node, flushToDisk bool) {
                panic(err)
        }

-       if !node.saved {
-               node.saved = true
-               ndb.recentBatch.Set(ndb.nodeKey(node.hash), buf.Bytes())
-       }
        if flushToDisk {
                ndb.snapshotBatch.Set(ndb.nodeKey(node.hash), buf.Bytes())
+               ndb.recentBatch.Delete(ndb.nodeKey(node.hash))
                node.persisted = true
                node.saved = true
+       } else if !node.saved {
+               node.saved = true
+               ndb.recentBatch.Set(ndb.nodeKey(node.hash), buf.Bytes())
        }
 }
@erikgrinaker
Contributor

Hi @Lbird, thank you for the thorough report and analysis! Your proposed patch seems reasonable; we'll have a closer look at this.

erikgrinaker self-assigned this Jun 4, 2020
erikgrinaker added the T:bug Type: Bug (Confirmed) and T:performance labels Jun 4, 2020
@zmanian
Member

zmanian commented Jun 4, 2020

Okay, this is exactly what I'm seeing on the Game of Zones nodes.

@zmanian
Member

zmanian commented Jun 4, 2020

What I'm going to do is make an iavl-0.13.4-beta branch with the proposed fix and try it on the Game of Zones hub.

I want to take the opportunity to try this on a high-load network.

zmanian added a commit to zmanian/iavl that referenced this issue Jun 4, 2020
@zmanian
Member

zmanian commented Jun 4, 2020

Running zmanian@5f042ac on the Game of Zones hub now.

Monitoring for changes in behavior. It took several hours after a restart for RAM to really blow up.

@zmanian
Member

zmanian commented Jun 5, 2020

zmanian@5f042ac appears to have slowed the memory growth but not stopped it.
6 hours later and we are 8GB

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  10358 root      20   0   10.1g   8.0g  31184 S 408.6  20.9 336:32.18 gaiad

@Lbird
Author

Lbird commented Jun 5, 2020

zmanian@5f042ac appears to have slowed the memory growth but not stopped it.
6 hours later and we are 8GB

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  10358 root      20   0   10.1g   8.0g  31184 S 408.6  20.9 336:32.18 gaiad

Maybe we need to avoid using MemDB altogether. MemDB does not seem adequate for long-term use.

@Lbird
Author

Lbird commented Jun 5, 2020

I remember seeing a code snippet in cosmos-sdk (I forget the exact location) that sets the pruning option to

  • keepEvery > 1
  • keepRecent = 1

And I think that is the intended use of the pruning option. That being said, there is no use in keeping multiple versions in recentDB, which means we can close and remove recentDB after we save the tree to snapshotDB, and then open a fresh recentDB for the intermediate versions until the next snapshot version. It would solve the memory problem as well. Any objection to this idea? I can make a PR for it.

@Lbird
Author

Lbird commented Jun 5, 2020

I remember seeing a code snippet in cosmos-sdk (I forget the exact location) that sets the pruning option to

  • keepEvery > 1
  • keepRecent = 1

And I think that is the intended use of the pruning option. That being said, there is no use in keeping multiple versions in recentDB, which means we can close and remove recentDB after we save the tree to snapshotDB, and then open a fresh recentDB for the intermediate versions until the next snapshot version. It would solve the memory problem as well. Any objection to this idea? I can make a PR for it.

The above idea requires a non-trivial interface change, since iavl.NewMutableTreeWithOpts() is public and receives recentDB as an argument, so we cannot guarantee that recentDB is always a MemDB.

Anyway, I think there are two options to solve the issue:

  • Add a periodic purging procedure for the recentDB, along with some interface change or hiding iavl.NewMutableTreeWithOpts() from the package interface (see the sketch below).
  • Since it seems that https://github.com/google/btree is responsible for the constant memory increase, fork btree and find and fix the problem within btree. Or find another ordered tree implementation to replace Google's btree.
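
For the first option, what I have in mind is roughly the following (a hypothetical helper sketched against the existing nodeDB fields, not a final API):

import dbm "github.com/tendermint/tm-db"

// resetRecentDB is a hypothetical helper: right after a snapshot version has
// been flushed to snapshotDB, drop the old in-memory recentDB and start a
// fresh one, so its size is bounded by a single KeepEvery interval. This
// assumes recentDB is always a tm-db MemDB, which is exactly why
// iavl.NewMutableTreeWithOpts() would need to be hidden or changed.
func (ndb *nodeDB) resetRecentDB() {
	// The previous MemDB (and its internal btree) becomes unreachable and
	// can be garbage collected once nothing else references it.
	ndb.recentDB = dbm.NewMemDB()
	ndb.recentBatch = ndb.recentDB.NewBatch()
}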

@erikgrinaker
Contributor

Since it seems that https://github.com/google/btree is responsible for the constant memory increase, fork btree and find and fix the problem within btree.

Interesting, are you seeing an actual memory leak in btree? It's not just that we're accumulating garbage data in the database, or a Go GC issue?

@Lbird
Author

Lbird commented Jun 5, 2020

Since it seems that https://github.com/google/btree is responsible for the constant memory increase, fork btree and find and fix the problem within btree.

Interesting, are you seeing an actual memory leak in btree?

I'm not 100% sure yet, and I'm not sure it is a leak. But I've seen a real RES increase over time in a sample program. tm-db's MemDB.Delete() is quite simple, and I couldn't find any flaws in the MemDB code, so my guess is that the underlying btree is responsible for the memory increase. I've seen some sub-slicing in btree's code; I'll take a closer look.
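
For context, this is the kind of thing I mean by sub-slicing (a generic Go illustration, not btree code): a tiny sub-slice keeps the whole backing array reachable.

package main

import "fmt"

// keepTail returns an 8-byte sub-slice of a 1 MiB slice. The sub-slice shares
// the original backing array, so the whole 1 MiB stays reachable for as long
// as the returned slice is referenced.
func keepTail() []byte {
	big := make([]byte, 1<<20)
	return big[len(big)-8:]
}

// copyTail copies the bytes out instead, so the large array can be collected.
func copyTail() []byte {
	big := make([]byte, 1<<20)
	tail := make([]byte, 8)
	copy(tail, big[len(big)-8:])
	return tail
}

func main() {
	a := keepTail() // retains ~1 MiB of backing array
	b := copyTail() // retains 8 bytes
	fmt.Println(len(a), len(b))
}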

@zmanian
Member

zmanian commented Jun 5, 2020

One thing I've observed is that when the btree gets large, ctrl-c on the node takes >10 min and maxes out all available CPUs to shut down.

@zmanian
Member

zmanian commented Jun 5, 2020

And I think that is the intended use of the pruning option. That being said, there is no use in keeping multiple versions in recentDB, which means we can close and remove recentDB after we save the tree to snapshotDB, and then open a fresh recentDB for the intermediate versions until the next snapshot version. It would solve the memory problem as well. Any objection to this idea? I can make a PR for it.

I'm interested in trying out a PR with this change.

@AdityaSripal
Member

Hi @Lbird, thank you for the detailed issue!! Your initial proposed fix makes sense, though perhaps an eventual fix should involve GetNode setting the node.persisted flag correctly if it exists in both recentDB and snapshotDB (which it currently doesn't as outlined by your initial comment).

// Doesn't exist, load.
buf, err := ndb.recentDB.Get(ndb.nodeKey(hash))
if err != nil {
	panic(fmt.Sprintf("can't get node %X: %v", hash, err))
}
persisted := false
if buf == nil {
	// Doesn't exist, load from disk
	buf, err = ndb.snapshotDB.Get(ndb.nodeKey(hash))
	if err != nil {
		panic(err)
	}
	if buf == nil {
		panic(fmt.Sprintf("Value missing for hash %x corresponding to nodeKey %s", hash, ndb.nodeKey(hash)))
	}
	persisted = true
}

Purging and restarting the memDB seems like something worth trying, though I don't see why it would be an issue if keepRecent=1? Shouldn't the Go GC purge every node that isn't in the latest tree anyway?
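
Roughly like this, as a sketch only (note it adds a snapshotDB lookup to reads that are served from recentDB):

// Sketch: also check snapshotDB when the node was found in recentDB, so that
// persisted is set correctly for nodes present in both databases.
buf, err := ndb.recentDB.Get(ndb.nodeKey(hash))
if err != nil {
	panic(fmt.Sprintf("can't get node %X: %v", hash, err))
}
persisted := false
if buf == nil {
	// Not in recentDB, load from disk.
	buf, err = ndb.snapshotDB.Get(ndb.nodeKey(hash))
	if err != nil {
		panic(err)
	}
	if buf == nil {
		panic(fmt.Sprintf("Value missing for hash %x corresponding to nodeKey %s", hash, ndb.nodeKey(hash)))
	}
	persisted = true
} else {
	// In recentDB, but it may already be persisted in snapshotDB as well.
	diskBuf, diskErr := ndb.snapshotDB.Get(ndb.nodeKey(hash))
	if diskErr == nil && diskBuf != nil {
		persisted = true
	}
}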

@Lbird
Author

Lbird commented Jun 8, 2020

Hi @Lbird, thank you for the detailed issue!! Your initial proposed fix makes sense, though perhaps an eventual fix should involve GetNode setting the node.persisted flag correctly if it exists in both recentDB and snapshotDB (which it currently doesn't as outlined by your initial comment).

// Doesn't exist, load.
buf, err := ndb.recentDB.Get(ndb.nodeKey(hash))
if err != nil {
	panic(fmt.Sprintf("can't get node %X: %v", hash, err))
}
persisted := false
if buf == nil {
	// Doesn't exist, load from disk
	buf, err = ndb.snapshotDB.Get(ndb.nodeKey(hash))
	if err != nil {
		panic(err)
	}
	if buf == nil {
		panic(fmt.Sprintf("Value missing for hash %x corresponding to nodeKey %s", hash, ndb.nodeKey(hash)))
	}
	persisted = true
}

I thought about that too, but it will harm general performance at the intermediate versions between snapshots. I also thought about adding a new method like GetNodeToSave(), but I think that is ugly.

Purging and restarting the memDB seems like something worth trying, though I don't see why it would be an issue if keepRecent=1? Shouldn't the Go GC purge every node that isn't in the latest tree anyway?

I see; if everything works as expected, there should be no problem. The MemDB code instructs the btree code to delete obsolete nodes, but something is wrong somewhere between MemDB and btree.

Output from my sample program:

yhyang@yhyang-Mac:~/src/amolabs/iavl_prune_test% ./iavl_test 1000 100000 5
iavl_test
map[database.size:0 database.type:memDB]
----
tree version: 178100 size: 1390500 772.172µs [ 97 0 ]
map[database.size:6937 database.type:memDB]
----
tree version: 178200 size: 1391000 1.297618ms [ 104 0 ]
map[database.size:12744 database.type:memDB]
----
tree version: 178300 size: 1391500 753.68µs [ 97 0 ]
map[database.size:18242 database.type:memDB]
----
tree version: 178400 size: 1392000 787.37µs [ 105 0 ]
map[database.size:23460 database.type:memDB]
----
tree version: 178500 size: 1392500 1.38724ms [ 105 0 ]
map[database.size:28539 database.type:memDB]
----
tree version: 178600 size: 1393000 805.08µs [ 103 0 ]
map[database.size:33535 database.type:memDB]
----
tree version: 178700 size: 1393500 912.556µs [ 105 0 ]
map[database.size:38361 database.type:memDB]
----
tree version: 178800 size: 1394000 1.067723ms [ 108 0 ]
map[database.size:42994 database.type:memDB]
----
tree version: 178900 size: 1394500 849.831µs [ 105 0 ]
map[database.size:47537 database.type:memDB]
----
tree version: 179000 size: 1395000 1.926445532s [ 0 52012 ]
map[database.size:1 database.type:memDB]
----
tree version: 179100 size: 1395500 666.931µs [ 108 0 ]
map[database.size:6946 database.type:memDB]

The MemDB size is initially zero, but after saving at tree version 179000 the size becomes 1, not zero. Maybe that is a root node?

@Lbird
Author

Lbird commented Jun 8, 2020

The remaining item turned out to be the version root information from saveRoot(). I applied the following patch:

diff --git a/nodedb.go b/nodedb.go
index b28b664..852f427 100644
--- a/nodedb.go
+++ b/nodedb.go
@@ -767,9 +767,11 @@ func (ndb *nodeDB) saveRoot(hash []byte, version int64, flushToDisk bool) error

        key := ndb.rootKey(version)
        ndb.updateLatestVersion(version)
-       ndb.recentBatch.Set(key, hash)
        if flushToDisk {
                ndb.snapshotBatch.Set(key, hash)
+               ndb.recentBatch.Delete(key)
+       } else {
+               ndb.recentBatch.Set(key, hash)
        }

        return nil

But the memory still increases over time. I'll look into the btree code.

@Lbird
Author

Lbird commented Jun 8, 2020

I couldn't find any flaw in the btree code. And even after applying the purging logic, there is no notable difference in memory usage in one-hour tests. Strangely, the memory usage increases even with the default option. To tell whether there is an unrecoverable memory usage increase, I need some long-term testing. Maybe tens of hours. :(

@erikgrinaker
Contributor

Thanks again for looking into this @Lbird, it's much appreciated! I hope to have some time to dig into this myself later today. It's quite possible that this is a Go GC issue as well, it can have rather unintuitive behavior at times.

@Lbird
Author

Lbird commented Jun 9, 2020

Thanks again for looking into this @Lbird, it's much appreciated! I hope to have some time to dig into this myself later today. It's quite possible that this is a Go GC issue as well, it can have rather unintuitive behavior at times.

Oh, it's good to hear that! :)

BTW, I ran long (roughly 7-8 hour) tests overnight, and the result is interesting. With identical load, the default option gave 404MB max RES, while the pruning option and the pruning option with the purging procedure gave 451MB and 457MB max RES respectively. The purging procedure is indeed not very helpful; the pruning option yields ~50MB more max RES either way. I don't know what accounts for the ~400MB base RES, but ~50MB more RES is quite reasonable considering the usage of recentDB. Maybe there is no memory leak related to recentDB?

@erikgrinaker
Contributor

I don't know what accounts for the ~400MB base RES, but ~50MB more RES is quite reasonable considering the usage of recentDB. Maybe there is no memory leak related to recentDB?

Yes, that sounds reasonable to me. The Go garbage collector rarely actually frees memory, instead keeping it allocated for future use. It may be more useful to look at runtime.ReadMemStats() and pprof heap dumps instead of the opaque RSS. I'm going to run some memory profiles here as well, will report what I find.
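
For reference, something along these lines (standard library only) gives a much clearer picture than RSS:

package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// dumpMemStats prints the numbers that matter here: HeapAlloc is live plus
// not-yet-collected heap, HeapInuse is heap memory currently held from the OS.
func dumpMemStats() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapAlloc=%d MiB HeapInuse=%d MiB NumGC=%d\n",
		m.HeapAlloc>>20, m.HeapInuse>>20, m.NumGC)
}

// writeHeapProfile writes a pprof heap profile for `go tool pprof heap.out`.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	runtime.GC() // get up-to-date allocation data before profiling
	return pprof.WriteHeapProfile(f)
}

func main() {
	dumpMemStats()
	if err := writeHeapProfile("heap.out"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}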

@erikgrinaker
Contributor

erikgrinaker commented Jun 9, 2020

Having done quite a bit of testing with your patch, I don't see any evidence of memory leaks - running thousands of versions with thousands of random create/update/delete operations per version biased towards retaining 4096 keys total shows a stable heap and memDB size over time. This was with KeepEvery: 10 and KeepRecent: 10.

However, there does appear to be a problem with the pruning retaining more memDB data than it should. With the same pruning settings (KeepEvery: 10 and KeepRecent: 10), and only doing creates (4096 new keys per version), the number of memDB items grows without bound. With KeepRecent: 10 I would expect it to keep roughly 40960 items in memDB, i.e. the new items created in the last 10 versions, but it's retaining far more than that. I've set aside the next few days to look into this and other IAVL issues, and will dig further tomorrow.

@erikgrinaker
Contributor

I've been busy working on a different IAVL issue, but had a quick look at this again today. The memory growth we see even with your patch is due to inner node write amplification, and is inherent to IAVL (and Merkle trees in general): as the tree grows, the number of inner nodes affected by a single key change is O(log n), and the probability that an inner node is shared by multiple changed keys is similarly reduced. I don't see any way to avoid this with the current design. However, the memory cost per key change should level out over time as O(log n), and I've run some benchmarks that seem to confirm this.

However, I'm not sure if the suggested PR is necessarily the right fix. Let's say I configure IAVL with KeepEvery: 10 and KeepRecent: 5. At version 12, I would expect versions 8-12 to be in memory, but with your PR version 10 will only be available on disk, reducing read performance for this version. I feel like this breaks user expectations, at least with the current pruning design.

I will go through your original issue comment and the IAVL code more closely once I finish work on a different bug - without your PR, the memory usage is far, far higher than it should be, and I wonder if there might be an alternative quick fix which would still retain all KeepRecent versions in memory. Or if we should perhaps change the pruning design.

Thank you again for bringing this up!

@zouxyan

zouxyan commented Jun 16, 2020

Is the pruning option 'syncable' not working on cosmos-sdk v0.38.4? I can't get proof data from any version but the latest one. Any help, please?

@erikgrinaker
Contributor

Hi @zouxyan. With syncable, only the latest version and any versions that are written to disk (i.e. every 10000 versions) are available. All other versions are deleted. If you want to retain historical versions, you must use the nothing pruning strategy.

@zouxyan

zouxyan commented Jun 16, 2020

Hi @zouxyan. With syncable, only the latest version and any versions that are written to disk (i.e. every 10000 versions) are available. All other versions are deleted. If you want to retain historical versions, you must use the nothing pruning strategy.
How about the latest 100 blocks? Can I fetch them? Did the strategy change?

@erikgrinaker
Contributor

How about the latest 100 blocks? Can I fetch them? Did the strategy change?

No, all versions except the latest (and each version that is divisible by 10000) are deleted. In previous versions of the Cosmos SDK, there was no pruning, so all versions were always retained.

@zouxyan

zouxyan commented Jun 16, 2020

@erikgrinaker Thanks.

@alexanderbez
Contributor

How about the latest 100 blocks? Can I fetch them? Did the strategy change?

No, all versions except the latest (and each version that is divisible by 10000) are deleted. In previous versions of the Cosmos SDK, there was no pruning, so all versions were always retained.

There was still pruning, but it's just that each version was always flushed to disk.

@alexanderbez
Contributor

@zouxyan we've introduced granular pruning options in the CLI/config. I can't recall if we backported this to 0.38, but you can check. If so, you can tweak these values to whatever you like. IMHO the default strategy is too conservative. It should flush to disk more often.

@Lbird
Author

Lbird commented Jun 18, 2020

Thank you! I have some comments I hope you will consider.

I've been busy working on a different IAVL issue, but had a quick look at this again today. The memory growth we see even with your patch is due to inner node write amplification, and is inherent to IAVL (and Merkle trees in general): as the tree grows, the number of inner nodes affected by a single key change is O(log n), and the probability that an inner node is shared by multiple changed keys is similarly reduced. I don't see any way to avoid this with the current design. However, the memory cost per key change should level out over time as O(log n), and I've run some benchmarks that seem to confirm this.

Well, this is true only when there is a small number of writes in a relatively large database, say 100-1000 writes in a tree with a million leaves. But as a blockchain node keeps running for a very long time, the recentDB size may not stay at O(log n). Suppose a node starts with a million-leaf tree and keeps running until the tree holds 2 million leaves: the size of recentDB would be roughly k times O(log n), where k is not negligible compared to n. This is obvious when you think of a node that starts with an empty tree and keeps running until the tree grows to 10GB. The size of the recentDB will be O(n log n) = whole tree in the memory. Without periodic restart or periodic purging, it is inevitable. I think something must be done anyway. Honestly, my suggested patch was for the case with keepRecent = 1. So it may not be a proper solution to general cases.

However, I'm not sure if the suggested PR is necessarily the right fix. Let's say I configure IAVL with KeepEvery: 10 and KeepRecent: 5. At version 12, I would expect versions 8-12 to be in memory, but with your PR version 10 will only be available on disk, reducing read performance for this version. I feel like this breaks user expectations, at least with the current pruning design.

Regarding read performance, you are right when there are massive read operations over all the nodes in recentDB. But consider this: all nodes in recentDB are nodes touched (updated or inserted) at some time in the past; without purging, that means every node touched since the tree was loaded from disk. It is difficult for me to imagine a tx that requires reads of the majority of the touched nodes in the tree. For the generic read performance, isn't it the responsibility of the cache anyway? And there is already a cache mechanism.

I will go through your original issue comment and the IAVL code more closely once I finish work on a different bug - without your PR, the memory usage is far, far higher than it should be, and I wonder if there might be an alternative quick fix which would still retain all KeepRecent versions in memory. Or if we should perhaps change the pruning design.

Right, it is a rather difficult issue if we keep the original design. If you are considering a design change, my suggestion is to change the role of recentDB to something like a write buffer which offers a limited read-performance boost between snapshots or checkpoints. BTW, I prefer the term checkpoint to snapshot; it sounds clearer.
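
To sketch what I mean by a write buffer (hypothetical types, illustration only, not a concrete API proposal):

// Hypothetical write-buffer sketch: nodes written since the last checkpoint
// live in a plain map, reads fall back to the node cache / snapshotDB on a
// miss, and the buffer is flushed and dropped at every checkpoint, so memory
// is bounded by one checkpoint interval instead of growing without limit.
type writeBuffer struct {
	pending map[string][]byte // serialized nodes keyed by node key
}

func newWriteBuffer() *writeBuffer {
	return &writeBuffer{pending: make(map[string][]byte)}
}

func (wb *writeBuffer) Set(key, value []byte) {
	wb.pending[string(key)] = value
}

func (wb *writeBuffer) Get(key []byte) ([]byte, bool) {
	value, ok := wb.pending[string(key)]
	return value, ok
}

// Flush writes every pending entry through the provided setter (for example a
// snapshot batch's Set) and resets the buffer.
func (wb *writeBuffer) Flush(set func(key, value []byte)) {
	for k, v := range wb.pending {
		set([]byte(k), v)
	}
	wb.pending = make(map[string][]byte)
}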

Thank you again for bringing this up!

Since this is a blocking issue for our application, I have great interest in this issue. I hope to hear from you soon. Thank you!

@erikgrinaker
Contributor

erikgrinaker commented Jun 18, 2020

The size of the recentDB will be O(n log n) = whole tree in the memory. Without periodic restart or periodic purging, it is inevitable.

Yes, this is the core of the problem; it seems like the current design expects to eventually keep the entire tree in memory.

Without periodic restart or periodic purging, it is inevitable. I think something must be done anyway. Honestly, my suggested patch was for the case with keepRecent = 1. So it may not be a proper solution to general cases.

I have run some tests, and the behavior at least seems to be correct for keepRecent != 1. I'm currently writing additional test cases to make sure it doesn't have any adverse effects, but I think we should be able to merge your patch shortly as a stopgap solution.

For the generic read performance, isn't it the responsibility of the cache anyway?

I agree.

my suggestion is to change the role of recentDB to something like a write buffer which offers a limited read-performance boost between snapshots or checkpoints

This is what I've been thinking as well; we're considering it for the medium term.

Since this is a blocking issue for our application, I have great interest in this issue.

I'm aiming to do a patch release with this fix and a couple of other bug fixes next week, pending further testing.

@erikgrinaker
Contributor

@Lbird I'm afraid we will be removing the current pruning implementation. There are several problems with it that I don't think can be fully addressed with the current design, and we unfortunately don't have the resources to do a redesign right now. We will be making a new release within the week.

@Lbird
Author

Lbird commented Jun 23, 2020

@erikgrinaker OK. I understand. It seems you may close this issue and the PR.

We will stick to our forked iavl until there is some progress on the pruning feature.

@erikgrinaker
Contributor

Sure. Please be aware that it has multiple known data corruption bugs; see e.g. #261. If you decide to keep running it, you should avoid changing your pruning settings, deleting the latest persisted version, or using LoadVersionForOverwriting().
