Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store: Concurrent iterators cause panic with iavl 0.19 #13220

Closed
2 tasks
tac0turtle opened this issue Sep 9, 2022 · 11 comments
Closed
2 tasks

Store: Concurrent iterators cause panic with iavl 0.19 #13220

tac0turtle opened this issue Sep 9, 2022 · 11 comments
Assignees
Labels

Comments

@tac0turtle
Copy link
Member

Summary of Bug

Injective reported a concurrent issue with iterators and fast node.

https://gist.github.com/achilleas-kal/1751d83e22b313a7f759add9ff21998f

We should write a couple tests in the store to test concurrency of trees being used in the sdk.

  • Write test for concurrency in the store package
  • fix iavl concurrent issue
@tac0turtle tac0turtle moved this to 📝 Todo in Cosmos-SDK Sep 9, 2022
@facundomedica facundomedica self-assigned this Sep 9, 2022
@facundomedica facundomedica moved this from 📝 Todo to 💪 In Progress in Cosmos-SDK Sep 9, 2022
@albertchon
Copy link
Contributor

We've had to rollback the cachekv iterator and store code to the version from v0.44.5, as the vanilla SDK has always resulted in panics for us. This rolled back code has worked fine for us InjectiveLabs@7c184a9

Here's some more context from a Hackerone bug report we submitted many months ago.
image

And then we run a long fuzz test, so I cannot tell you exactly how to reproduce it, since it's a fuzz test. But I can tell you the reason for the deadlock. When running the tests with -race and some additional logging, you can see the output here https://gist.github.com/gorgos/1e9b4e194b67c9c5c8e4aed064a0e774. The line prefixStore.ReverseIterator(nil, nil) calls cachekv/memiterator.go which calls newMemDBIterator. (see first stack trace) This acquires a db readlock https://github.com/tendermint/tm-db/blob/6f9a08cd4522fb13d8df69b1694587d5015114fc/memdb_iterator.go#L39 which is closed in the inner function below that runs in parallel. This function blocks in line https://github.com/tendermint/tm-db/blob/6f9a08cd4522fb13d8df69b1694587d5015114fc/memdb_iterator.go#L57. (see last stack trace) So far this works as expected I assume, with mem db waiting for iterator to either read or close.
Unfortunately, the second call prefixStore.Iterator(nil, nil) goes into dirtyItems (see second stack trace) which then calls clearUnsortedCacheSubset and then tries to set a value inside the mem db here

err := store.sortedCache.Set(item.Key, item.Value)
. But this doesn't work since the previous process still has a mem db readlock.
We have tracked down the issue that introduced this behavior to this one b9e6389.

@alexanderbez
Copy link
Contributor

The hacked code in CacheKV and tm-db continuously seem to bite us in the ass...

@facundomedica
Copy link
Member

At first, I thought this issue was caused by the update on the SDK (#12626) which now iterates through the tree without checking if the tree is mutable or not. This update looks wrong given that the MutableTree Iterator has this contract: no updates are made to the tree while an iterator is active..
But even then, trying to iterate through the ImmutableTree gives similar results which leads me to think that the issue is in the fast cache/node/iterator implemented in IAVL v0.18.

@facundomedica
Copy link
Member

I was able to replicate this issue with these:

func TestIterateConcurrency(t *testing.T) {
	tree, err := getTestTree(0)
	require.NoError(t, err)

	for i := 0; i < 100; i++ {
		go func() {
			for j := 0; j < 1000000; j++ {
				tree.Set([]byte(fmt.Sprintf("%d%d", i, j)), rand.Bytes(1))
			}
		}()
		tree.Iterate(func(key []byte, value []byte) bool {
			return false
		})
	}

}

// TestConcurrency throws "fatal error: concurrent map iteration and map write" and
// also sometimes "fatal error: concurrent map writes"
func TestIteratorConcurrency(t *testing.T) {
	tree := setupMutableTree(t)

	for i := 0; i < 1000; i++ {
		go func() {
			for j := 0; j < 1000000; j++ {
				tree.Set([]byte(fmt.Sprintf("%d%d", i, j)), rand.Bytes(1))
			}
		}()
		itr, _ := tree.Iterator(nil, nil, true)
		for ; itr.Valid(); itr.Next() {
		}
	}
}

These tests don't fail on v0.17.3 or v0.18.0 but they do fail on v0.19.0 (after fast cache was added). There are also some new unanswered comments there that we might want to take a look at cosmos/iavl#468

cc: @p0mvn I would appreciate your thoughts on this one 🙏

@p0mvn
Copy link
Member

p0mvn commented Sep 13, 2022

Hi @facundomedica

When does this pattern of concurrently Set()ing and Iterate()ing occur in the SDK?

From my understanding, all data is kept in cachekv until Commit(), at which point we Write() into IAVL. I don't think there are any iterations during Commit() so this problem shouldn't happen.

On a quick look, this comment might be related to the issue:
cosmos/iavl#468 (comment)

From my memory, we did not synchronize these maps to avoid performance overhead based on the knowledge that this read/write pattern should not happen. If the read/write patterns have changed, the maps in the IAVL discussion probably need to be synchronized

@facundomedica
Copy link
Member

When does this pattern of concurrently Set()ing and Iterate()ing occur in the SDK?

In the SDK I'm not sure we have that happening, but for sure it's possible to do so; which is what we are seeing in the stack trace shared by Injective.

@marbar3778 what do you think of: "this read/write pattern should not happen"?

@tac0turtle
Copy link
Member Author

@marbar3778 what do you think of: "this read/write pattern should not happen"?

This is the main problem. Behaviour changed because it wasn't known and still isn't known. I think we may want to add a mutex to the value to replicate previous behaviour and then define the behaviour we should aim to move towards and refactor accordingly.

@tac0turtle
Copy link
Member Author

When does this pattern of concurrently Set()ing and Iterate()ing occur in the SDK?

Looking in the sdk is not best here. Users define how they interact with the store because its not well defined in the documentation. SDK is a poor example of complex usage of the store package imo

@alexanderbez
Copy link
Contributor

alexanderbez commented Sep 13, 2022

Yeah, but also keep in mind the IAVL implementation states various invariants/contracts that must be held. So either we improve docs around this and state what is and is not allowed, or we look into supporting this behavior. Even in the later case, we should improve expected usage/contract docs.

@tac0turtle
Copy link
Member Author

@facundomedica could we document expected behaviour of iavl in the sdk docs so users understand what is possible and what is not. Since you added the test in iavl I think that's sufficient

@facundomedica
Copy link
Member

@facundomedica could we document expected behaviour of iavl in the sdk docs so users understand what is possible and what is not. Since you added the test in iavl I think that's sufficient

Here: #13386

Repository owner moved this from 💪 In Progress to 👏 Done in Cosmos-SDK Sep 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

No branches or pull requests

5 participants