Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Fix: Saving big payloads by cache CouchDB #1909

Merged
merged 3 commits into from
Sep 24, 2020
Merged

Bug Fix: Saving big payloads by cache CouchDB #1909

merged 3 commits into from
Sep 24, 2020

Conversation

przemo22
Copy link
Contributor

@przemo22 przemo22 commented Sep 22, 2020

Type of change

  • Bug fix

Description

Saving big payloads (bigger than 64kb are not saved in cache)

Additional details

When saving big payloads more than 64kb, changes are updated/saved in CouchDB, but aren't
visible in cache (nothing changed). Then when we execute stub.getState("someID") nothing has been returned
or old revision (smaller than 64kb) will be returned.

Fastcache library has got methods for updating/saving/getting bigPayloads like: GetBig and SetBig.

Release Note

FAB-18245

@przemo22 przemo22 requested a review from a team as a code owner September 22, 2020 08:42
@mastersingh24
Copy link
Contributor

mastersingh24 commented Sep 22, 2020

@przemyslaw - Thanks for the PR. Looks like you need to sign-off on your commit - you likely just need to do:

git commit --amend -s
git push -f

But the change looks pretty good to me

@przemo22
Copy link
Contributor Author

Already done

Signed-off-by: przemyslaw <przemo.wasala@gmail.com>
@cendhu
Copy link
Contributor

cendhu commented Sep 22, 2020

When saving big payloads more than 64kb, changes are updated/saved in CouchDB, but aren't
visible in cache (nothing changed). Then when we execute stub.getState("someID") nothing has been returned
or old revision (smaller than 64kb) will be returned.

Nice catch. We did in fact discuss about this 64 KB limit (listed in the comparison table indirectly -- https://docs.google.com/document/d/1Rxczkwni5oG6MBif1s0KN6_PR29tM4WP9IEOvbLMTu8/edit?usp=sharing). However, the change of size and returning of old version didn't strike.

@manish-sethi @denyeart what's your opinion on the maximum allowed size? Should we allow any arbitray size or keep an absolute maximum? A few larger size entries migh reduce the number of entries in the cache. As it is a workload dependent, we can make it configurable. If we want simplicity, we can go with any size (as done in this PR) without introducing any config parameter.

@denyeart
Copy link
Contributor

@cendhu I would suggest keeping it simple as is done in this PR, with no new config options.
If there becomes demand for making it configurable, we can always add that later.

@@ -64,9 +65,12 @@ func (c *cache) getState(chainID, namespace, key string) (*CacheValue, error) {
}

cacheKey := constructCacheKey(chainID, namespace, key)
valBytes := cache.Get(nil, cacheKey)
valBytes := cache.GetBig(nil, cacheKey)
Copy link
Contributor

@manish-sethi manish-sethi Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clear one doubt, I had a quick look into the code of this library and unfortunately, it turned out to be correct. Blindly looking for a key using GetBig API can cause unintended consequences. I was able to construct one example that leads to a panic.

func TestDoesItWork(t *testing.T) {
	c := fastcache.New(64 * 1024 * 1024)
	c.Set([]byte("key"), []byte("sixteen-byte-val"))
	fmt.Printf("[%s]\n", c.GetBig(nil, []byte("key")))
}

It appears that the library assumes that the application remembers what keys were stored using Set API V/s SetBig API.

In order to fix the bug, my suggestion would be... to check the value size, if greater than size limit then delete the existing key from cache if present

Comment on lines 163 to 169
func setImplicitValue(c *fastcache.Cache, cacheKey []byte, value []byte) {
if cap(value) > fastCacheValueSizeLimit {
c.SetBig(cacheKey, value)
} else {
c.Set(cacheKey, value)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would happen if the oldValue was Big but the new value is not and vice-versa? Will the old big value be deleted before storing the new small value and vice-versa? It isn't clear from the documentation or code.

	c := fastcache.New(64 * 1024 * 1024)

	c.Set([]byte("key1"), []byte("value1"))
	val = c.Get(nil, []byte("key1"))
	fmt.Printf("Get(): Key is set with small value: length [%d], value[%s]\n", len(val), string(val))

	big := make([]byte, 2*64*1024)
	rand.Read(big)
	c.SetBig([]byte("key1"), big)
	val = c.GetBig(nil, []byte("key1"))
	fmt.Printf("GetBig(): Key is set with big value: length [%d]\n", len(val))

	val = c.Get(nil, []byte("key1"))
	fmt.Printf("Get(): Key is set with big value: length [%d], value[%s]\n", len(val), string(val))

	c.Set([]byte("key1"), []byte("value1"))
	val = c.Get(nil, []byte("key1"))
	fmt.Printf("Get(): Key is set with small value: length [%d], value[%s]\n", len(val), string(val))

	val = c.GetBig(nil, []byte("key1"))
	fmt.Printf("GetBig(): Key is set with big value: length [%d]\n", len(val))

Here, after setting the big value to an existing key with a small value, the Get() does not return nil but some random bytes. Though the GetBig() is always used in this PR first before the Get(), there is a lot of unspecified assumption and hence, the code look fragile.

There are two options:

  1. We can always use GetBig() and SetBig() irrespective of the size limit but need to benchmark the performance impact for value < 64 KB.
  2. What @manish-sethi suggested here -- https://github.com/hyperledger/fabric/pull/1909/files#r492786264

I would prefer 2. If needed, we can quantify the impact of always using big APIs and make a separate PR if needed for 1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to take a note, when looking for alternate ways, I would add one more approach in which we could remember in the cache itself if the key was saved using the SetBig api (in a separate namespace) and can evaluate the overheads compared to always using Big apis.

As far as this bug fix is concerned, my inclination also would be to keep it simple as we need to back port this as well. We can maintain the existing behavior (i.e., not storing bigger values) and just fix the bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already upgraded fastcache. Take a look in file changes. Currently we can use Has(<key>) method from fastcache and the problem is getting payload from cache(usage BigCache or Fastcache). So I changed this firstly by using GetBig() and if length of bytes is less than fastcache capacity then I use Get(). This will prevent from getting incorrect byte array.

Feel free for any advice/correction/refactor

Copy link
Contributor

@manish-sethi manish-sethi Sep 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still does not solve the problem that I had highlighted in my previous comment. I'll encourage you to run locally the sample test that I posted in my comment to understand the problem with this fix. I am not sure if upgrading fastcache solves the issue. The problem is with the wrong usage of APIs. To elaborate the case, there could be a key that we previously set using the Set() api. Now, in your fix, you first look for the key using the GetBig() api and as highlighted in my test, this can lead to a panic crash if the value of the key (that is set previously) has certain properties.

For fixing the bug, as we discussed before, in the function putState, we can check the size of the value and if greater than the limit, don't call Set api and instead call Del api in order to remove the previous version of the key from the cache. This maintains the current behavior i.e., we don't store the big values in the cache and just fix the bug.

For allowing for bigger values later, we discussed a couple of options above and they can be tried and evaluated in a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the update PR, the usage of Has(), GetBig(), Get(), and comparing the valueSize and sizeLimit complicates the simple cache usage. Let's go with the following suggestion for a bug fix.

For fixing the bug, as we discussed before, in the function putState, we can check the size of the value and if greater than the limit, don't call Set api and instead call Del api in order to remove the previous version of the key from the cache. This maintains the current behavior i.e., we don't store the big values in the cache and just fix the bug.

As storing > 64 KB is a new feature, let's do that separately with less complication.

Copy link
Contributor Author

@przemo22 przemo22 Sep 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've already understood that you won't take care about bigValues in cache.

  1. You want to save them only in couchDB
  2. When GetState has been called in statecouchDB try get from cache, if null get it directly from couchDb

Am I correct?

In connection with the new version I thought that using provided method like Has and HasGet will be nice.
You have seen changes in UpdateStates and getStatemethod

I did what @manish-sethi suggested to me.

Not ready yet

Signed-off-by: przemyslaw <przemo.wasala@gmail.com>
@przemo22
Copy link
Contributor Author

Locally tests passed. Any hints how to improve them on CI?

@sykesm
Copy link
Contributor

sykesm commented Sep 23, 2020

Locally tests passed. Any hints how to improve them on CI?

Discover, report, and fix flakes. The issues you're hitting have probably been around for a long time.

https://jira.hyperledger.org/browse/FAB-17661?jql=statusCategory%20%20!%3D%20Done%20AND%20labels%20%3D%20ci-flake

@wlahti
Copy link
Contributor

wlahti commented Sep 23, 2020

Locally tests passed. Any hints how to improve them on CI?

Discover, report, and fix flakes. The issues you're hitting have probably been around for a long time.

https://jira.hyperledger.org/browse/FAB-17661?jql=statusCategory%20%20!%3D%20Done%20AND%20labels%20%3D%20ci-flake

Actually, this is a newer flake and I'm the one guilty of introducing it: https://jira.hyperledger.org/browse/FAB-18238 Just created a PR with a fix: #1916

Copy link
Contributor

@manish-sethi manish-sethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @przemyslaw. The logic of deleting the existing key looks good overall. However, I noticed one issue when I looked closely. You would need to do this unconditionally. A few minor clean up comments as well.

core/ledger/kvledger/txmgmt/statedb/statecouchdb/cache.go Outdated Show resolved Hide resolved
// If payload is bigger than fastcache can store then delete previous key.
// When GetState() has been called for deleted key then it will call database
// against fastcache.
if cap(valBytes) > fastCacheValueSizeLimit {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • len(valBytes) instead of cap(valBytes)

  • Looking closely, unfortunately, even this is not going to be a sufficient condition because they way this library encodes the data internally, uint64(len(kvLenBuf) + len(k) + len(v)) is the length that matters not the length of the value only. Here, again kvLenBuf is a fixed size of 4. But I do not prefer to depend on so much of internal logic of a library. So, my suggestion would be to always delete the key irrespective of the size condition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, my suggestion would be to always delete the key irrespective of the size condition.

@przemyslaw A clarification on the above suggestion. I think @manish-sethi suggested to remove fastCacheValueSizeLimit altogether and all related checks associated with it. We will simply delete the existing entry before storing a new one without checking the size. Hence, before every cache.Set(), we can simply do cache.Del() (while cache.Has() is optional). For bigger values, cache.Set() would anyway ignore. I have verified this with @manish-sethi as I had some confusion. I am sharing it here so that the suggestion is explicit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines 96 to 98
if cache.Has(cacheKey) {
cache.Del(cacheKey)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a critical piece for this bug fix but this is not covered in the test

Comment on lines 63 to 68

// test GetState
v, err := cache.getState("ch1", "ns1", "k1")
require.NoError(t, err)
require.Nil(t, v)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the data before so that line 96-98 in cache.go gets tested.

core/ledger/kvledger/txmgmt/statedb/statecouchdb/cache.go Outdated Show resolved Hide resolved
@przemo22
Copy link
Contributor Author

@manish-sethi take a look one more time. I have found that my previous commit didn't work. Actual PR is ready for review

@manish-sethi
Copy link
Contributor

manish-sethi commented Sep 24, 2020

@manish-sethi take a look one more time. I have found that my previous commit didn't work. Actual PR is ready for review

@przemyslaw - My review comments are applicable to your latest commit. If you can address these today then it will be able to back port this PR to previous release as well.

@denyeart
Copy link
Contributor

@przemyslaw To clarify the timing, we are expecting to release a v2.2.1 tomorrow. If this PR can be merged in time, we can also quickly backport to release-2.2 so that it gets into the v2.2.1 release.

@przemo22
Copy link
Contributor Author

Corrections are ready for review (@manish-sethi, @cendhu ).
Can you give me feedback as soon as possible, please

Copy link
Contributor

@manish-sethi manish-sethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@przemyslaw - Thanks. This look good to me now. I have a few minor comments on the test code though.

Comment on lines 65 to 68
v, err := cache.getState("ch1", "ns1", "k1")
require.NoError(t, err)
require.Nil(t, v)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not relevant for this test

expectedValue := &CacheValue{Value: []byte("value")}
require.NoError(t, cache.putState("ch1", "ns1", "k1", expectedValue))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this, you could have added following to verify the initial condition for the test

v, err = cache.getState("ch1", "ns1", "k1")
require.NoError(t, err)
require.True(t, proto.Equal(expectedValue, v))

Comment on lines +76 to +75
require.NoError(t, cache.putState("ch1", "ns1", "k1", expectedValue1))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to this, you could have added a test for the function cache.UpdateStates, as that is not covered in this test.

@wlahti
Copy link
Contributor

wlahti commented Sep 24, 2020

Locally tests passed. Any hints how to improve them on CI?

Discover, report, and fix flakes. The issues you're hitting have probably been around for a long time.
https://jira.hyperledger.org/browse/FAB-17661?jql=statusCategory%20%20!%3D%20Done%20AND%20labels%20%3D%20ci-flake

Actually, this is a newer flake and I'm the one guilty of introducing it: https://jira.hyperledger.org/browse/FAB-18238 Just created a PR with a fix: #1916

FYI the fix for the integration test flake you hit (again) here was just merged into master. Rebase on master to pull in that commit and you should be good to go.

Copy link
Contributor

@manish-sethi manish-sethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @przemyslaw for your patience and consistent improvement on this PR.

Signed-off-by: przemyslaw <przemo.wasala@gmail.com>
@manish-sethi manish-sethi merged commit 646260e into hyperledger:master Sep 24, 2020
@manish-sethi
Copy link
Contributor

@Mergifyio backport release-2.2

@mergify
Copy link

mergify bot commented Sep 24, 2020

Command backport release-2.2: failure

No backport have been created

  • Backport to branch release-2.2 failed

Cherry-pick of 646260e has failed:

On branch mergify/bp/release-2.2/pr-1909
Your branch is up to date with 'origin/release-2.2'.

You are currently cherry-picking commit 646260eb9.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:

	modified:   core/ledger/kvledger/txmgmt/statedb/statecouchdb/cache.go
	modified:   core/ledger/kvledger/txmgmt/statedb/statecouchdb/cache_test.go
	deleted:    vendor/github.com/VictoriaMetrics/fastcache/.travis.yml
	modified:   vendor/github.com/VictoriaMetrics/fastcache/README.md
	modified:   vendor/github.com/VictoriaMetrics/fastcache/bigcache.go
	modified:   vendor/github.com/VictoriaMetrics/fastcache/fastcache.go
	modified:   vendor/github.com/VictoriaMetrics/fastcache/file.go
	modified:   vendor/github.com/VictoriaMetrics/fastcache/go.mod
	modified:   vendor/github.com/VictoriaMetrics/fastcache/go.sum
	deleted:    vendor/github.com/cespare/xxhash/README.md
	deleted:    vendor/github.com/cespare/xxhash/go.mod
	deleted:    vendor/github.com/cespare/xxhash/go.sum
	deleted:    vendor/github.com/cespare/xxhash/rotate.go
	deleted:    vendor/github.com/cespare/xxhash/rotate19.go
	new file:   vendor/github.com/cespare/xxhash/v2/.travis.yml
	renamed:    vendor/github.com/cespare/xxhash/LICENSE.txt -> vendor/github.com/cespare/xxhash/v2/LICENSE.txt
	new file:   vendor/github.com/cespare/xxhash/v2/README.md
	new file:   vendor/github.com/cespare/xxhash/v2/go.mod
	new file:   vendor/github.com/cespare/xxhash/v2/go.sum
	new file:   vendor/github.com/cespare/xxhash/v2/xxhash.go
	renamed:    vendor/github.com/cespare/xxhash/xxhash_amd64.go -> vendor/github.com/cespare/xxhash/v2/xxhash_amd64.go
	renamed:    vendor/github.com/cespare/xxhash/xxhash_amd64.s -> vendor/github.com/cespare/xxhash/v2/xxhash_amd64.s
	renamed:    vendor/github.com/cespare/xxhash/xxhash_other.go -> vendor/github.com/cespare/xxhash/v2/xxhash_other.go
	renamed:    vendor/github.com/cespare/xxhash/xxhash_safe.go -> vendor/github.com/cespare/xxhash/v2/xxhash_safe.go
	new file:   vendor/github.com/cespare/xxhash/v2/xxhash_unsafe.go
	deleted:    vendor/github.com/cespare/xxhash/xxhash.go
	deleted:    vendor/github.com/cespare/xxhash/xxhash_unsafe.go
	modified:   vendor/github.com/golang/snappy/AUTHORS
	modified:   vendor/github.com/golang/snappy/CONTRIBUTORS
	modified:   vendor/github.com/golang/snappy/decode.go
	new file:   vendor/github.com/golang/snappy/decode_arm64.s
	renamed:    vendor/github.com/golang/snappy/decode_amd64.go -> vendor/github.com/golang/snappy/decode_asm.go
	modified:   vendor/github.com/golang/snappy/decode_other.go
	modified:   vendor/github.com/golang/snappy/encode.go
	new file:   vendor/github.com/golang/snappy/encode_arm64.s
	renamed:    vendor/github.com/golang/snappy/encode_amd64.go -> vendor/github.com/golang/snappy/encode_asm.go
	modified:   vendor/github.com/golang/snappy/encode_other.go
	modified:   vendor/modules.txt

Unmerged paths:
  (use "git add <file>..." to mark resolution)

	both modified:   go.mod
	both modified:   go.sum

manish-sethi pushed a commit to manish-sethi/fabric that referenced this pull request Sep 24, 2020
* Bug Fix: Saving big payloads by cache CouchDB

Signed-off-by: przemyslaw <przemo.wasala@gmail.com>

* Update fastcache and use new API from fastcache

Signed-off-by: przemyslaw <przemo.wasala@gmail.com>

* Cache support only small payloads

Signed-off-by: przemyslaw <przemo.wasala@gmail.com>
Signed-off-by: manish <manish.sethi@gmail.com>
denyeart pushed a commit that referenced this pull request Sep 24, 2020
* Bug Fix: Saving big payloads by cache CouchDB

Signed-off-by: przemyslaw <przemo.wasala@gmail.com>

* Update fastcache and use new API from fastcache

Signed-off-by: przemyslaw <przemo.wasala@gmail.com>

* Cache support only small payloads

Signed-off-by: przemyslaw <przemo.wasala@gmail.com>
Signed-off-by: manish <manish.sethi@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants