refactor: implement cache abstraction and unit test #506
* implementing cache abstraction, unit test Add
* create benchmark for Add(), start working on benchmark for Remove()
* implement no duplicates in cache and unit test
* switch fast node cache to the new abstraction
* switch regular cache to the new abstraction
* fmt and add interface godoc
* rename receiver to c from nc
* const fastNodeCacheLimit
* move benchmarks to a separate file
amazing, thank you!
Could you resolve the conflicts? Then we can merge it. Thank you.
Did anyone make an in-depth review?
In addition to @marbar3778's review, @alexanderbez helped with looking into this in the Osmosis fork a few months ago: osmosis-labs#38. I'm happy to wait if anyone else would like to look at this.
Let's consider using https://github.com/hashicorp/golang-lru/tree/master/simplelru rather than implementing our own LRU.
	dict       map[string]*list.Element // FastNode cache.
	cacheLimit int                      // FastNode cache size limit in elements.
	ll         *list.List               // LRU queue of cache elements. Used for deletion.
}
I would prefer that we use https://github.com/hashicorp/golang-lru or https://github.com/hashicorp/golang-lru/tree/master/simplelru. We are already using it in various places.
I've looked into these options and, unfortunately, they do not allow enough customization for future work.
I'm planning to further improve the custom cache by adding the ability to have 2 kinds of limits:
- in terms of bytes
- in terms of the number of nodes
I have started this work here and am hoping to move it to this repository: osmosis-labs#40
More background about why I'm introducing this cache and its limits can be found here: osmosis-labs/osmosis#1163
In short, we found configuring the limits in terms of nodes unintuitive, so we would like to provide the ability to do so in bytes, only for the fast cache.
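To illustrate the two kinds of limits described above, here is a minimal sketch of an LRU cache that evicts when either a node-count limit or a byte limit is exceeded. It is not the code from osmosis-labs#40; the names `dualLimitCache`, `entry`, `maxNodes`, and `maxBytes` are hypothetical, and the byte size of an entry is assumed to be simply the key length plus the value length.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is a hypothetical cached item used only for this sketch.
type entry struct {
	key   string
	value []byte
}

// size is an assumed byte estimate: key length plus value length.
func (e *entry) size() int { return len(e.key) + len(e.value) }

// dualLimitCache evicts least-recently-used entries when either the
// node-count limit or the byte limit is exceeded.
type dualLimitCache struct {
	maxNodes int
	maxBytes int
	curBytes int
	ll       *list.List               // front = most recently used
	dict     map[string]*list.Element // key -> queue element
}

func newDualLimitCache(maxNodes, maxBytes int) *dualLimitCache {
	return &dualLimitCache{
		maxNodes: maxNodes,
		maxBytes: maxBytes,
		ll:       list.New(),
		dict:     make(map[string]*list.Element),
	}
}

// Add inserts or refreshes an entry and returns the keys that were evicted.
func (c *dualLimitCache) Add(key string, value []byte) []string {
	if elem, ok := c.dict[key]; ok {
		// Existing key: update the byte accounting and mark as recently used.
		old := elem.Value.(*entry)
		c.curBytes -= old.size()
		old.value = value
		c.curBytes += old.size()
		c.ll.MoveToFront(elem)
	} else {
		e := &entry{key: key, value: value}
		c.dict[key] = c.ll.PushFront(e)
		c.curBytes += e.size()
	}

	// Evict from the back until both limits hold again.
	var evicted []string
	for c.ll.Len() > 0 && (c.ll.Len() > c.maxNodes || c.curBytes > c.maxBytes) {
		e := c.ll.Remove(c.ll.Back()).(*entry)
		delete(c.dict, e.key)
		c.curBytes -= e.size()
		evicted = append(evicted, e.key)
	}
	return evicted
}

func main() {
	c := newDualLimitCache(3, 10)
	c.Add("a", []byte("123"))            // 4 bytes in use
	c.Add("b", []byte("123"))            // 8 bytes in use
	evicted := c.Add("c", []byte("123")) // 12 bytes > 10, so "a" is evicted
	fmt.Println(evicted, c.ll.Len(), c.curBytes) // [a] 2 8
}
```

In this sketch the byte limit triggers the eviction even though the node-count limit of 3 was never exceeded, which is the behavior a byte-based limit is meant to provide.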
Maybe let's add a short comment in the code, before someone else comes and proposes to remove this and reuse another library.
Added
@p0mvn how about https://dgraph.io/blog/post/introducing-ristretto-high-perf-go-cache/ It supports setting byte high watermarks and that blogpost talks about their journey in figuring out some LRUs and caches.
@odeke-em thanks for the suggestion. I've investigated using ristretto by swapping it in for the fast node cache but have run into issues.
The major one is a ~17% increase in time/op. I suspect that is due to ristretto having overhead from concurrency guarantees which we might not really need.
Also, some randomized tests started failing for reasons I could not explain.
The current state of the attempt, summary, and benchmarks can be found here: #508
nodedb.go
Outdated
	opts:          *opts,
	latestVersion: 0, // initially invalid
	nodeCache:     cache.New(cacheSize),
	fastNodeCache: cache.New(fastNodeCacheLimit),
Why do we still need fastNodeCache?
According to benchmarks, removing it significantly degrades the performance: osmosis-labs/osmosis#1041
Thanks for the link! It seems we both had the same concern :) Good job!
I looked at it too fast. I don't see a benchmark with nodeCache and no-fastNodeCache. The benchmark you have is noCache vs with fastNodeCache. Sorry, maybe I'm not interpreting it correctly as I don't see the whole setting.
I think I just wasn't consistent with how I was referring to the benchmarks, sorry for the confusion.
The comparison is most certainly between "both node caches" and "regular cache but no-fastNodeCache".
When I referred to:
- cache.log = both node caches
- nocache.log = regular cache but no-fastNodeCache
Looking at the benchstat, the delta is coming mostly from query-hits-fast-16 and iteration-fast-16, proving that the regular cache was untouched. If it had been touched, we would see a delta in query-hits-slow-16 and iteration-slow-16.
To explain the reasons for trying this, our node operators were getting OOM'ed. pprof results were showing memory growth only specific to fast cache. There was never a motivation for trying things w/o regular cache.
To mitigate the memory growth, we decided to hardcode the value for the fast cache to 100000 (line 33 in d07ef28):
fastNodeCacheLimit = 100000
The value was chosen by spinning up a bunch of nodes at 10k, 50k, 100k, 150k, and 200k. 100k proved to be the best.
Spinning up several nodes to find a good value seemed like an unnecessarily complex solution. That's why I decided to do this refactor to eventually introduce byte cache for fast nodes. I'm also hoping to expose a parameter in config to let node operators choose this byte value.
@robert-zaremba thanks for the review. Addressed all comments in threads. Please let me know what you think.
gobencher is failing for reasons unrelated to this PR.
Thank you @p0mvn for the work! Thanks @marbar3778 for the ping!
@p0mvn I've added some comments, please take a look.
type Cache interface {
	// Adds node to cache. If full and had to remove the oldest element,
	// returns the oldest, otherwise nil.
	Add(node Node) Node
What happens when someone adds a nil Node? That'll break this entire interface. If nil nodes can be permitted, it might be much more useful to return (last Node, evicted bool)
There should never be a nil Node added. Would it be sufficient to add a comment saying that the contract for this method is that node is never nil?
From reading this discussion, I could either:
- use reflection to check against nil in the Add method
- define an IsNil() method for the Node interface and implement it for every node struct.
However, if we were to go with one of the approaches, I think that this would be trading performance for safety unnecessarily at such a low database layer.
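For context on why a plain `node == nil` check is not enough here, the sketch below shows the Go behavior behind the two options: an interface value holding a nil pointer is itself non-nil, so detecting it needs reflection (or a per-type method). The `Node` interface is reduced to `GetKey`, and `testNode` and `isNilNode` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"reflect"
)

// Node is reduced to the single method seen in the diff.
type Node interface {
	GetKey() []byte
}

// testNode is a hypothetical implementation for this sketch.
type testNode struct{ key []byte }

func (n *testNode) GetKey() []byte { return n.key }

// isNilNode reports whether node is a nil interface or wraps a nil pointer.
// This is the reflection-based option; it costs extra work on every call.
func isNilNode(node Node) bool {
	if node == nil {
		return true
	}
	v := reflect.ValueOf(node)
	return v.Kind() == reflect.Ptr && v.IsNil()
}

func main() {
	var typedNil *testNode
	var node Node = typedNil

	fmt.Println(node == nil)     // false: the interface carries a type
	fmt.Println(isNilNode(node)) // true: the wrapped pointer is nil
	fmt.Println(isNilNode(&testNode{key: []byte("k")})) // false
}
```

The reflection call runs on every `Add`, which is the performance cost weighed against safety in the comment above.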
A comment will perhaps suffice. Thank you @p0mvn.
	Add(node Node) Node

	// Returns Node for the key, if exists. nil otherwise.
	Get(key []byte) Node
What if the value stored is nil?
It should never be possible for the value to be nil.
Currently, I added the CONTRACT to the spec of Add(node Node) explaining that node must never be nil.
Given the contract, it is impossible to have a nil value.
Gotcha.
cache/cache.go
Outdated
	elem := c.ll.PushFront(node)
	c.dict[string(node.GetKey())] = elem

	if c.ll.Len() > c.cacheLimit {
Shouldn't this be >= ?
When I named this limit, I meant maximum, including in tests. I changed the names to reflect that so that the condition is kept as is.
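The difference between `>` and `>=` can be seen in a small sketch of the Add path, assuming the limit means "maximum number of elements the cache may hold". This is not the PR's exact code: the field name `maxElementCount`, the constructor `newLRU`, and the handling of duplicate keys (refresh and return nil) are assumptions made for illustration.

```go
package main

import (
	"container/list"
	"fmt"
)

// Node is reduced to the single method seen in the diff.
type Node interface {
	GetKey() []byte
}

// testNode is a hypothetical implementation for this sketch.
type testNode struct{ key []byte }

func (n *testNode) GetKey() []byte { return n.key }

type lruCache struct {
	dict            map[string]*list.Element
	maxElementCount int        // assumed name: the most elements the cache may hold
	ll              *list.List // front = most recently added
}

func newLRU(maxElementCount int) *lruCache {
	return &lruCache{
		dict:            make(map[string]*list.Element),
		maxElementCount: maxElementCount,
		ll:              list.New(),
	}
}

// Add inserts node and returns the evicted oldest node, or nil.
// CONTRACT: node must never be nil.
func (c *lruCache) Add(node Node) Node {
	key := string(node.GetKey())
	if elem, ok := c.dict[key]; ok {
		// Assumption for this sketch: a duplicate key only refreshes the entry.
		elem.Value = node
		c.ll.MoveToFront(elem)
		return nil
	}
	c.dict[key] = c.ll.PushFront(node)

	// The check runs after the push, so ">" keeps exactly maxElementCount
	// elements; ">=" would cap the cache at maxElementCount-1.
	if c.ll.Len() > c.maxElementCount {
		oldest := c.ll.Remove(c.ll.Back()).(Node)
		delete(c.dict, string(oldest.GetKey()))
		return oldest
	}
	return nil
}

func main() {
	c := newLRU(2)
	c.Add(&testNode{key: []byte("a")})
	c.Add(&testNode{key: []byte("b")})
	evicted := c.Add(&testNode{key: []byte("c")})
	fmt.Println(string(evicted.GetKey()), c.ll.Len()) // a 2
}
```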
cache/cache_bench_test.go
Outdated
	b.Run(name, func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = cache.Add(&testNode{
				key: randBytes(tc.keySize),
I highly suspect that randBytes is going to skew the benchmark results because it reads from crypto/rand.Reader. Please either:
- invoke b.StopTimer(), generate the bytes, then b.StartTimer()
or
- at init time, generate a whole lot of these keys, then just iterate by the keyCount
But option 1 is the most pragmatic.
Went with option 1, thanks
cache/cache_bench_test.go
Outdated
	for i := 0; i < b.N; i++ {
		key := existentKeyMirror[r.Intn(len(existentKeyMirror))]
		b.ResetTimer()
This benchmark doesn't make any sense. You keep resetting the timer in the most important part of the benchmark on every b.N iteration. This looks like a mistake.
cache/cache_bench_test.go
Outdated
	randSeed := 498727689 // For deterministic tests
	r := rand.New(rand.NewSource(int64(randSeed)))
I think you should have instead invoked b.ResetTimer() before the loop.
@p0mvn any chance you will finish this PR this week?
Yes, I'll aim to complete this in the next couple of days. Apologies for the delay.
Thanks!
@robert-zaremba @odeke-em addressed your comments, please take a look when you have a chance.
utACK.
It would also be good to include a short doc explaining the difference between nodeCache and fastNodeCache.
Added comments explaining the difference between the 2 cache fields in the struct declaration. I think it would be good to have a README explaining the motivation for the fast index change overall. I can work on this over the next week and open a separate PR.
The test is still failing.
Hey everyone. This is ready for review / final approval. Please take a look when you can.
Backport of: osmosis-labs#38
Context
This change has been running on Osmosis for almost 2 months so it is relatively low-risk.
Original Benchstat from Osmosis fork