raftentry: rewrite raftEntryCache with HashMaps and a Replica-level LRU policy #30152
Conversation
Interesting. We could also switch from an llrb to a btree, which might have some cache-locality perf benefits, or a skiplist, which allows for concurrency. Is this intended as a prototype? Seems late in the release cycle for this to go into 2.1.
Yes, I agree that this is too big of a change to make it into 2.1. Luckily #30151 reduces the immediate need for this. A more efficient tree structure would help here, but I don't think that's the fundamental issue. The problem is that regardless of how efficient we make the tree data structure, the cache is still attempting to maintain an LRU policy across all Raft entries on a Store. This severely limits how much concurrency we can achieve in the cache, because at some point during each operation we need to maintain the LRU accounting for every entry accessed. Ironically, this LRU policy isn't actually enforced correctly.

After reflecting on this change a bit, I think the salient change here is the migration from an entry-granularity LRU policy to a Replica-granularity LRU policy. Once we make that change, we can revisit how we want to design the backing data structure to hold each Replica's entries. For instance, this PR could easily swap out the inner map with a btree and still preserve the majority of the benefit/performance while also preserving the flexibility of an ordered structure. We could also revisit the proposal for a partitioned locking scheme, where a cache access turns into:
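(The enumerated steps after "a cache access turns into:" appear to have been lost when this page was captured.) As a rough illustration of what a partitioned locking scheme could look like — all names here are hypothetical, not the actual CockroachDB API — the cache can be split into fixed shards, each guarded by its own mutex, so Replicas that hash to different shards never contend on the same lock:

```go
package main

import (
	"fmt"
	"sync"
)

const numShards = 16

type entry struct {
	index uint64
	data  []byte
}

// shard holds a slice of the cache plus the only mutex an access to it
// must take; contention is bounded to Replicas sharing a shard.
type shard struct {
	mu      sync.Mutex
	entries map[int64]map[uint64]entry // rangeID -> entry index -> entry
}

type shardedCache struct {
	shards [numShards]shard
}

func newShardedCache() *shardedCache {
	c := &shardedCache{}
	for i := range c.shards {
		c.shards[i].entries = make(map[int64]map[uint64]entry)
	}
	return c
}

// shardFor hashes a rangeID to its shard. Here the "hash" is a simple
// modulus; a real implementation might mix the bits first.
func (c *shardedCache) shardFor(rangeID int64) *shard {
	return &c.shards[uint64(rangeID)%numShards]
}

func (c *shardedCache) add(rangeID int64, e entry) {
	s := c.shardFor(rangeID)
	s.mu.Lock()
	defer s.mu.Unlock()
	m, ok := s.entries[rangeID]
	if !ok {
		m = make(map[uint64]entry)
		s.entries[rangeID] = m
	}
	m[e.index] = e
}

func (c *shardedCache) get(rangeID int64, index uint64) (entry, bool) {
	s := c.shardFor(rangeID)
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.entries[rangeID][index]
	return e, ok
}

func main() {
	c := newShardedCache()
	c.add(7, entry{index: 42, data: []byte("x")})
	if e, ok := c.get(7, 42); ok {
		fmt.Println(e.index) // prints 42
	}
}
```

This only sketches the locking topology; eviction accounting across shards is the hard part the comment is alluding to.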
Hah, looking into a completely unrelated issue led me to #13974 (comment). Am I just a year late on all of this?
Well, we never did anything significant back then because changes here didn't seem to move the needle. Cockroach has improved significantly in the interim, so the Raft entry cache is worth revisiting.
I don't think what I have here is necessarily the right approach, but I think we need to do something here early in the 2.2 release cycle. This cache is currently the biggest source of contention in our system, and I have reason to believe it's a serious bottleneck for high throughput on large machines (> 4 cores). Here are some of the things I've seen over the past two weeks:
I also suspect that the cache in its current form is having a serious effect on the GC. In my TPC-C experiment, the cache averages a size of around 120,000 entries. Each entry currently results in 2 heap-allocated objects.

I don't have a concrete proposal yet for what a full future version of this cache should look like, but I have a few general design considerations:
We need to be careful about removing the store-wide memory bound. That bound is extremely useful. Having per-replica memory bounds can lead to unexpected large memory usage.
I doubt an LRU cache is necessary for the raft entry cache. It was likely just the tool available, so it was used. Performing evictions through arena "rotation" à la the timestamp cache would likely be fine from a caching perspective, and a lot faster. I've spent a bunch of time in the
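The "rotation" idea can be sketched as a two-generation cache: writes go to the current generation, and when it exceeds its budget the previous generation is dropped wholesale and the generations swap. This is a minimal, hypothetical sketch of the swap mechanic (the real timestamp cache uses memory arenas; none of these names come from the actual code):

```go
package main

import "fmt"

// rotCache evicts by generation rotation rather than per-entry LRU
// bookkeeping: amortized O(1) eviction with no linked list to maintain.
type rotCache struct {
	budget    int // max entries per generation
	cur, prev map[uint64][]byte
}

func newRotCache(budget int) *rotCache {
	return &rotCache{
		budget: budget,
		cur:    map[uint64][]byte{},
		prev:   map[uint64][]byte{},
	}
}

func (c *rotCache) add(idx uint64, data []byte) {
	c.cur[idx] = data
	if len(c.cur) >= c.budget {
		// Rotate: the old previous generation is dropped in one step.
		c.prev = c.cur
		c.cur = map[uint64][]byte{}
	}
}

// get checks the current generation first, then the previous one.
func (c *rotCache) get(idx uint64) ([]byte, bool) {
	if d, ok := c.cur[idx]; ok {
		return d, true
	}
	d, ok := c.prev[idx]
	return d, ok
}

func main() {
	c := newRotCache(2)
	c.add(1, []byte("a"))
	c.add(2, []byte("b")) // triggers a rotation; 1 and 2 move to prev
	c.add(3, []byte("c"))
	_, ok := c.get(1)
	fmt.Println(ok) // true: index 1 survives in the previous generation
}
```

Recently used entries survive at least one rotation, which approximates LRU behavior without any per-access accounting.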
The dense entry indices are ordered though, right? I think we need ordering of some form, but you're right that the ordering doesn't imply a tree structure. I'm not seeing how we'd utilize the dense entry indices in a structure, but that could just be a failure of my brain right now.
Yes, I agree. I was just pointing out that it's the only reason why we need to have any synchronization at all between Replicas.
That's my expectation as well, although I'd like to confirm that the access patterns for this cache are amenable to that.
Yes, that's exactly what I was thinking. A similar option would be to store a struct similar to
What I was implying was that we should consider an approach like I have in this PR with the inner hashmap. It's able to get away with an unordered structure by relying on information already maintained by the Replica (first index in the cache = truncated index, last index in the cache = last index). This allows it to perform a few tricks, but as a result, it also constrains the interface it's able to provide.
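The inner-hashmap trick described above can be sketched as follows (a simplified illustration under the assumption that the caller knows the span's bounds; not the PR's actual code): because indices are dense, a range read over an unordered map is just a loop from `lo` to `hi`, and any gap means a miss for the whole span.

```go
package main

import "fmt"

// scan reads the dense index span [lo, hi) out of an unordered map.
// No ordered tree is needed: contiguity lets a plain loop stand in for
// a range scan, and a missing index fails the whole read.
func scan(m map[uint64][]byte, lo, hi uint64) ([][]byte, bool) {
	out := make([][]byte, 0, hi-lo)
	for i := lo; i < hi; i++ {
		e, ok := m[i]
		if !ok {
			return nil, false // gap => cache miss for the span
		}
		out = append(out, e)
	}
	return out, true
}

func main() {
	m := map[uint64][]byte{
		5: []byte("a"),
		6: []byte("b"),
		7: []byte("c"),
	}
	if got, ok := scan(m, 5, 8); ok {
		fmt.Println(len(got)) // 3
	}
}
```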
Before this change, we only removed entries from the RaftEntryCache when the entries were truncated. We now remove them from the cache immediately after applying them. We may pull these entries back into the cache if we need to catch up a slow follower that doesn't need a snapshot, but this should be rare.

The change also avoids adding entries to the RaftEntryCache at all if they are proposed and immediately committed. This effectively means that we never use the raftEntryCache in single-node clusters. This results in somewhere around a 1.5% speedup on `kv0` when tested on a single-node GCE cluster with an n1-highcpu-16 machine.

Release note (performance improvement): More aggressively prune the raft entry cache, keeping it to a more manageable size.
raftentry: rewrite raftEntryCache with HashMaps and a Replica-level LRU policy
### Performance

```
name                  old time/op    new time/op    delta
EntryCache-4          3.00ms ± 9%    0.21ms ± 2%    -92.90%  (p=0.000 n=10+10)
EntryCacheClearTo-4    313µs ± 9%      37µs ± 2%    -88.21%  (p=0.000 n=10+10)

name                  old alloc/op   new alloc/op   delta
EntryCache-4           113kB ± 0%     180kB ± 0%    +60.15%  (p=0.000 n=9+10)
EntryCacheClearTo-4   8.31kB ± 0%     0.00kB       -100.00%  (p=0.000 n=10+10)

name                  old allocs/op  new allocs/op  delta
EntryCache-4           2.02k ± 0%     0.01k ± 0%    -99.75%  (p=0.000 n=10+10)
EntryCacheClearTo-4     14.0 ± 0%      0.0         -100.00%  (p=0.000 n=10+10)
```
Force-pushed from 44b66ab to 6c212b7
Closed in favor of #32618.
First three commits from #30151.
!!! Disclaimer !!!
This approach has a lot of promise (see benchmarks below), but #30151 may have stolen some of its benefit by making the current `raftEntryCache` less of an issue. It's unclear at the moment whether such an aggressive change is still needed. At a minimum, it's probably too big of a change for 2.1.

### Problem
Profiles are showing that the `raftEntryCache` is a performance bottleneck on certain workloads. `tpcc1000` on a 3-node cluster with `n1-highcpu-16` machines presents the following profiles.

This CPU profile shows that the `raftEntryCache` is responsible for just over 3% of all CPU utilization on the node. Interestingly, the cost isn't centralized in a single method. Instead we can see both `addEntries` and `getEntries` in the profile. Most of what we see is `OrderedCache` manipulation as the cache interacts with its underlying red-black tree.

This blocking profile is filtered to show all Mutex contention (ignoring other forms like channels and Cond vars). We can see that blocking in the `raftEntryCache` is responsible for 99% of all Mutex contention. This seems absurd, but it adds up given that the cache is responsible for 3% of CPU utilization on a node and requires mutual exclusion across an entire `Store`.

We've also seen in changes like #29596 how expensive these cache accesses have become, especially as the cache grows, because most entry accesses take `log(n)` time, where `n` is the number of `raftpb.Entry`s in the cache across all Ranges on a Store.

### Rewrite
This PR rewrites the `raftEntryCache` as a new `storage/raftentries.Cache` type. The rewrite diverges from the original implementation in two important ways, both of which exploit and optimize for the unique access patterns that this cache is subject to.

### LLRB -> Nested HashMaps
The first change is that the rewrite trades in the balanced binary tree structure for a multi-level hashmap structure. This structure is preferable because it improves the time complexity of entry access. It's also more memory efficient and allocation friendly because it takes advantage of builtin Go hashmaps and their compile-time specialization. Finally, it should be more GC friendly because it cuts down on the number of pointers in use.
Of course, a major benefit of a balanced binary tree is that its elements are ordered, which allows it to accommodate range-based access patterns on a sparse set of elements. While the ordering across replicas was wasted with the `raftEntryCache`, it did appear that the ordering within replicas was useful. However, it turns out that Raft entries are densely ordered with discrete indices. This means that given a low and a high index, we can simply iterate through the range efficiently, without the need for an ordered data structure. This property was also exploited in 6e4e57f.

### Replica-level LRU policy
The second change is that the rewrite maintains an LRU-policy across Replicas, instead of across individual Raft entries. This makes maintaining the LRU linked-list significantly cheaper because we only need to update it once per Replica access instead of once per entry access. It also means that the linked-list will be significantly smaller, resulting in far fewer memory allocations and a reduced GC footprint.
This reduction in granularity of the LRU-policy changes its behavior. Instead of the cache holding on to the last N Raft entries accessed across all Replicas, it will hold on to all Raft entries for the last N Replicas accessed. I suspect that this is actually a better policy for Raft's usage because Replicas access entries in chunks and missing even a single entry in the cache results in an expensive RocksDB seek. Furthermore, I think the LRU policy, as it was, was actually somewhat pathological because when a replica looks up its entries, it almost always starts by requesting its oldest entry and scanning up from there.
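A Replica-level LRU can be sketched with a small linked list keyed by Replica rather than by entry — touching N entries of one Replica updates the list once, not N times, and eviction drops a whole Replica's entries at once. This is a hedged, hypothetical illustration (simplified types, not the PR's actual implementation, and bounding Replica count rather than bytes):

```go
package main

import (
	"container/list"
	"fmt"
)

// replicaCache maintains an LRU policy across Replicas: the list has one
// element per Replica, so it stays small and cheap to update.
type replicaCache struct {
	maxReplicas int
	lru         *list.List              // front = most recently used
	elems       map[int64]*list.Element // rangeID -> its LRU element
	entries     map[int64]map[uint64][]byte
}

func newReplicaCache(maxReplicas int) *replicaCache {
	return &replicaCache{
		maxReplicas: maxReplicas,
		lru:         list.New(),
		elems:       make(map[int64]*list.Element),
		entries:     make(map[int64]map[uint64][]byte),
	}
}

// touch marks a Replica as most recently used, evicting the least
// recently used Replica (and all of its entries) if over budget.
func (c *replicaCache) touch(rangeID int64) {
	if el, ok := c.elems[rangeID]; ok {
		c.lru.MoveToFront(el)
		return
	}
	c.elems[rangeID] = c.lru.PushFront(rangeID)
	c.entries[rangeID] = make(map[uint64][]byte)
	if c.lru.Len() > c.maxReplicas {
		evicted := c.lru.Remove(c.lru.Back()).(int64)
		delete(c.elems, evicted)
		delete(c.entries, evicted) // drop the Replica's entries wholesale
	}
}

func (c *replicaCache) add(rangeID int64, index uint64, data []byte) {
	c.touch(rangeID) // one LRU update per access, not per entry
	c.entries[rangeID][index] = data
}

func main() {
	c := newReplicaCache(2)
	c.add(1, 10, []byte("a"))
	c.add(2, 10, []byte("b"))
	c.add(3, 10, []byte("c")) // evicts Replica 1 and all of its entries
	_, ok := c.entries[1]
	fmt.Println(ok) // false
}
```

A real version would track bytes rather than Replica count to preserve a store-wide memory bound, per the earlier review concern.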
### Performance
Testing with TPC-C is still needed. I'll need to verify that the cache hit rate does not go down because of this change and that the change translates to the expected improvements on CPU and blocking profiles.
Release note (performance improvement): Rewrite Raft entry cache to optimize for access patterns, reduce lock contention, and reduce memory footprint.