
raftentry: rewrite raftEntryCache with HashMaps and a Replica-level LRU policy #30152

Conversation

nvanbenschoten
Member

First three commits from #30151.

!!! Disclaimer !!!

This approach has a lot of promise (see benchmarks below), but #30151 may have stolen some of its benefit by making the current raftEntryCache less of an issue. It's unclear at the moment whether such an aggressive change is still needed. At a minimum, it's probably too big of a change for 2.1.

Problem

Profiles are showing that the raftEntryCache is a performance bottleneck on certain workloads. tpcc1000 on a 3-node cluster with n1-highcpu-16 machines presents the following profiles.

<CPU profile: raftEntryCache at ~3% of CPU usage>

This CPU profile shows that the raftEntryCache is responsible for just over 3% of all CPU utilization on the node. Interestingly, the cost isn't centralized in a single method. Instead we can see both addEntries and getEntries in the profile. Most of what we see is OrderedCache manipulation as the cache interacts with its underlying red-black tree.

<blocking profile: raftEntryCache at ~99% of all mutex contention>

This blocking profile is filtered to show all Mutex contention (ignoring other forms like channels and Cond vars). We can see that blocking in the raftEntryCache is responsible for 99% of all Mutex contention. This seems absurd, but it adds up given that the cache is responsible for 3% of CPU utilization on a node and requires mutual exclusion across an entire Store.

We've also seen in changes like #29596 how expensive these cache accesses have become, especially as the cache grows, because most entry accesses take log(n) time, where n is the number of raftpb.Entry objects in the cache across all Ranges on a Store.

Rewrite

This PR rewrites the raftEntryCache as a new storage/raftentries.Cache type. The rewrite diverges from the original implementation in two important ways, both of which exploit and optimize for the unique access patterns that this cache is subject to.

LLRB -> Nested HashMaps

The first change is that the rewrite trades in the balanced binary tree structure for a multi-level hashmap structure. This structure is preferable because it improves the time complexity of entry access. It's also more memory efficient and allocation friendly because it takes advantage of builtin Go hashmaps and their compile-time specialization. Finally, it should be more GC friendly because it cuts down on the number of pointers in use.

Of course, a major benefit of a balanced binary tree is that its elements are ordered, which allows it to accommodate range-based access patterns on a sparse set of elements. While the ordering across replicas was wasted with the raftEntryCache, it did appear that the ordering within replicas was useful. However, it turns out that Raft entries are densely ordered with discrete indices. This means that given a low and a high index, we can simply iterate through the range efficiently, without the need for an ordered data-structure. This property was also exploited in 6e4e57f.
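
For concreteness, here is a minimal standalone sketch of what a nested-map layout with a dense index scan could look like. The names (`Cache`, `partition`, `getEntries`, and the local `entry` struct standing in for `raftpb.Entry`) are invented for illustration and are not the actual code in this PR:

```go
package raftentry

// Standalone sketch (not the PR's code): entries live in a map keyed by
// index inside a per-Replica partition, and partitions live in a map keyed
// by range ID. entry stands in for raftpb.Entry.
type entry struct {
	Index, Term uint64
	Data        []byte
}

type rangeID int64

type partition struct {
	entries map[uint64]entry // index -> entry; indices are dense
}

type Cache struct {
	parts map[rangeID]*partition
}

// getEntries scans the dense index range [lo, hi) directly. No ordered
// data-structure is needed: a miss at any index terminates the contiguous run.
func (c *Cache) getEntries(id rangeID, lo, hi uint64) []entry {
	p, ok := c.parts[id]
	if !ok || lo >= hi {
		return nil
	}
	ents := make([]entry, 0, hi-lo)
	for idx := lo; idx < hi; idx++ {
		e, ok := p.entries[idx]
		if !ok {
			break // stop at the first gap; callers fall back to storage
		}
		ents = append(ents, e)
	}
	return ents
}
```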

Replica-level LRU policy

The second change is that the rewrite maintains an LRU-policy across Replicas, instead of across individual Raft entries. This makes maintaining the LRU linked-list significantly cheaper because we only need to update it once per Replica access instead of once per entry access. It also means that the linked-list will be significantly smaller, resulting in far fewer memory allocations and a reduced GC footprint.

This reduction in granularity of the LRU-policy changes its behavior. Instead of the cache holding on to the last N Raft entries accessed across all Replicas, it will hold on to all Raft entries for the last N Replicas accessed. I suspect that this is actually a better policy for Raft's usage because Replicas access entries in chunks and missing even a single entry in the cache results in an expensive RocksDB seek. Furthermore, I think the LRU policy, as it was, was actually somewhat pathological because when a replica looks up its entries, it almost always starts by requesting its oldest entry and scanning up from there.
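
As a rough illustration of the Replica-level policy, here is a standalone sketch using container/list with one list element per Replica partition. All names and fields are hypothetical:

```go
package raftentry

import "container/list"

// Standalone sketch of a Replica-level LRU: one list element per Replica
// rather than one per entry. Names are illustrative only.
type rangeID int64

type replicaPartition struct {
	id    rangeID
	bytes int64             // memory charged to this partition
	ents  map[uint64][]byte // index -> encoded entry (stand-in)
}

type Cache struct {
	maxBytes, curBytes int64
	lru                *list.List                // front = most recently used partition
	parts              map[rangeID]*list.Element // element values are *replicaPartition
}

// touch marks a Replica's partition as most recently used, creating it if
// necessary, and evicts whole partitions from the back while over budget.
func (c *Cache) touch(id rangeID) *replicaPartition {
	if el, ok := c.parts[id]; ok {
		c.lru.MoveToFront(el)
		return el.Value.(*replicaPartition)
	}
	p := &replicaPartition{id: id, ents: make(map[uint64][]byte)}
	c.parts[id] = c.lru.PushFront(p)
	for c.curBytes > c.maxBytes && c.lru.Len() > 1 {
		victim := c.lru.Remove(c.lru.Back()).(*replicaPartition)
		delete(c.parts, victim.id)
		c.curBytes -= victim.bytes
	}
	return p
}
```

Entry insertion (not shown) would charge bytes to both the partition and the cache total before re-running the eviction loop, so the linked list is only touched once per Replica access.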

Performance

```
name                 old time/op    new time/op    delta
EntryCache-4           3.00ms ± 9%    0.21ms ± 2%   -92.90%  (p=0.000 n=10+10)
EntryCacheClearTo-4     313µs ± 9%      37µs ± 2%   -88.21%  (p=0.000 n=10+10)

name                 old alloc/op   new alloc/op   delta
EntryCache-4            113kB ± 0%     180kB ± 0%   +60.15%  (p=0.000 n=9+10)
EntryCacheClearTo-4    8.31kB ± 0%    0.00kB       -100.00%  (p=0.000 n=10+10)

name                 old allocs/op  new allocs/op  delta
EntryCache-4            2.02k ± 0%     0.01k ± 0%   -99.75%  (p=0.000 n=10+10)
EntryCacheClearTo-4      14.0 ± 0%       0.0       -100.00%  (p=0.000 n=10+10)
```

Testing with TPC-C is still needed. I'll need to verify that the cache hit rate does not go down because of this change and that the change translates to the expected improvements on CPU and blocking profiles.

Release note (performance improvement): Rewrite Raft entry cache to optimize for access patterns, reduce lock contention, and reduce memory footprint.


@petermattis
Collaborator

Interesting. We could also switch from an llrb to a btree, which might have some cache-locality perf benefits, or a skiplist, which allows for concurrency.

Is this intended as a prototype? Seems late in the release cycle for this to go into 2.1.

@nvanbenschoten
Member Author

Yes, I agree that this is too big of a change to make it into 2.1. Luckily #30151 reduces the immediate need for this.

A more efficient tree structure would help here, but I don't think that's the fundamental issue. The problem is that regardless of how efficient we make the tree data structure, the cache is still attempting to maintain an LRU policy across all Raft entries on a Store. This severely limits how much concurrency we can achieve in the cache because at some point during each operation we need to maintain the LRU accounting for every entry accessed.

Ironically, this LRU policy isn't actually enforced correctly because OrderedCache.Ceil and OrderedCache.DoRange bypass cache.baseCache's LRU structure. It turns out that this is why #27997 works correctly, even though it shouldn't.

After reflecting on this change a bit, I think the salient change here is the migration from an entry-granularity LRU policy to a Replica-granularity LRU policy. Once we make that change, we can revisit how we want to design the backing data-structure to hold each Replica's entries. For instance, this PR could easily swap out the inner map with a btree and still preserve the majority of the benefit/performance while also preserving the flexibility of an ordered structure.

We could also revisit the proposal for a partitioned locking scheme, where a cache access turns into the following (a rough sketch follows the list):

  1. lock raftEntryCache
  2. fetch replica partition, create if necessary
  3. update partition's LRU ordering
  4. unlock raftEntryCache
  5. perform intra-partition operations (under lock or not, depending on whether we want to support concurrent access to the same partition, which we don't currently use)
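
To make the lock hand-off concrete, here is a hedged sketch of that flow under hypothetical `Cache` and `partition` types; it is not the code in this PR:

```go
package raftentry

import "sync"

// Sketch of the partitioned locking scheme above; all names are illustrative.
// The Store-wide mutex covers only steps 1-4, and the intra-partition work in
// step 5 runs outside of it.
type partition struct {
	mu   sync.Mutex        // only needed if the same partition is accessed concurrently
	ents map[uint64][]byte // index -> encoded entry
}

type Cache struct {
	mu    sync.Mutex           // guards the partition map and LRU ordering only
	parts map[int64]*partition // rangeID -> partition
	// LRU ordering of partitions omitted for brevity.
}

func (c *Cache) getEntries(rangeID int64, lo, hi uint64) [][]byte {
	// Steps 1-4: a short critical section under the Store-wide mutex.
	c.mu.Lock()
	p, ok := c.parts[rangeID]
	if !ok {
		p = &partition{ents: make(map[uint64][]byte)}
		c.parts[rangeID] = p
	}
	// (step 3: move p to the front of the LRU ordering here)
	c.mu.Unlock()

	// Step 5: per-partition work without blocking other Replicas.
	p.mu.Lock()
	defer p.mu.Unlock()
	var out [][]byte
	for idx := lo; idx < hi; idx++ {
		e, ok := p.ents[idx]
		if !ok {
			break
		}
		out = append(out, e)
	}
	return out
}
```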

@nvanbenschoten
Member Author

Hah, looking into a completely unrelated issue led me to #13974 (comment). Am I just a year late on all of this?

@petermattis
Collaborator

Hah, looking into a completely unrelated issue led me to #13974 (comment). Am I just a year late on all of this?

Well, we never did anything significant back then because changes here didn't seem to move the needle. Cockroach has improved significantly in the interim, so the Raft entry cache is worth revisiting.

@nvanbenschoten
Member Author

I don't think what I have here is necessarily the right approach, but I think we need to do something here early in the 2.2 release cycle. This cache is currently the biggest source of contention in our system, and I have reason to believe it's a serious bottleneck for high throughput on large machines (> 4 cores). Here are some of the things I've seen over the past two weeks:

  • The cache shows up as the dominant source of lock contention in mutex profiles (~90% of all contended locking).
  • The cache is the source of a significant number of goroutine preemptions in execution traces. This is serious in its own right, but it's extra problematic because it is causing raft workers to be preempted while holding the raftMu.
  • Changing the cache access patterns even slightly can tank performance. For instance, #30151 (storage: remove entries from RaftEntryCache on Raft application) attempts to keep the cache small by aggressively clearing entries from it. This introduces a few more accesses, which I suspect is exacerbating the lock contention issue. As a result, when running with #30151, the throughput of a TPC-C experiment I was running dropped by about 15% and log truncations (which also touch the cache) got completely backed up. The good news is that this may imply the potential for large throughput wins if we can speed up this cache and reduce lock contention. I haven't actually done any real perf testing on this PR in its current state, but it's worth doing to serve as another data point.

I also suspect that the cache in its current form is having a serious effect on the GC. In my TPC-C experiment, the cache averages a size of around 120,000 entries. Each entry currently results in 2 heap-allocated objects (a cache.Entry+entryCacheKey+raftpb.Entry combo alloc and an llrb.Node) and 10 pointers (if I counted correctly). Lookups in the cache also result in allocations because the OrderedCache exposes a generic, interface-happy API. I should run an experiment without the cache to see its effect on the GC, but my intuition tells me that this is a serious issue.

I don't have a concrete proposal yet for what a full future version of this cache should look like, but I have a few general design considerations:

  • the cache's indexing structure and replacement policy should be specialized to our Raft processing access patterns. My understanding is that the cache performs rapid sequential insertion and deletion with occasional repeat access to entries while trying to catch straggler replicas up. We should verify this and then answer the following questions: Is an LRU replacement policy well suited for these access patterns? Do we need to use an ordered data structure or can we rely on dense entry indices? Is the cache accessed differently by leaders and followers? These are some of the questions I started to tackle in this PR.
  • what kind of concurrent access should the cache support? If we didn't have the Store-wide memory bound and the LRU policy then we could have a single cache on each Replica living under its raftMu and would have no concerns about concurrent access whatsoever. We should work backwards from that point. This implies that the answer to the question of concurrency will be dictated by the Store-wide constraints and the replacement policy we decide on.
  • what kind of impact should we allow the cache to have on the GC? The timestamp cache rewrite used an arena allocator and uintptrs to be completely invisible to the GC. We should almost certainly do something similar here. This may mean that we should adopt a similar approach of exposing a write-only API and performing evictions through arena "rotation" (a rough sketch of that idea follows below).
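
Purely as a thought experiment, here is one shape the write-only/rotation idea could take. Everything in this sketch is invented for illustration and is not based on the timestamp cache's actual implementation:

```go
package raftentry

// Hypothetical sketch of eviction via segment "rotation" rather than
// per-entry LRU bookkeeping. Entries are only ever appended into the active
// segment; when it fills up, the older segment is dropped wholesale.
type span struct{ off, len int }

type segment struct {
	buf []byte          // flat arena holding encoded entries back-to-back
	idx map[uint64]span // entry index -> location in buf (per-range in practice)
}

type Cache struct {
	segSize  int
	active   *segment
	previous *segment // still readable until the next rotation evicts it
}

func newSegment(size int) *segment {
	return &segment{buf: make([]byte, 0, size), idx: make(map[uint64]span)}
}

// add appends an already-encoded entry, rotating segments when the active one
// is full. Eviction is a single pointer swap; no per-entry accounting.
func (c *Cache) add(index uint64, enc []byte) {
	if len(c.active.buf)+len(enc) > c.segSize {
		c.previous, c.active = c.active, newSegment(c.segSize)
	}
	c.active.idx[index] = span{off: len(c.active.buf), len: len(enc)}
	c.active.buf = append(c.active.buf, enc...)
}
```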

@petermattis
Collaborator

If we didn't have the Store-wide memory bound...

We need to be careful about removing the store-wide memory bound. That bound is extremely useful. Having per-replica memory bounds can lead to unexpected large memory usage.

This may mean that we should adopt a similar approach of exposing a write-only API and performing evictions through arena "rotation".

I doubt an LRU cache is necessary for the raft entry cache. It was likely just the tool that was available, so it was used. Performing evictions through arena "rotation" à la the timestamp cache would likely be fine from a caching perspective, and a lot faster.

I've spent a bunch of time in the arenaskl code (see arenaskl). Specializing it for the raft entry cache would not be difficult (a week of work). One item I would pay attention to here is the interface. We cache raft.Entry, and the natural way to store this in an arena would be to encode the protobuf, but that protobuf encoding might imply overhead. It might be better to custom-encode the individual raft.Entry fields. That struct hasn't changed in a long time, and this would be in-memory only, so there are no long-term compatibility worries.
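
As a sketch of what such a custom encoding might look like: the field set mirrors raft.Entry (Term, Index, Type, Data), but the layout (three uvarints followed by the raw Data bytes) and the helper names are invented for illustration, and the code assumes Go 1.19+ for binary.AppendUvarint:

```go
package raftentry

import "encoding/binary"

// encodeEntry appends a hand-rolled, in-memory-only encoding of a Raft
// entry's fields to buf, as an alternative to protobuf marshaling.
func encodeEntry(buf []byte, term, index, typ uint64, data []byte) []byte {
	buf = binary.AppendUvarint(buf, term)
	buf = binary.AppendUvarint(buf, index)
	buf = binary.AppendUvarint(buf, typ)
	return append(buf, data...)
}

// decodeEntry assumes the caller knows the entry's byte span (e.g. from the
// arena's index), so everything after the header is Data.
func decodeEntry(buf []byte) (term, index, typ uint64, data []byte) {
	var n int
	term, n = binary.Uvarint(buf)
	buf = buf[n:]
	index, n = binary.Uvarint(buf)
	buf = buf[n:]
	typ, n = binary.Uvarint(buf)
	return term, index, typ, buf[n:]
}
```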

Do we need to use an ordered data structure or can we rely on dense entry indices?

The dense entry indices are ordered though, right? I think we need ordering of some form, but you're right that the ordering doesn't imply a tree structure. I'm not seeing how we'd utilize the dense entry indices in a structure, but that could just be a failure of my brain right now. An arenaskl approach gives us something like a FIFO queue with random access. That seems to match well with the requirements for the raft entry cache.

@nvanbenschoten
Member Author

We need to be careful about removing the store-wide memory bound. That bound is extremely useful.

Yes, I agree. I was just pointing out that it's the only reason why we need to have any synchronization at all between Replicas.

Performing evictions through arena "rotation" ala the timestamp cache would likely be fine from a caching perspective, and a lot faster.

That's my expectation as well, although I'd like to confirm that the access patterns for this cache are amenable to that.

I've spent a bunch of time in the arenaskl code (see arenaskl). Specializing it for the raft entry cache would not be difficult (a week of work). One item I would pay attention to here is the interface. We cache raft.Entry and the natural way to store this in an arena would be to encode the protobuf, but that protobuf encoding might imply overhead. It might be better to custom encode the individual raft.Entry fields. That struct hasn't changed in a long time and this would be an in-memory usage so there is no long term compatibility worries.

Yes, that's exactly what I was thinking. A similar option would be to store a struct similar to raft.Entry inline in the skiplist and have only the Data field stored in the arena. This might not be workable with how arenaskl performs atomic operations though. Either way, I'd like to follow arenaskl's approach of making this all invisible to the GC by avoiding pointers.

The dense entry indices are ordered though, right? I think we need ordering of some form, but you're right that the ordering doesn't imply a tree structure. I'm not seeing how we'd utilize the dense entry indices in a structure, but that could just be a failure of my brain right now.

What I was implying was that we should consider an approach like I have in this PR with the inner hashmap. It's able to get away with an unordered structure by relying on information already maintained by the Replica (first index in the cache = truncated index, last index in the cache = last index). This allows it to perform a few tricks, but as a result, it also constrains the interface it's able to provide.

@nvanbenschoten
Member Author

Closed in favor of #32618.

@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/raftEntryCacheRew branch December 19, 2018 00:52