Lock-Free Clock Cache #10306

Open · 17 tasks done
guidotag opened this issue Jul 4, 2022 · 7 comments

guidotag commented Jul 4, 2022

Lock-Free Clock Cache

The current default block cache implementation, LRUCache, spends a significant amount of time waiting on and handling locks, because every operation must acquire a per-shard lock. The goal of this project is to implement and evaluate an alternative block cache that uses finer-grained locking or no locking at all, while also fixing other performance hotspots, such as the eviction algorithm's maintenance costs (i.e., the overhead of maintaining an LRU list) and indexing costs. Ideally, the new implementation should outperform LRUCache across a range of typical workloads, with a focus on the read path.

Part 1: First steps with FastLRUCache

FastLRUCache is a fork of the existing LRUCache that provides a prototyping space for our new cache design. Initially, FastLRUCache was simply a subset of LRUCache, on which we could incrementally develop and evaluate alternatives. The first step was to modify the existing tests to also cover FastLRUCache.

We also produced a set of benchmarks that would serve as our performance baseline. The two existing benchmarking tools provided by RocksDB are cache_bench (which measures only block cache performance) and db_bench (which measures system-wide performance). I wrote notes that document my learning process with both tools, as well as some conversations with @mdcallag and @pdillinger. These notes conclude with a collection of experiments that we may use to compare the different versions of the block cache. The takeaway is that for quick performance testing we will use cache_bench on random-reads workloads, varying the size of the set of keys that a workload samples from, which we call support size. You can find the notes here. The actual code for cache_bench experiments is here and for db_bench here.

Part 2: Implementing open-addressed hash table in FastLRUCache

An open-addressed hash table stores all of its elements in a pre-allocated array of slots. This means that each handle (i.e., all element metadata, including the key) must have a fixed length. Internally, RocksDB always uses FastLRUCache with 16-byte keys, but the API doesn't reflect this. The first step was to forbid other key sizes.
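To make the fixed-length requirement concrete, a handle might look roughly like the following sketch. This is hypothetical (the field names and layout are illustrative, and the actual handle carries more metadata), but it shows why every field, including the key, must have a compile-time size:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical fixed-length handle. Because every field has a
// compile-time size, handles can be stored directly in a
// pre-allocated array of slots, with no per-entry heap allocation.
struct FixedHandle {
  char key[16];                // keys are always exactly 16 bytes
  void* value;                 // the cached block
  uint32_t hash;               // cached hash of the key
  std::atomic<uint32_t> refs;  // reference count
  // ... deleter, charge, flags, etc.
};
```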

The next step was to implement the array pre-allocation on top of the current hash table design, namely a chained hash table. In a chained hash table, a base table stores the pointers to the chains and is dynamically resized to maintain an average per-chain load of at most 1. Assuming keys are 16 bytes long, and given an estimate of the size of the values as input, we can estimate the size of a handle. Thus, for a given total capacity, we can calculate how many slots the table will need when it becomes full.
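In code, that slot-count estimate could look like the following sketch (the per-handle overhead constant and the load factor here are illustrative assumptions, not RocksDB's actual values):

```cpp
#include <cstddef>

// Illustrative constants, not the actual RocksDB values.
constexpr size_t kHandleOverhead = 48;  // assumed per-handle metadata bytes
constexpr double kLoadFactor = 0.7;     // maximum table load factor to enforce

// Estimate how many table slots a full cache will need, given the total
// capacity in bytes and the user-provided estimate of the value size.
size_t EstimateSlotCount(size_t capacity, size_t estimated_value_size) {
  // Each occupied slot charges roughly one value plus handle metadata.
  size_t per_entry = estimated_value_size + kHandleOverhead;
  size_t max_entries = capacity / per_entry;
  // Oversize the table so it stays below the target load factor even
  // when the cache is full.
  return static_cast<size_t>(max_entries / kLoadFactor) + 1;
}
```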

At this point we were ready to actually implement open addressing.

At the end of this milestone, we wanted to measure the performance of FastLRUCache, compare it to the baseline, and evaluate next steps.

Here are the slides of a presentation with preliminary measurements using cache_bench. Here are the key results (the absolute values differ slightly from the slides' because of tweaks and improvements to our experiments):

[Figure: fast-lru-cache benchmark results (FastLRUCache variants vs. LRUCache, cache_bench)]

This is for a 1GB cache size, 8kB blocks, 64 shards, 16 threads, 10M operations per thread. We used Folly distributed mutexes (which make LRUCache 20% faster than using std::mutex). The percentage next to each FastLRUCache implementation is the load factor enforced on the hash table.

The bottom line is that FastLRUCache does not substantially improve over LRUCache. Some profiling revealed that, contrary to our intuition, pre-allocating the table and avoiding pointer chasing during lookups are a low-order term in the total execution cost (at least in the random-reads-type workloads that we used), and that the biggest optimization opportunity lies in locking and LRU maintenance. We concluded from this data that we had to jump straight into the development of a lock-free clock cache, rather than continue optimizing the hash table design.

Part 3: ClockCache with lock-free Lookup and Release

The first step was to implement a lock-based version of ClockCache. We jumpstarted the implementation using FastLRUCache's hash table design.
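For context, the heart of a clock eviction pass looks roughly like the following simplified, lock-based sketch (not RocksDB's actual code): each slot carries a reference bit that lookups set, and a clock hand sweeps the array, clearing those bits and evicting the first unreferenced entry it finds.

```cpp
#include <cstddef>
#include <vector>

// Simplified clock eviction over an array of slots. Each slot carries a
// "second chance" bit that lookups set and the clock hand clears.
struct Slot {
  bool occupied = false;
  bool referenced = false;  // set on lookup, cleared by the clock hand
  // ... handle data lives here in the real table
};

size_t clock_hand = 0;

// Advance the hand until we find a victim: an occupied slot whose
// reference bit is clear. Assumes at least one slot is occupied, so
// the loop terminates within two full passes over the table.
size_t EvictOne(std::vector<Slot>& table) {
  for (;;) {
    Slot& s = table[clock_hand];
    size_t current = clock_hand;
    clock_hand = (clock_hand + 1) % table.size();
    if (!s.occupied) continue;
    if (s.referenced) {
      s.referenced = false;  // give it a second chance
    } else {
      s.occupied = false;    // evict this entry
      return current;
    }
  }
}
```

Unlike LRU, there is no list to splice on every access: a lookup only sets a bit, which is what makes the algorithm amenable to a lock-free implementation later on.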

The following graph shows a comparison between ClockCache, FastLRUCache and LRUCache using cache_bench, with the same parameters as before:

[Figure: clock-cache benchmark results (ClockCache vs. FastLRUCache vs. LRUCache, cache_bench)]

In the workloads with higher contention (i.e., smaller support size), this version of the ClockCache outperforms LRUCache. Under low contention, it essentially matches LRUCache.

Currently the table uses two 32-bit hashes to generate the probes, and these hashes are computed on every lookup. To avoid the recomputation, we want to reuse random bits from the handle's hash, the hash that ShardedCache computes to select the shard. Unfortunately, that hash is only 32 bits long, which is not enough randomness to derive the probes. The following PR generalizes the ShardedCache API to allow for arbitrary hashes.
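Concretely, the probing scheme is double hashing. The sketch below shows how two probe values could be drawn from disjoint bits of a single wide hash, which is why a wider hash would let us avoid recomputation (a sketch under that assumption, not the actual RocksDB code):

```cpp
#include <cstddef>
#include <cstdint>

// Double hashing: probe i visits (h1 + i * h2) mod table_size.
// Taking h1 and h2 from disjoint halves of one 64-bit hash avoids
// computing two independent 32-bit hashes on every lookup.
size_t Probe(uint64_t hash64, size_t i, size_t table_size) {
  uint64_t h1 = static_cast<uint32_t>(hash64);        // low 32 bits
  uint64_t h2 = static_cast<uint32_t>(hash64 >> 32);  // high 32 bits
  // Forcing the increment to be odd makes the probe sequence visit
  // every slot when the table size is a power of two.
  uint64_t increment = h2 | 1;
  return (h1 + i * increment) % table_size;
}
```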

#10299 is finished but not merged yet. Because it changes a mid-level API, we're holding it until we are confident that the performance penalty of recomputing the probe hashes is significant enough to justify the change.

We implemented midpoint insertion in the clock algorithm to allow for different insertion priorities: index and filter blocks should be harder to evict than data blocks.
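At the Cache API level, callers already express this through the `Cache::Priority` argument to `Insert`. A simplified usage sketch against the API as it stood at the time (deleter and error handling omitted for brevity):

```cpp
#include "rocksdb/cache.h"

// Insert index/filter blocks with high priority so the eviction
// algorithm treats them as harder to evict than data blocks.
rocksdb::Status InsertBlock(rocksdb::Cache* cache, const rocksdb::Slice& key,
                            void* block, size_t charge,
                            bool is_index_or_filter) {
  auto priority = is_index_or_filter ? rocksdb::Cache::Priority::HIGH
                                     : rocksdb::Cache::Priority::LOW;
  return cache->Insert(key, block, charge, /*deleter=*/nullptr,
                       /*handle=*/nullptr, priority);
}
```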

At this point we were ready to take our first steps towards a lock-free implementation. We started by making only the read operations, Lookup and Release, lock-free. Here is a very rough design doc. In retrospect, I realize I used this doc mostly to organize my thoughts; some of the terminology and details changed in the final version. Still, it's a good outline of our work, and it captures some good questions and concerns raised by the reviewers that ended up shaping the implementation.
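The crux of lock-free Lookup and Release is manipulating the handle's reference count and visibility with atomic read-modify-write operations instead of taking the shard mutex. The following is a minimal sketch of the pattern only; the state encoding is hypothetical, and the real implementation packs more state into the atomic word:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical handle state word: the low bits hold the reference
// count, and one high bit marks the entry as visible (i.e., not
// under construction or being evicted).
struct HandleState {
  static constexpr uint64_t kVisibleBit = uint64_t{1} << 63;
  static constexpr uint64_t kRefUnit = 1;

  std::atomic<uint64_t> word{0};

  // Lock-free Lookup: atomically take a reference, but only while the
  // entry is visible. Returns false if the slot cannot be used.
  bool TryRef() {
    uint64_t w = word.load(std::memory_order_acquire);
    while (w & kVisibleBit) {
      if (word.compare_exchange_weak(w, w + kRefUnit,
                                     std::memory_order_acq_rel)) {
        return true;
      }
      // On failure, w has been reloaded; re-check visibility and retry.
    }
    return false;
  }

  // Lock-free Release: drop the reference with a single atomic op.
  void Unref() { word.fetch_sub(kRefUnit, std::memory_order_release); }
};
```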

Here are the performance measurements of lock-free ClockCache on cache_bench:

[Figure: partially-lock-free-clock-cache benchmark results (ClockCache with lock-free Lookup/Release, cache_bench)]

Not bad!

Part 4: Production-ready, fully lock-free clock cache

guidotag self-assigned this Jul 4, 2022
guidotag changed the title from Lock-free Clock Cache to Lock-Free Clock Cache on Jul 5, 2022
gitbw95 self-assigned this Jul 5, 2022
@RoeyMaor
Contributor

Hi @guidotag,
I would love to have access to the performance comparison document:
https://docs.google.com/spreadsheets/d/1n3NiPeKJO5R7oQb41ypaJawRxLKXFlFh5MLObkaFdr8/edit?usp=sharing
It seems very interesting, but there's a technical issue with sending access requests.

@guidotag
Contributor Author

Hi @RoeyMaor. I'm not sure if it's okay for me to give you access to internal docs. Instead, I have just embedded the main results on this website so everyone can see them.

@RoeyMaor
Contributor

> Hi @RoeyMaor. I'm not sure if it's okay for me to give you access to internal docs. Instead, I have just embedded the main results on this website so everyone can see them.

Thanks!

@frew

frew commented Apr 2, 2024

Hi @guidotag - curious where the clock cache implementation is at currently?

I see https://smalldatum.blogspot.com/2022/10/hyping-hyper-clock-cache-in-rocksdb.html with positive results but still see the warning at https://github.com/facebook/rocksdb/wiki/Block-Cache#hyperclockcache - we're running into some contention issues with LRUCache and wondering if it makes sense to try out ClockCache!

@qtcwt

qtcwt commented Aug 27, 2024

Hi there,

Is there any update on this task? What are the remaining items that prevent it from being production-ready?

We are encountering a very complex bottleneck due to LRU lock contention, and it seems like this hyper clock cache is the only way forward for us.

Thanks!

@guidotag
Contributor Author

guidotag commented Sep 3, 2024

@pdillinger there are a few folks here interested in knowing more about the status of HyperClockCache. Is it ready for production use?
