Lock-Free Clock Cache #10306

Open · 17 tasks done
guidotag opened this issue Jul 4, 2022 · 7 comments

guidotag commented Jul 4, 2022

Lock-Free Clock Cache

The current default block cache implementation, LRUCache, spends a significant amount of time waiting on and handling locks, because every operation must acquire a per-shard lock. The goal of this project is to implement and evaluate an alternative block cache that uses finer-grained locking or no locking at all, while also fixing other performance hotspots, such as the eviction algorithm's maintenance costs (i.e., the overhead of maintaining an LRU list) and indexing costs. Ideally, the new implementation should outperform LRUCache across a range of typical workloads, with a focus on the read path.

Part 1: First steps with FastLRUCache

FastLRUCache is a fork of the existing LRUCache that provides a prototyping space for our new cache design. Initially, FastLRUCache was simply a subset of LRUCache, on which we could incrementally develop and evaluate alternatives. The first step was to modify the existing tests to also cover FastLRUCache.

We also produced a set of benchmarks that would serve as our performance baseline. The two existing benchmarking tools provided by RocksDB are cache_bench (which measures only block cache performance) and db_bench (which measures system-wide performance). I wrote notes that document my learning process with both tools, as well as some conversations with @mdcallag and @pdillinger. These notes conclude with a collection of experiments that we may use to compare the different versions of the block cache. The takeaway is that for quick performance testing we will use cache_bench on random-reads workloads, varying the size of the set of keys that a workload samples from, which we call support size. You can find the notes here. The actual code for cache_bench experiments is here and for db_bench here.

Part 2: Implementing open-addressed hash table in FastLRUCache

An open-addressed hash table stores all of its elements in a pre-allocated array of slots. This means that each handle (i.e., all element metadata, including the key) must have a fixed length. Internally, RocksDB always uses FastLRUCache with 16-byte keys, but the API doesn't reflect this. The first step was to forbid other key sizes.
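To make the fixed-length requirement concrete, a handle might look roughly like the following sketch. This is hypothetical (the field names and layout are illustrative, and the actual handle carries more metadata), but it shows why every field, including the key, must have a compile-time size:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical fixed-length handle. Because every field has a
// compile-time size, handles can be stored directly in a
// pre-allocated array of slots, with no per-entry heap allocation.
struct FixedHandle {
  char key[16];                // keys are always exactly 16 bytes
  void* value;                 // the cached block
  uint32_t hash;               // cached hash of the key
  std::atomic<uint32_t> refs;  // reference count
  // ... deleter, charge, flags, etc.
};
```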

The next step was to implement the array pre-allocation on top of the current hash table design, namely a chained hash table. In a chained hash table, a base table stores the pointers to the chains and is dynamically resized to maintain an average per-chain load of at most 1. Assuming keys are 16 bytes long, and given an estimate of the size of the values as input, we can estimate the size of a handle. Thus, for a given total capacity, we can calculate how many slots the table will need when it becomes full.
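In code, that slot-count estimate could look like the following sketch (the per-handle overhead constant and the load factor here are illustrative assumptions, not RocksDB's actual values):

```cpp
#include <cstddef>

// Illustrative constants, not the actual RocksDB values.
constexpr size_t kHandleOverhead = 48;  // assumed per-handle metadata bytes
constexpr double kLoadFactor = 0.7;     // maximum table load factor to enforce

// Estimate how many table slots a full cache will need, given the total
// capacity in bytes and the user-provided estimate of the value size.
size_t EstimateSlotCount(size_t capacity, size_t estimated_value_size) {
  // Each occupied slot charges roughly one value plus handle metadata.
  size_t per_entry = estimated_value_size + kHandleOverhead;
  size_t max_entries = capacity / per_entry;
  // Oversize the table so it stays below the target load factor even
  // when the cache is full.
  return static_cast<size_t>(max_entries / kLoadFactor) + 1;
}
```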

At this point we were ready to actually implement open addressing.

At the end of this milestone, we wanted to measure the performance of FastLRUCache, compare it to the baseline, and evaluate next steps.

Here are the slides of a presentation with preliminary measurements using cache_bench. Here are the key results (the absolute values differ slightly from the slides' because of tweaks and improvements to our experiments):

[Figure: fast-lru-cache benchmark results (FastLRUCache variants vs. LRUCache, cache_bench)]

This is for a 1GB cache size, 8kB blocks, 64 shards, 16 threads, 10M operations per thread. We used Folly distributed mutexes (which make LRUCache 20% faster than using std::mutex). The percentage next to each FastLRUCache implementation is the load factor enforced on the hash table.

The bottom line is that FastLRUCache does not substantially improve over LRUCache. Some profiling revealed that, contrary to our intuition, pre-allocating the table and avoiding pointer chasing during lookups are a low-order term in the total execution cost (at least in the random-reads-type workloads that we used), and that the biggest optimization opportunity lies in locking and LRU maintenance. We concluded from this data that we had to jump straight into the development of a lock-free clock cache, rather than continue optimizing the hash table design.

Part 3: ClockCache with lock-free Lookup and Release

The first step was to implement a lock-based version of ClockCache. We jumpstarted the implementation using FastLRUCache's hash table design.
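For context, the heart of a clock eviction pass looks roughly like the following simplified, lock-based sketch (not RocksDB's actual code): each slot carries a reference bit that lookups set, and a clock hand sweeps the array, clearing those bits and evicting the first unreferenced entry it finds.

```cpp
#include <cstddef>
#include <vector>

// Simplified clock eviction over an array of slots. Each slot carries a
// "second chance" bit that lookups set and the clock hand clears.
struct Slot {
  bool occupied = false;
  bool referenced = false;  // set on lookup, cleared by the clock hand
  // ... handle data lives here in the real table
};

size_t clock_hand = 0;

// Advance the hand until we find a victim: an occupied slot whose
// reference bit is clear. Assumes at least one slot is occupied, so
// the loop terminates within two full passes over the table.
size_t EvictOne(std::vector<Slot>& table) {
  for (;;) {
    Slot& s = table[clock_hand];
    size_t current = clock_hand;
    clock_hand = (clock_hand + 1) % table.size();
    if (!s.occupied) continue;
    if (s.referenced) {
      s.referenced = false;  // give it a second chance
    } else {
      s.occupied = false;    // evict this entry
      return current;
    }
  }
}
```

Unlike LRU, there is no list to splice on every access: a lookup only sets a bit, which is what makes the algorithm amenable to a lock-free implementation later on.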

The following graph shows a comparison between ClockCache, FastLRUCache and LRUCache using cache_bench, with the same parameters as before:

[Figure: clock-cache benchmark results (ClockCache vs. FastLRUCache vs. LRUCache, cache_bench)]

In the workloads with higher contention (i.e., smaller support size), this version of the ClockCache outperforms LRUCache. Under low contention, it essentially matches LRUCache.

Currently the table uses two 32-bit hashes to generate the probes, and these hashes are computed on every lookup. To avoid the recomputation, we want to reuse random bits from the handle's hash, the hash that ShardedCache computes to select the shard. Unfortunately, that hash is only 32 bits long, which is not enough randomness to derive the probes. The following PR generalizes the ShardedCache API to allow for arbitrary hashes.
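Concretely, the probing scheme is double hashing. The sketch below shows how two probe values could be drawn from disjoint bits of a single wide hash, which is why a wider hash would let us avoid recomputation (a sketch under that assumption, not the actual RocksDB code):

```cpp
#include <cstddef>
#include <cstdint>

// Double hashing: probe i visits (h1 + i * h2) mod table_size.
// Taking h1 and h2 from disjoint halves of one 64-bit hash avoids
// computing two independent 32-bit hashes on every lookup.
size_t Probe(uint64_t hash64, size_t i, size_t table_size) {
  uint64_t h1 = static_cast<uint32_t>(hash64);        // low 32 bits
  uint64_t h2 = static_cast<uint32_t>(hash64 >> 32);  // high 32 bits
  // Forcing the increment to be odd makes the probe sequence visit
  // every slot when the table size is a power of two.
  uint64_t increment = h2 | 1;
  return (h1 + i * increment) % table_size;
}
```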

#10299 is finished but not merged yet. Because it changes a mid-level API, we're holding it until we are confident that the performance penalty of recomputing the probe hashes is significant enough to justify the change.

We implemented midpoint insertion in the clock algorithm to allow for different insertion priorities: index and filter blocks should be harder to evict than data blocks.
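At the Cache API level, callers already express this through the `Cache::Priority` argument to `Insert`. A simplified usage sketch against the API as it stood at the time (deleter and error handling omitted for brevity):

```cpp
#include "rocksdb/cache.h"

// Insert index/filter blocks with high priority so the eviction
// algorithm treats them as harder to evict than data blocks.
rocksdb::Status InsertBlock(rocksdb::Cache* cache, const rocksdb::Slice& key,
                            void* block, size_t charge,
                            bool is_index_or_filter) {
  auto priority = is_index_or_filter ? rocksdb::Cache::Priority::HIGH
                                     : rocksdb::Cache::Priority::LOW;
  return cache->Insert(key, block, charge, /*deleter=*/nullptr,
                       /*handle=*/nullptr, priority);
}
```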

At this point we were ready to take our first steps towards a lock-free implementation. We started by making only the read operations, Lookup and Release, lock-free. Here is a very rough design doc. In retrospect, I realize I used this doc mostly to organize my thoughts; some of the terminology and details changed in the final version. Still, it's a good outline of our work, and it captures some good questions and concerns raised by the reviewers that ended up shaping the implementation.
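The crux of lock-free Lookup and Release is manipulating the handle's reference count and visibility with atomic read-modify-write operations instead of taking the shard mutex. The following is a minimal sketch of the pattern only; the state encoding is hypothetical, and the real implementation packs more state into the atomic word:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical handle state word: the low bits hold the reference
// count, and one high bit marks the entry as visible (i.e., not
// under construction or being evicted).
struct HandleState {
  static constexpr uint64_t kVisibleBit = uint64_t{1} << 63;
  static constexpr uint64_t kRefUnit = 1;

  std::atomic<uint64_t> word{0};

  // Lock-free Lookup: atomically take a reference, but only while the
  // entry is visible. Returns false if the slot cannot be used.
  bool TryRef() {
    uint64_t w = word.load(std::memory_order_acquire);
    while (w & kVisibleBit) {
      if (word.compare_exchange_weak(w, w + kRefUnit,
                                     std::memory_order_acq_rel)) {
        return true;
      }
      // On failure, w has been reloaded; re-check visibility and retry.
    }
    return false;
  }

  // Lock-free Release: drop the reference with a single atomic op.
  void Unref() { word.fetch_sub(kRefUnit, std::memory_order_release); }
};
```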

Here are the performance measurements of lock-free ClockCache on cache_bench:

[Figure: partially-lock-free-clock-cache benchmark results (ClockCache with lock-free Lookup/Release, cache_bench)]

Not bad!

Part 4: Production-ready, fully lock-free clock cache

guidotag self-assigned this Jul 4, 2022
guidotag changed the title from Lock-free Clock Cache to Lock-Free Clock Cache on Jul 5, 2022
gitbw95 self-assigned this Jul 5, 2022
@RoeyMaor
Contributor

Hi @guidotag,
I would love to have access to the performance comparison document:
https://docs.google.com/spreadsheets/d/1n3NiPeKJO5R7oQb41ypaJawRxLKXFlFh5MLObkaFdr8/edit?usp=sharing
It seems very interesting, but there's a technical issue with sending access requests.

@guidotag
Contributor Author

Hi @RoeyMaor. I'm not sure if it's okay for me to give you access to internal docs. Instead, I have just embedded the main results on this website so everyone can see them.

@RoeyMaor
Contributor

> Hi @RoeyMaor. I'm not sure if it's okay for me to give you access to internal docs. Instead, I have just embedded the main results on this website so everyone can see them.

Thanks!

@frew

frew commented Apr 2, 2024

Hi @guidotag - curious where the clock cache implementation is at currently?

I see https://smalldatum.blogspot.com/2022/10/hyping-hyper-clock-cache-in-rocksdb.html with positive results but still see the warning at https://github.com/facebook/rocksdb/wiki/Block-Cache#hyperclockcache - we're running into some contention issues with LRUCache and wondering if it makes sense to try out ClockCache!

@qtcwt

qtcwt commented Aug 27, 2024

Hi there,

Is there any update on this task? What are the remaining items that prevent it from being production-ready?

We are encountering a very complex bottleneck due to LRU lock contention, and it seems like this hyper clock cache is the only way forward for us.

Thanks!

@guidotag
Contributor Author

guidotag commented Sep 3, 2024

@pdillinger there are a few folks here interested in knowing more about the status of HyperClockCache. Is it ready for production use?
