mv hot threads
- merged to master November 15, 2013
- code complete November 6, 2013
- development started October 4, 2013
The initial branch, mv-hot-threads1, contains the second of two fixes first released in the 1.4.2-turner branch. A second branch, mv-hot-threads2, contains the full implementation discussed here. Both branches have been merged to master.
Google's original leveldb contained one thread for background compactions and performed writes to the recovery log from the user's thread. It stalls the user's thread during a write operation whenever background compaction cannot keep up with the ingest rate. Riak runs multiple leveldb databases (vnodes), which compounded the Google stalls such that a single stall could hold multiple user write threads for minutes. Riak has several error conditions based upon 60 second timeouts, and the stalls triggered those failure cases, a few of which were cascading in nature. A series of incremental hacks to the original leveldb to eliminate stalls began in April of 2012.
The success of each hack was judged by two criteria: did a stall occur, and what was the total time to ingest 500 million keys, each with 1024 bytes of data. The target for the latter criterion extended over time to 750 million keys, then 1 billion, and currently 2 billion keys for the Riak 2.0 release. Beginning with the Riak 1.2 release, each hack was known not to stall on the testing platforms and to produce an incremental improvement on the second criterion (increased total throughput).
The Basho leveldb hacks included:
- add a second background thread for writing recovery data, thereby removing disk writes from the user thread
- add a third thread specialized in writing the memory table to level-0 files, bypassing any higher-level compaction blockage that might create stalls
- add a fourth thread specialized in merging level-0 files into level-1, again bypassing higher-level compaction blockage that might create stalls
- create five leveldb Env objects, each with four background threads, since compaction is largely CPU bound between CRC calculation and Snappy compression, i.e. now 20 background compaction/write threads (5 * 4 = 20)
- create tiered locks across the five Env objects to give disk I/O priority to level-0 file creation and level-1 merging respectively
- predict the disk write rate and the amount of compaction backlog so as to throttle each Write operation proportionally, preventing a stall scenario (see the sketch after this list)
- prioritize the backlog queues for level-0 and general compactions to do the most critical compactions first
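As an illustration of the proportional throttle, here is a minimal sketch. The names (WriteDelayMicros, kBacklogTargetBytes) and the exact formula are assumptions for illustration, not the actual Basho implementation, which measures real disk performance at runtime.

```cpp
// Hypothetical sketch of a proportional write throttle. The constant and
// formula are illustrative only.
#include <stdint.h>

// Assumed tuning constant: backlog size at which the delay doubles.
static const uint64_t kBacklogTargetBytes = 256 * 1024 * 1024;

// Microseconds to delay a Write of write_bytes, given the estimated
// compaction backlog and the measured sustainable disk write rate.
uint64_t WriteDelayMicros(uint64_t write_bytes, uint64_t backlog_bytes,
                          uint64_t disk_bytes_per_sec) {
    if (backlog_bytes == 0 || disk_bytes_per_sec == 0)
        return 0;  // compactions are keeping up: no throttle
    // Time the disk needs to absorb this write at its sustainable rate.
    double base_us = 1e6 * static_cast<double>(write_bytes) / disk_bytes_per_sec;
    // Scale the delay by how far compactions have fallen behind, so the
    // penalty grows smoothly instead of stopping the writer outright.
    double pressure = 1.0 + static_cast<double>(backlog_bytes) / kBacklogTargetBytes;
    return static_cast<uint64_t>(base_us * pressure);
}
```

Many small, smooth delays on every write replace the original leveldb behavior of running at full speed until compaction falls so far behind that everything stops.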
Each release starting with 1.2 successfully addressed both criteria for the scenarios known at the time. Yes, new stall scenarios surfaced after 1.2, and later releases addressed those stalls while providing incremental throughput improvements. But the resulting code in util/env_posix.cc and db/db_impl.cc looked pathetic: it worked, but was not a source of pride. Additionally, it became obvious that any contested mutex (regular or read/write) degraded performance, likely because a thread swap gave Erlang a chance to spin wait (this theory is not proven).
The development cycle for Riak 2.0 offered time to replace the incremental hacks with a fully engineered solution. The solution is hot threads. Hot threads require the user thread to use atomic operations to find and start a waiting thread, which avoids mutex contention that could steal the thread's time slice. When all threads are busy, the background task is added to a backlog queue only if that task is not "grooming" related.
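The hand-off looks roughly like the sketch below. It uses std::atomic and std::condition_variable for brevity, although the actual code uses its own atomic wrappers (described later); HotThread, Task, Submit, and the backlog queue are illustrative names, not the real basho/leveldb classes.

```cpp
#include <atomic>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

struct Task {
    virtual void Run() = 0;
    virtual ~Task() {}
};

// One pre-spawned worker. The mutex/condvar pair belongs to this thread
// alone, so the only lock the submitter ever touches is uncontested. The
// worker loop (not shown) waits on cv, runs the task, then sets
// available back to true.
struct HotThread {
    std::atomic<bool> available{true};  // true while waiting for work
    Task* task = nullptr;
    std::mutex m;
    std::condition_variable cv;
};

std::deque<Task*> backlog;  // illustrative; the real queues are prioritized
std::mutex backlog_m;

// Called from the user thread. An atomic compare-and-swap claims exactly
// one waiting thread; no pool-wide mutex is taken, so the submitter
// cannot lose its time slice to lock contention.
bool Submit(std::vector<HotThread*>& pool, Task* t, bool grooming) {
    for (HotThread* h : pool) {
        bool expected = true;
        if (h->available.compare_exchange_strong(expected, false)) {
            std::lock_guard<std::mutex> g(h->m);  // uncontested: h is idle
            h->task = t;
            h->cv.notify_one();
            return true;
        }
    }
    // Every thread is busy: queue required work, drop grooming work.
    if (grooming) return false;
    std::lock_guard<std::mutex> g(backlog_m);
    backlog.push_back(t);
    return true;
}
```

Discarding grooming tasks when the pool is saturated keeps the backlog bounded: only required compactions queue up behind busy threads.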
Hot threads is an uncommon design pattern that works quite well under the Erlang VM (uncommon: we have never seen anyone use this design pattern before, but doubt it is unique / original). Basho's leveldb uses the hot threads pattern to allow:
- simultaneous compaction of multiple levels within a single database (vnode)
- simple characterization of required versus grooming compactions
- dedicated, specialized thread pools that are shared by all databases (vnodes)
- more lenient application of write throttle resulting in faster ingest rates
- removal of the PrioritizeWork() function in db_impl.cc (a Basho-specific routine)
Much of the hot-threads code originated in the basho/eleveldb code base in the January-February 2013 time frame. Source files and individual classes copied from basho/eleveldb were modified to better suit the leveldb usage. It is likely that eleveldb will later shift to using the classes within basho/leveldb to promote code reuse and ease maintenance.
The contents of this file come from basho/eleveldb's c_src/detail.hpp. It provides platform-independent wrappers for the various atomic functions used in the hot-threads implementation. Initially, only the Solaris and gcc (Linux/OSX) environments needed support; the need to support other environments could arise in the future.
The gcc atomics implementation lends itself well to C++ templates covering many data types at once. Solaris's implementation does not, which is why there are many type-specific implementations.
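A sketch of the wrapper approach, assuming a hypothetical add_and_fetch name (see c_src/detail.hpp for the real definitions): gcc's __sync builtins are themselves polymorphic, so one template covers every integer width, while Solaris's <atomic.h> exposes a separate function per width and so forces per-type overloads.

```cpp
#if defined(__GNUC__) && !defined(__sun)
// gcc: the __sync builtins accept any integral type, so a single
// template wrapper serves them all.
template <typename T>
T add_and_fetch(volatile T* ptr, T val) {
    return __sync_add_and_fetch(ptr, val);
}
#elif defined(__sun)
// Solaris: <atomic.h> has one entry point per width, hence one
// overload per supported type.
#include <atomic.h>
#include <stdint.h>
inline uint32_t add_and_fetch(volatile uint32_t* ptr, uint32_t val) {
    return atomic_add_32_nv(ptr, (int32_t)val);
}
inline uint64_t add_and_fetch(volatile uint64_t* ptr, uint64_t val) {
    return atomic_add_64_nv(ptr, (int64_t)val);
}
#endif
```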
There was discussion about using the C++ atomic extension classes for these operations. That approach was rejected due to uncertainty about the level of compiler implementation / support within our customer installations. It is likely a decision worth revisiting in the future.