mv hot threads
- merged to master November 15, 2013
- code complete November 6, 2013
- development started October 4, 2013
The initial branch, mv-hot-threads1, contains the second of two fixes first released in the 1.4.2-turner branch. A second branch, mv-hot-threads2, contains the full implementation discussed here. Both branches have been merged to master.
Google's original leveldb contained one thread for background compactions and performed writes to the recovery log from the user's thread. Google's leveldb stalls the user's thread during a write operation whenever the background compactions cannot keep up with the ingest rate. Riak runs multiple leveldb databases (vnodes), which compounded the Google stalls such that a single stall could hold multiple user write threads for minutes. Riak has several 60-second timeouts that the stalls triggered, producing failure cases, a few of which were cascading in nature. A series of incremental hacks to the original leveldb began in April of 2012.
The success of each hack was judged by two criteria: did a stall occur, and what was the total time to ingest 500 million keys carrying 1024 bytes of data each. The target for the latter criterion grew over time to 750 million keys, then 1 billion, and currently 2 billion keys for the Riak 2.0 release. Beginning with the Riak 1.2 release, each hack was known to not stall on the testing platforms and to produce incremental improvement on the second criterion (increased total throughput).
The Basho leveldb hacks included:
- add a second background thread for writing recovery data, thereby keeping disk writes off the user's thread
- add a third thread to specialize in writing memory to level-0 files, to shortcut higher level compaction blockage that might create stalls
- add a fourth thread to specialize in merging level-0 files to level-1, again to shortcut higher level compaction blockage that might create stalls
- create five leveldb Env objects, each with four background threads, since compaction is largely CPU bound between CRC calculation and Snappy compression, i.e. 20 background compaction/write threads in total (5 * 4 = 20)
- create tiered locks across the five Env objects to give disk I/O priority to level-0 file creation and level-1 merging respectively
- predict the disk write rate and the amount of compaction backlog so as to throttle each Write operation proportionally without causing a stall (a sketch of the idea follows this list)
- prioritize the work queues for level-0 and general compactions to do the most critical compactions first
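The throttle mentioned above replaces Google's hard stall with many small, proportional delays on the write path. Below is a minimal sketch of that idea only; the function names, inputs, and constants are assumptions chosen for illustration, not Basho's actual throttle, which tracks considerably more state:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical proportional write throttle: estimate how far compaction
// has fallen behind and delay each write slightly, instead of stalling
// writes outright once a hard threshold is crossed.
uint64_t ThrottleMicros(uint64_t backlog_bytes,         // pending compaction bytes
                        uint64_t disk_bytes_per_sec) {  // measured disk write rate
  if (backlog_bytes == 0 || disk_bytes_per_sec == 0) return 0;
  // Seconds needed to clear the backlog at the current disk rate.
  double backlog_secs = static_cast<double>(backlog_bytes) /
                        static_cast<double>(disk_bytes_per_sec);
  // Scale the per-write delay with the backlog, capped well below any
  // request timeout so no single write ever resembles a stall.
  double delay_us = std::min(backlog_secs * 10.0, 1000.0);
  return static_cast<uint64_t>(delay_us);
}

void ThrottledWrite(uint64_t backlog_bytes, uint64_t disk_bytes_per_sec) {
  uint64_t us = ThrottleMicros(backlog_bytes, disk_bytes_per_sec);
  if (us > 0)
    std::this_thread::sleep_for(std::chrono::microseconds(us));
  // ... perform the actual recovery-log and memtable write here ...
}
```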
Each release starting with 1.2 successfully addressed both criteria for the scenarios known at the time. Yes, new stall scenarios appeared after 1.2, and later releases addressed those stalls while providing incremental throughput improvements. But the resulting code in util/env_posix.cc and db/db_impl.cc looked pathetic: it worked, yet it was not a source of pride. Additionally, it became obvious that any contested mutex (regular or read/write) caused degraded performance, likely because the thread swap gave Erlang a chance to spin wait (this theory is not proven).
The development cycle for Riak 2.0 offered time to replace the incremental hacks with a fully engineered solution: hot threads. With hot threads, the user thread uses atomic operations to find and start a waiting thread, avoiding the mutex contention that could steal the thread's time slice. When all threads are busy, the background task is added to a backlog queue only if it is not "grooming" related.
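Below is a minimal sketch of the hot threads pattern as just described. It is not Basho's implementation; every class and member name here is hypothetical. The point it illustrates: the user thread claims an idle worker with a single compare-exchange and wakes that worker's private condition variable, so no shared mutex is contested on the fast path, and a grooming task is dropped rather than queued when every worker is busy (shutdown handling is kept deliberately simple):

```cpp
#include <atomic>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// One worker slot. "available" is the only state a submitting thread
// inspects, and it does so with an atomic compare-exchange rather than
// by taking any shared lock.
struct HotThread {
  std::atomic<bool> available{false};
  std::function<void()> task;      // written only while the slot is claimed
  std::mutex m;                    // private to this worker
  std::condition_variable cv;      // private to this worker
  bool signaled = false;
  std::thread thread;
};

class HotThreadPool {
 public:
  explicit HotThreadPool(size_t n) : workers_(n) {
    for (auto& w : workers_) {
      w.thread = std::thread([&w, this] { WorkerLoop(w); });
      w.available.store(true);
    }
  }

  ~HotThreadPool() {
    stop_.store(true);
    for (auto& w : workers_) {
      { std::lock_guard<std::mutex> lock(w.m); w.signaled = true; }
      w.cv.notify_one();
      w.thread.join();
    }
  }

  // Called from the user thread. Returns false only when the task was
  // grooming work and every worker was busy (the task is simply dropped).
  bool Submit(std::function<void()> task, bool grooming) {
    for (auto& w : workers_) {
      bool expected = true;
      // Claim an idle worker lock-free; on failure just try the next slot.
      if (w.available.compare_exchange_strong(expected, false)) {
        w.task = std::move(task);
        { std::lock_guard<std::mutex> lock(w.m); w.signaled = true; }
        w.cv.notify_one();  // wake exactly this worker
        return true;
      }
    }
    // All workers busy: queue only required (non-grooming) work.
    if (!grooming) {
      std::lock_guard<std::mutex> lock(backlog_mutex_);
      backlog_.push_back(std::move(task));
      return true;
    }
    return false;
  }

 private:
  void WorkerLoop(HotThread& w) {
    for (;;) {
      {
        std::unique_lock<std::mutex> lock(w.m);
        w.cv.wait(lock, [&w] { return w.signaled; });
        w.signaled = false;
      }
      if (stop_.load()) return;
      w.task();
      w.task = nullptr;
      // Drain any required work that queued up while all workers were busy.
      for (;;) {
        std::function<void()> next;
        {
          std::lock_guard<std::mutex> lock(backlog_mutex_);
          if (backlog_.empty()) break;
          next = std::move(backlog_.front());
          backlog_.pop_front();
        }
        next();
      }
      w.available.store(true);
    }
  }

  std::vector<HotThread> workers_;
  std::atomic<bool> stop_{false};
  std::mutex backlog_mutex_;
  std::deque<std::function<void()>> backlog_;
};
```

Each worker owns its own mutex and condition variable, so the fast path from user thread to running worker is one atomic compare-exchange plus one private wakeup; the only shared lock, backlog_mutex_, is taken when every worker is already busy.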
Hot threads is an uncommon design pattern that works quite well under the Erlang VM (uncommon in that we have never seen anyone else use it, though we doubt it is unique or original). Basho's leveldb uses the hot threads pattern to allow:
- simultaneous compaction of multiple levels within a single database (vnode)
- simple characterization of required versus grooming compactions (a hypothetical sketch follows this list)
- dedicated, specialized thread pools that are shared by all databases (vnodes)
- more lenient application of write throttle resulting in faster ingest rates
- removal of the PrioritizeWork() function in db_impl.cc (a Basho-specific routine)
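As a footnote to the required-versus-grooming bullet above, a hypothetical classification might look like the following. The field names and the level-0 trigger value are assumptions, not Basho's actual rules; the intent is only to show that "required" means "ignoring this compaction will eventually block writes":

```cpp
#include <cstddef>

// Hypothetical compaction classification. A compaction is "required" when
// skipping it would eventually stall the write path; everything else is
// "grooming" and may be dropped when all hot threads are busy.
struct CompactionRequest {
  int level;                      // level being compacted
  std::size_t level0_file_count;  // current number of level-0 files
  bool over_size_limit;           // level exceeds its byte limit
};

bool IsRequired(const CompactionRequest& req) {
  // Level-0 buildup throttles (and ultimately blocks) writes, so a
  // backed-up level 0 must be serviced. The trigger of 4 files mirrors
  // stock leveldb's compaction trigger, but is an assumption here.
  if (req.level == 0 && req.level0_file_count >= 4) return true;
  // A level over its size limit forces required merges downward.
  if (req.over_size_limit) return true;
  return false;  // grooming: purely opportunistic tidying
}
```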