
mv hot threads


Status

  • merged to master November 15, 2013
  • code complete
  • development started October 4, 2013

History / Context

The initial branch, mv-hot-threads1, contains the second of two fixes first released in the 1.4.2-turner branch. A second branch, mv-hot-threads2, contains the full implementation discussed here. Both branches have been merged to master.

Google's original leveldb contained one thread for background compactions and performed writes to the recovery log from the user's thread. Google's leveldb stalls the user's thread during a write operation whenever the background compactions cannot keep up with the ingest rate. Riak runs multiple leveldb databases (vnodes). This compounded the Google stalls such that a single stall could hold multiple user write threads for minutes. Riak has several 60 second timeouts that triggered failure cases due to the stalls. A few of the failure cases were cascading in nature. A series of incremental hacks to the original leveldb began in April of 2012.

The success of each hack was judged on two criteria: did a stall occur, and what was the total time to ingest 500 million keys with 1024 bytes of data each. The latter criterion extended over time to 750 million, then 1 billion, and currently 2 billion keys for the Riak 2.0 release. At the time of each release beginning with Riak 1.2, each hack was known not to stall on the testing platforms and to produce an incremental improvement on the second criterion.

The Basho leveldb hacks included:

  • add a second background thread for writing recovery data, thereby not writing to disk on user thread
  • add third thread to specialize in writing memory to level-0 files to shortcut higher level compaction blockage that might create stalls
  • add fourth thread to specialize in merging level-0 files to level-1 to again shortcut higher level compaction blockage that might create stalls
  • create five leveldb Env objects, each with four background threads, since compaction is largely CPU bound between CRC calculation and Snappy compression, i.e. now 20 background compaction/write threads (5 * 4 = 20)
  • create tiered locks across the five Env objects to give disk I/O priority to level-0 file creation and level-1 merging respectively
  • predict disk write rate and amount of compaction backlog so as to throttle each Write operation proportionally without causing a stall (see the sketch after this list)
  • prioritize the work queues for level-0 and general compactions to do the most critical compactions first
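
As a rough illustration of the proportional write throttle, the sketch below delays each write in proportion to a measured disk rate and the compaction backlog. The function name, inputs, and formula are assumptions for illustration only; this is not the actual Basho leveldb calculation.

```cpp
// Hedged sketch of a proportional write throttle. Inputs assumed:
// a recently measured disk write rate and a count of backlogged
// compactions.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <thread>

// Delay each Write() a little as the backlog grows, instead of
// letting compactions fall so far behind that writes must stall.
void ThrottleWrite(uint64_t bytes_this_write,
                   double disk_bytes_per_usec,  // measured write rate
                   unsigned backlog_compactions) {
  if (backlog_compactions == 0 || disk_bytes_per_usec <= 0.0)
    return;  // no compaction pressure, no delay
  // Time the disk would need to absorb this write, scaled by backlog
  // (capped so a deep backlog slows writes but never stops them).
  double usec = (bytes_this_write / disk_bytes_per_usec)
                * std::min<unsigned>(backlog_compactions, 10u);
  std::this_thread::sleep_for(
      std::chrono::microseconds(static_cast<int64_t>(usec)));
}
```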

Each release starting with 1.2 successfully addressed both criteria based upon the scenarios known at the time. Yes, new stall scenarios occurred after 1.2, and later releases addressed those stalls while providing incremental throughput improvements. … and the resulting code in util/env_posix.cc and db/db_impl.cc looked pathetic … it worked, but was not a source of pride. Additionally, it became obvious that any contested mutex (regular or read/write) caused degraded performance (likely from a thread swap giving Erlang a chance to spin wait … this theory is not proven).

The development cycle for Riak 2.0 offered time to replace the incremental hacks with a fully engineered solution. The solution is hot threads. Hot threads require the user thread to use atomic operations to find and start a waiting thread. This avoids mutex contention that could steal a thread's time slice. When all threads are busy, a background task is added to a backlog queue only if the task is not "grooming" related.
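
A minimal sketch of that hand-off follows, assuming C++20 for std::binary_semaphore. All identifiers (HotThreadPool, Worker, Submit) are illustrative, not the names used in Basho's leveldb, and shutdown handling is omitted for brevity.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <semaphore>
#include <thread>
#include <vector>

class HotThreadPool {
  struct Worker {
    std::atomic<bool> available{true};  // claimed by CAS, never a mutex
    std::binary_semaphore wake{0};      // submitter releases to start work
    std::function<void()> task;
  };

 public:
  explicit HotThreadPool(size_t count) {
    for (size_t i = 0; i < count; ++i) {
      workers_.emplace_back(std::make_unique<Worker>());
      // Detached for brevity; a real pool would manage shutdown.
      std::thread(&HotThreadPool::Run, this, workers_.back().get())
          .detach();
    }
  }

  // The user thread finds and starts a waiting thread with atomic
  // operations only; a mutex is touched solely on the all-busy path.
  bool Submit(std::function<void()> task, bool grooming) {
    for (auto& w : workers_) {
      bool idle = true;
      if (w->available.compare_exchange_strong(idle, false)) {
        w->task = std::move(task);
        w->wake.release();  // start the waiting thread
        return true;
      }
    }
    if (grooming) return false;  // grooming work is skipped, not queued
    std::lock_guard<std::mutex> lock(backlog_mutex_);
    backlog_.push_back(std::move(task));  // required work is backlogged
    return true;
  }

 private:
  void Run(Worker* w) {
    for (;;) {
      w->wake.acquire();  // sleep until a submitter claims this thread
      w->task();
      // Drain required backlog before advertising availability again.
      for (;;) {
        std::function<void()> next;
        {
          std::lock_guard<std::mutex> lock(backlog_mutex_);
          if (backlog_.empty()) break;
          next = std::move(backlog_.front());
          backlog_.pop_front();
        }
        next();
      }
      w->available.store(true);
    }
  }

  std::vector<std::unique_ptr<Worker>> workers_;
  std::deque<std::function<void()>> backlog_;
  std::mutex backlog_mutex_;
};
```

The fast path touches only per-worker atomics and a semaphore release, so a submitting thread never contends on a pool-wide mutex unless every worker is busy.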

Hot threads is an uncommon design pattern that worked quite well under the Erlang VM (uncommon in the sense that we have never seen anyone else use this design pattern, though we doubt it is unique or original). Basho's leveldb uses the hot threads pattern to allow:

  • simultaneous compaction of multiple levels within a single database (vnode)
  • simple characterization of required versus grooming compactions
  • dedicated, specialized thread pools that are shared by all databases (vnodes) (see the sketch after this list)
  • more lenient application of write throttle to speed ingest rate
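
The sketch below reuses the hypothetical HotThreadPool class from the sketch above to show specialized pools shared across every open database. The pool names, sizes, and routing are assumptions for illustration, not the actual Basho leveldb configuration.

```cpp
// (Continues the HotThreadPool sketch above; same includes assumed.)
// Global pools shared by every open database (vnode); a busy vnode
// can use idle threads another vnode is not using. Sizes are
// arbitrary for this example.
HotThreadPool g_recovery_log_pool(2);  // recovery-log writes
HotThreadPool g_level0_pool(2);        // memtable -> level-0 file writes
HotThreadPool g_compaction_pool(4);    // level-1 and higher merges

// Route work by its character: level-0 work has its own pool so
// general compactions can never block it.
void ScheduleCompaction(int level, std::function<void()> work,
                        bool grooming) {
  if (level == 0)
    g_level0_pool.Submit(std::move(work), grooming);
  else
    g_compaction_pool.Submit(std::move(work), grooming);
}
```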

Branch description

mv-hot-threads1 carries the second of the two fixes first released in the 1.4.2-turner branch. mv-hot-threads2 carries the full hot threads implementation described above. Both branches have been merged to master.
