-
Notifications
You must be signed in to change notification settings - Fork 178
mv throttle 5
- merged to master -
- code complete - July 19, 2015
- development started - June 26, 2015
Basho's write throttle is an ongoing learning and tuning experience. This branch updates changes first released in branch mv-sequential-tunings. The mv-throttle-5 branch corrects a couple of bugs from the mv-sequential-tunings branch as well as improves overall throughput of leveldb under heavy loads.
The write throttle is key modification to leveldb to enhance users' performance experience. The write throttle gradually slows individual user write requests when leveldb's compactions are falling behind. The compactions fall behind when more user write requests accumulate in lower levels without being merged into higher levels. The worst case scenario is when one of two conditions occur that cause Google's original write logic to stall (block all writes) until compactions catch up. Basho's write throttle incrementally delays user write requests to give compactions time to catch up without causing sudden blockage of all writes.
Google divides the "levels" into two types: overlapping and sorted. Google's original leveldb allows .sst table files in level-0 to have overlapping key ranges. Any given key can be present in one or more of the .sst table files in level-0. Levels 1 through 6 are sorted. Each .sst table file in a given sorted level must contain a unique, contiguous key range. A given key can only exist in one possible file within each level. Different revisions of a key can exist once in every level-1 through level-6. Only the key in the lowest level is considered "current".
Basho previously redefined both level-0 and level-1 to be overlapping levels. This change increased compaction performance by gathering a larger byte count of keys in overlapping files before merging them with the first sorted level, now level-2. Basho's nickname for level-2 is the "landing level": all keys land on level-2 in sorted order for the first time.
The previous branch, mv-sequential-tunings, addressed a key shortcoming of the original throttle. The original throttle only activated when work queues accumulated too many waiting compactions. Thus the original throttle responded only to global loads of all open databases (Riak vnodes). The mv-sequential-tunings added logic to activate the write throttle when a single database accumulates too many waiting compactions.
This branch tunes how Basho's leveldb accounts for compaction backlog via write throttle penalties. The write throttle, as applied to each user's write request, has two components: estimated write speed per compaction key and number of compactions backlogged. util/throttle.cc provides the current write speed estimate as smoothed over the last hour. db/version_set.cc estimates each vnode's (leveldb database's) compaction backlog via penalty values assigned in VersionSet::UpdatePenalties(). This branch modifies VersionSet::UpdatePenalties().
This branch's performance optimization is simple: increase the penalty values for excess accumulation of files in level-1 and level-2. The penalty increase slows incoming user write requests, giving the background compaction routines more time to process the excess accumulations. There is a measurable, longterm performance improvement when the compactions operations from level-1 to level-2 (landing level) deal with smaller accumulations of files. Overall write throughput increases between 11% and 29% on test loads. There is some increase in worst case latency, but not in average or 95% latency ranges.
Portions of the logic change in VersionSet::UpdatePenalty() (db/version_set.cc) where checked into the repository without a formal branch. This was due to the timing of a customer facing issue. That original code was messy. This branch cleans up the code and tunes it. Specifically, a large #if 0/#else/#endif block is eliminated in this branch to make code flow more natural. The original, old code is within the #if 0/#else section.
This routine previously had two distinct blocks: block for overlapping levels and block for sorted levels. This tuning now uses the same calculations of penalty for both types of levels. Only the parameters for calculating the penalty differ.
The overlapping levels continue to set penalty parameters based upon the count of files in the level, not byte count of the level. The file count compares kL0_SlowdowWritesTrigger and kL0_CompactionTrigger. This branch significantly increased penalty parameters for level-1. It makes no change to level-0.
The sorted levels use the same penalty parameters for all level that exceeds m_MaxBytesForLevel. This has not changed. The branch changes the parameter calculations for level-2 (the landing level) if its total byte count falls between m_DesiredBytesForLevel and m_MaxBytesForLevel. The parameters create a sliding scale from light to heavy penalty respectively.