Skip to content

mv throttle 5

Matthew Von-Maszewski edited this page Jun 29, 2015 · 20 revisions

Status

  • merged to master -
  • code complete - June 29, 2015
  • development started - June 26, 2015

History / Context

The write throttle in Basho's edits to leveldb is an ongoing learning and tuning experience. This branch adds subtle tunings to a larger change create in branch mv-sequential-tunings.

The write throttle gradually slows user write requests when leveldb's compactions are falling behind. The compactions fall behind when more user write requests accumulate in lower levels without being merged into higher levels. The worst case scenario is when one of two conditions occur that cause Google's original write logic to stall (block all writes) until compactions catch up. Basho's write throttle incrementally delays user write requests to give compactions time to catch up without causing sudden blockage of all writes.

Google divides the "levels" into two types: overlapping and sorted. Google's original leveldb allows .sst table files in level-0 to have overlapping key ranges. Any given key can be present in one or more of the .sst table files in level-0. Levels 1 through 6 are sorted. Each .sst table file in a given level must contain a unique, contiguous key range. A given key can only exist in one possible file within each level. A key can exist once in every level-1 through level-6. Only the key in the lowest level is considered "current".

Basho previously redefined both level-0 and level-1 to be overlapping levels. This change increased compaction performance by gathering a larger byte count of keys in overlapping files before merging them with the first sorted level, now level-2. Basho's nickname for level-2 is the "landing level": all keys land on level-2 in sorted order for the first time.

This branch tunes how Basho's leveldb accounts for compaction backlog via write throttle penalties. The write throttle, as applied to each user's write request, has two components: estimated write speed per compaction key and number of compactions backlogged. util/throttle.cc provides the current write speed estimate as smoothed over the last hour. db/version_set.cc estimates each vnode's (leveldb database's) compaction backlog via penalty values assigned in VersionSet::UpdatePenalties(). This branch modifies VersionSet::UpdatePenalties().

This branch's performance optimization is simple: increase the penalty values for excess accumulation of files in level-1 and level-2. The penalty increase slows incoming user write requests, giving the background compaction routines more time to process the excess accumulations. There is a measurable, longterm performance improvement when the compactions operations from level-1 to level-2 (landing level) deal with smaller accumulations of files. Overall write throughput increases between 11% and 29% on test loads. There is some increase in worst case latency, but not in average or 95% latency ranges.

Branch Description

Portions of the logic change in VersionSet::UpdatePenalty() (db/version_set.cc) where checked into the repository without a formal branch. This was due to the timing of a customer facing issue. That original code was messy. This branch cleans up the code and tunes it. Specifically, a large #if 0/#else/#endif block is eliminated in this branch to make code flow more natural. The original, old code is within the #if 0/#else section.

db/version_set.cc, routine UpdatePenalty()

This routine previously had two distinct blocks: block for overlapping levels and block for sorted levels. This tuning now uses the same calculations of penalty for both types of levels. Only the parameters for calculating the penalty differ.

The overlapping levels continue to set penalty parameters based upon the count of files in the level, not byte count of the level. The file count compares kL0_SlowdowWritesTrigger and kL0_CompactionTrigger. This branch significantly increased penalty parameters for level-1. It makes no change to level-0.

The sorted levels use the same penalty parameters for all level that exceeds m_MaxBytesForLevel. This has not changed. The branch changes the parameter calculations for level-2 (the landing level) if its total byte count falls between m_DesiredBytesForLevel and m_MaxBytesForLevel. The parameters create a sliding scale from light to heavy penalty respectively.

Clone this wiki locally