-
Notifications
You must be signed in to change notification settings - Fork 178
mv throttle 5
- merged to master -
- code complete - July 19, 2015
- development started - June 26, 2015
Basho's write throttle is an ongoing learning and tuning experience. This branch updates changes first released in branch mv-sequential-tunings. The mv-throttle-5 branch corrects a couple of bugs from the mv-sequential-tunings branch as well as improves overall throughput of leveldb under heavy loads.
The write throttle is key modification to leveldb to enhance users' performance experience. The write throttle gradually slows individual user write requests when leveldb's compactions are falling behind. The compactions fall behind when more user write requests accumulate in lower levels without being merged into higher levels. The worst case scenario is when one of two conditions occur that cause Google's original write logic to stall (block all writes) until compactions catch up. Basho's write throttle incrementally delays user write requests to give compactions time to catch up without causing sudden blockage of all writes.
Google divides the "levels" into two types: overlapping and sorted. Google's original leveldb allows .sst table files in level-0 to have overlapping key ranges. Any given key can be present in one or more of the .sst table files in level-0. Levels 1 through 6 are sorted. Each .sst table file in a given sorted level must contain a unique, contiguous key range. A given key can only exist in one possible file within each level. Different revisions of a key can exist once in every level-1 through level-6. Only the key in the lowest level is considered "current".
Basho previously redefined both level-0 and level-1 to be overlapping levels. This change increased compaction performance by gathering a larger byte count of keys in overlapping files before merging them with the first sorted level, now level-2. Basho's nickname for level-2 is the "landing level": all keys land on level-2 in sorted order for the first time.
A previous branch, mv-sequential-tunings, addressed a key shortcoming of Basho's original throttle. The original throttle only activated when work queues accumulated too many compactions across many databases. Thus the original throttle responded only to global loads of all open databases (Riak vnodes). A single database with a focused write load, such as a Riak handoff, would not activate the write throttle. The single database would then experience stalls. The stalls could last long enough to trigger timeout errors in network connections, which in turn failed the Riak handoff process. The mv-sequential-tunings branch added logic to activate the write throttle when a single database accumulates too many waiting compactions.
This branch fixes two bugs in mv-sequential-tunings' logic and changes the compaction selection rules. The selection rule changes result in better throughput in all test work loads. There are also adjustments to the penalty computation that result in smaller worst-case latencies.
The first fix is to count the move of a .sst table file via rename as a compaction. The global write throttle estimates the average amount of backlog by dividing the size of the waiting queue by the number of compactions. There were no move operations prior to the mv-sequential-tunings branch. So the requirement to account for a "move" as a "compaction" was not previously needed, and then missed when "move" was activated.
The second fix is to apply to always use the larger of the global throttle value or the unadjusted throttle value when a database (vnode) knows it is behind in compactions. The mv-sequential-tunings branch would always use the unadjusted throttle which could be significantly smaller than the global throttle during heavy loads. The selection of the wrong throttle value lead to erratic latencies and reduced throughput.
Basho's original compaction selection algorithm grew out of Google's algorithm. Both algorithms tended to give lower levels priority over higher levels. This tendency inflated within Basho's algorithm with the addition of simultaneous compaction support. Under long periods of heavy load, especially load targeting only one or a few databases (vnodes), the middle levels would over accumulate .sst table files. An over accumulation in level-2, the landing level, caused extremely inefficient compaction behavior. It could take the compaction threads hours to correct the inefficient .sst table file positioning within the levels once the write load ended. This branch now blocks .sst table files from compacting upward if the receiving level is larger than its "desired" size. The blocking effectively give the higher levels priority in compaction selection when they accumulate too many files. The new compaction selection algorithm also better supports tiered storage arrays when the "slow" array is considerably slower than the "fast" array and the "fast" array is quickly reaching its storage capacity.
Note: Portions of the logic change in VersionSet::UpdatePenalty() (db/version_set.cc) where checked into the repository without a formal branch. This was due to the timing of a customer facing issue. That direct checkin had code that was messy. Specifically, a large #if 0/#else/#endif block is eliminated in this branch to make code flow more natural. The original, old code was previously within the #if 0/#else section.
"if" block govern by IsTrivialMove() causes a rename operation to quickly move a .sst file from one level's directory to another. This move operation needs to count as "compaction" when the throttle compares work queue backlog to compaction completion rate. The new call to SetThrottleWriteRate() causes the compaction count to increase without changing the accumulation of time spent compacting and number of keys compacted, i.e. first two call parameters are zero.
The SetThrottleWriteRate() call was adjusted. The fourth parameter is eliminated with this branch. The previous tracking of work queue size was done incorrectly. The work queue size should be a snapshot per tracking period, not an accumulation of size seen at the time of each compaction. Snapshot code change is discussed as part of throttle.cc changes.
CheckAvailableCompactions() is a new Basho specific API call. This API call allows a completing compaction thread to suggest that other databases recheck if they have an eligible grooming compaction. Grooming compactions do NOT accumulate on the thread work queues. So any given database might have previously had a grooming compaction rejected. Previously there was no "retry" for a grooming compaction. This API call creates a clean means of creating that "retry".
DB::CheckAvailableCompactions() provides a default implementation for base class. DBImpl::CheckAvailableCompactions() provides the live implementation used in leveldb.
Add CheckAvailableCompactions() declaration for implementation class.
Correct a test's kMaxFiles constant to account for Basho's leveldb having two overlapped levels instead of just one.
Correct the Basho specific performance counter to accumulate number of files searched on a level, not how many times a level gets searched.
This routine embodies the compaction selection algorithm. The key adjustment is to have each decision also verify that the size of the destination level, i.e. "level+1", is currently within size tolerance. This branch uses m_DesiredBytesForLevel for the decision. m_MaxBytesForLevel seems more intuitive, but once any level gets above that value a write throttle penalty occurs and the performance goes down. (m_DesireBytesForLevel + m_MaxBytesForLevel)/2 might be a better choice, but requires future testing to verify. That test would allow grooming compactions to potential clear the destination level naturally instead of via a blockage of the lower level.
This routine previously had two distinct blocks: block for overlapping levels and block for sorted levels. This tuning now uses the same calculations of penalty for both types of levels. Only the parameters for calculating the penalty differ.
The parameters used became progressively smaller during the testing of this branch. There might now be an even simpler way to code the penalty, but not worthy of the effort at this time.
The overlapping levels continue to set penalty parameters based upon the count of files in the level, not byte count of the level. The file count compares kL0_SlowdowWritesTrigger and kL0_CompactionTrigger. This branch significantly changed the penalty parameters for level-1. The parameters started high and gradually settled on lower values due to testing. It makes no change to level-0.
The sorted levels use the same penalty parameters for all level that exceeds m_MaxBytesForLevel. This has not changed. The branch changes the parameter calculations for level-2 (the landing level) if its total byte count falls between m_DesiredBytesForLevel and m_MaxBytesForLevel. The parameters create a sliding scale from light to slightly heavy penalty respectively.