Mv bloom size limit
- merged to "master" June 17, 2013
- developer tested code checked-in June 11, 2013
- development started June 5, 2013
2i_slf_stressmix is a common basho_bench test scenario used to verify features in Riak. It has always produced some timeout errors: forty to fifty in a 5-hour test. A recent discovery was that the errors increase to the thousands when the test runs on a machine with a FusionIO card instead of SATA / SCSI drives.
Bloom filters are stored as part of every .sst file's metadata. They are typically small: less than 20K on a 300Mbyte file. While reviewing the 2i_slf_stressmix timeouts, it became obvious that files containing mostly 2i data have bloom filters of 5 to 10Mbyte. A filter that large makes a random "Get" operation very slow because of the amount of metadata (the bloom filter) that must be read and run through a CRC32 calculation.
Google does not document the intent behind several sections of their leveldb code. One example is the grandparent code, which would create many little files for no stated reason. Each small file takes up a full file handle as accounted by max_open_files, which seemed inefficient. What was missed is that this code is critical to reducing the size of the next higher level compaction.
Function Compaction::ShouldStopBefore() was modified to take a key count as a parameter. The key count becomes a second test of whether a newly created .sst file should be terminated before reaching its size limit. The first test, the grandparent test, had been disabled in a previous branch and is now reinstated.
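The two tests can be sketched in a simplified, self-contained form. This is not the branch's code: `CompactionSketch`, `GrandparentFile`, plain `std::string` keys, and the `>=` comparison on the key count are assumptions made for illustration. The grandparent portion follows the general shape of stock leveldb, which accumulates the bytes of grandparent files the output has passed over.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a grandparent-level file (level N+2).
struct GrandparentFile {
  std::string largest_key;  // largest key stored in the file
  uint64_t file_size;       // size of the file in bytes
};

// Simplified illustration of the two "close the output file early" tests.
class CompactionSketch {
 public:
  CompactionSketch(std::vector<GrandparentFile> grandparents,
                   uint64_t max_overlap_bytes, size_t max_keys)
      : grandparents_(std::move(grandparents)),
        max_overlap_bytes_(max_overlap_bytes),
        max_keys_(max_keys) {
    assert(max_keys_ > 0);
  }

  // Returns true if the current output file should be closed before `key`
  // is added. `key_count` is the number of keys already in that file.
  bool ShouldStopBefore(const std::string& key, size_t key_count) {
    // Test 1 (grandparent): add up the grandparent files this output file
    // has moved past; too much overlap means an expensive later compaction.
    while (index_ < grandparents_.size() &&
           key > grandparents_[index_].largest_key) {
      if (seen_key_) overlapped_bytes_ += grandparents_[index_].file_size;
      ++index_;
    }
    seen_key_ = true;
    if (overlapped_bytes_ > max_overlap_bytes_) {
      overlapped_bytes_ = 0;  // the next output file starts a fresh tally
      return true;
    }
    // Test 2 (key count): cap the keys per file so the bloom filter stays
    // small. The exact comparison used by the branch is an assumption here.
    return key_count >= max_keys_;
  }

 private:
  std::vector<GrandparentFile> grandparents_;
  uint64_t max_overlap_bytes_;
  size_t max_keys_;
  size_t index_ = 0;
  uint64_t overlapped_bytes_ = 0;
  bool seen_key_ = false;
};
```

In this sketch the caller is expected to reset its own key counter whenever it starts a new output file.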
The key test is hard-coded to 75,000 keys. This number comes from back-calculating the number of keys that will fit within a 100K bloom filter, 100K being an estimated maximum for a reasonable metadata load during a random Get that forces the read of a closed file. (Yes, there are now plans to better optimize how the code opens an .sst table.)
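The back-calculation can be reproduced with simple arithmetic. The helper names below are hypothetical, and treating 100K as 102,400 bytes is an assumption. Under that reading, 75,000 keys in a 100K filter works out to roughly 10.9 bits of filter per key.

```cpp
#include <cassert>
#include <cstdint>

// Bits of bloom filter available per key when `key_count` keys share a
// filter of `filter_bytes` bytes.
double BitsPerKey(uint64_t filter_bytes, uint64_t key_count) {
  assert(key_count > 0);
  return static_cast<double>(filter_bytes) * 8.0 /
         static_cast<double>(key_count);
}

// Largest key count that fits in a filter budget at a given bits-per-key.
// The bits-per-key value is supplied by the caller; it is not taken from
// the branch.
uint64_t MaxKeysForFilter(uint64_t filter_bytes, double bits_per_key) {
  assert(bits_per_key > 0.0);
  return static_cast<uint64_t>(static_cast<double>(filter_bytes) * 8.0 /
                               bits_per_key);
}
```

For example, `BitsPerKey(100 * 1024, 75000)` falls between 10.9 and 11.0, and `MaxKeysForFilter(100 * 1024, 10.0)` gives 81,920 keys.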
The grandparent test was disabled in hopes of better "packing" the file cache. Small files created by the grandparent test caused the file cache, limited by max_open_files, to be poorly utilized, i.e. thrashing occurred. The side effect of creating longer compactions at the next level was not understood. Another recent branch completely changed how the accounting for the file cache works. The file cache now counts bytes allocated by file objects, not the number of file objects. This allows small files to work as efficiently as larger files. Therefore the reason to disable the grandparent test is eliminated, and the side effect of better higher level compactions is restored.
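The difference between counting file objects and counting bytes can be illustrated with a small cache that charges each entry by its size. This is a sketch only: `ByteBudgetCache` is a hypothetical name, and least-recently-used eviction is an assumption, not a description of the actual file cache.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache whose capacity is a byte budget, not an entry count.
class ByteBudgetCache {
 public:
  explicit ByteBudgetCache(size_t capacity_bytes) : capacity_(capacity_bytes) {
    assert(capacity_bytes > 0);
  }

  // Inserts (or refreshes) an entry charged at `bytes`, then evicts the
  // least recently inserted entries until the budget is respected.
  void Insert(const std::string& name, size_t bytes) {
    Erase(name);
    lru_.emplace_front(name, bytes);
    index_[name] = lru_.begin();
    used_ += bytes;
    while (used_ > capacity_ && lru_.size() > 1) {
      const std::string victim = lru_.back().first;  // copy before erasing
      Erase(victim);
    }
  }

  bool Contains(const std::string& name) const {
    return index_.count(name) != 0;
  }
  size_t used_bytes() const { return used_; }
  size_t size() const { return lru_.size(); }

 private:
  using Entry = std::pair<std::string, size_t>;  // (file name, bytes charged)

  void Erase(const std::string& name) {
    auto it = index_.find(name);
    if (it == index_.end()) return;
    used_ -= it->second->second;
    lru_.erase(it->second);
    index_.erase(it);
  }

  size_t capacity_;
  size_t used_ = 0;
  std::list<Entry> lru_;  // front = most recent, back = eviction candidate
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

With a byte budget, several small files can occupy the space one large file would have taken, so small files no longer crowd out the cache the way they do when each one consumes a whole slot.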
The problem is simply stated: incoming write operations arrive faster than compactions complete, which causes the Google code in DBImpl::MakeRoomForWrite() to block. The block can stall new operations for tens of seconds. basho_bench is tuned to "timeout" at 60 seconds. Without this branch, the timeouts occur often in the 2i_slf_stressmix test.
The 2i_slf_stressmix test provides heavy database iterations, interlaced with new data Write operations. The combined load did not fit the existing write throttle model.
Several strategies were coded, and some are part of earlier commits to this branch. Prior to this branch, the write throttle combined actual write throughput timings with estimates of future work. The future work was partially estimated in the VersionSet::Finalize() routine. This routine provided work estimates, called "penalty" factors, for the write throttle calculations. Several tunings of the penalties started to work for 2i, but reduced the throughput of other tests by 10% or more.
The final solution was to quit "estimating work" as compactions began to approach Google's stall point. The code in VersionSet::Finalize() switches to a harsh logarithmic increase in penalty values. This quickly elevates the write throttle, but does not "stall" operations. The overall throughput is now maintained over time with quick slowing when work backs up unpredictably ... again, without stalling.
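The actual penalty formula is not given here, so the following is only a guess at the general shape. It assumes the penalty tracks the backlog directly while there is headroom, then doubles for every step past a soft limit (a straight line on a log scale), up to a cap. `PenaltySketch`, `soft_limit`, and `max_penalty` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical penalty curve: gentle while the backlog is small, then
// doubling per step once the backlog passes `soft_limit`, capped at
// `max_penalty`. Not the formula used in VersionSet::Finalize().
uint64_t PenaltySketch(int backlog, int soft_limit, uint64_t max_penalty) {
  assert(backlog >= 0 && soft_limit >= 0);
  assert(max_penalty > 0 && max_penalty <= UINT64_MAX / 2);
  if (backlog <= soft_limit) {
    const uint64_t linear = static_cast<uint64_t>(backlog);
    return linear < max_penalty ? linear : max_penalty;
  }
  uint64_t penalty = static_cast<uint64_t>(soft_limit > 0 ? soft_limit : 1);
  // Double once for each unit of backlog past the soft limit.
  for (int i = soft_limit; i < backlog && penalty < max_penalty; ++i) {
    penalty *= 2;
  }
  return penalty < max_penalty ? penalty : max_penalty;
}
```

A throttle driven by a curve like this slows writers sharply as work backs up but always returns a finite value, which is the distinction the text draws between throttling and stalling.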