mv tuning4
- merged to develop
- code complete January 16, 2014
- development started January 6, 2014
This page discusses the tunings and fixes made to both basho/leveldb and basho/eleveldb as part of the final Riak 2.0 release preparation.
basho/eleveldb is the Erlang-to-leveldb API. Two bugs were found in it during integration testing. The first related to the asynchronous iterator's prefetch logic. The logic assumed that prefetch was a one-pass operation: pick a starting point, execute prefetch next repeatedly, and close the iterator. The Riak active anti-entropy (AAE) code uses the same iterator with multiple, sequential starting points, which did not work correctly. The second bug related to leveldb::ReadOptions pointers being passed within the iterator logic. Ownership of the allocated ReadOptions was confusing, and the allocated memory was released while the data was still in use.
basho/leveldb changes in the branch are all performance tunings; no known bugs are addressed. Riak 2.0's leveldb contains the hot-threads branch, which replaces all the horrible hacks to threads and locks with real code. Hot-threads worked great on fast disk arrays and required no changes to the write throttle code. Then we started testing on slow arrays: hot-threads was too fast, and the write throttle was completely inappropriate (lots of sawtooth throughput graphs). There are still sawtooth patterns under some heavy loads, but the teeth are smaller and overall throughput is greater.
leveldb::ReadOptions was previously allocated on the heap, then its pointer passed around. async_get and async_iterator used leveldb::ReadOptions this way. This branch changes ReadOptions to be created on a routine's stack, or embedded within the objects using it: no more passing pointers. leveldb::ReadOptions is declared in leveldb/include/leveldb/options.h. It uses the C++ default copy constructor. Its snapshot pointer is shallow-copied and therefore unsafe in general, but it is never used beyond the scope where it is initialized. A more robust future change would be a copy constructor that does not copy the snapshot pointer.
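A minimal sketch of the before/after, using stand-in types (the ReadOptions fields shown and the ItrObject class name are illustrative, not copied from the branch):

```cpp
#include <cassert>

// Stand-in for leveldb::ReadOptions (the real struct lives in
// leveldb/include/leveldb/options.h and relies on the default copy ctor).
struct Snapshot {};
struct ReadOptions {
    bool verify_checksums = false;
    bool fill_cache = true;
    const Snapshot* snapshot = nullptr;  // raw pointer: copied shallowly
};

// Before: the iterator held a heap-allocated ReadOptions* with unclear
// ownership. After this branch: the options are embedded by value, so the
// iterator owns its copy and no delete is ever needed.
class ItrObject {
public:
    explicit ItrObject(const ReadOptions& options)  // reference, not pointer
        : m_ReadOptions(options) {}                 // embedded copy
    const ReadOptions& options() const { return m_ReadOptions; }
private:
    ReadOptions m_ReadOptions;  // by value: lifetime tied to the iterator
};
```

The embedded copy is safe precisely because the copied snapshot pointer is never dereferenced outside the scope that set it.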
The PREFETCH_STOP atom and supporting code address the multi-positioning of an iterator in prefetch mode. The caller is required to issue a Prefetch_Stop action after the last Prefetch action and before the next Seek action. The Prefetch_Stop action "eats" any pending prefetch operation and prevents a subsequent one from starting, i.e. it breaks and cleans up the prior prefetch cycle.
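The required call sequence can be sketched as a tiny state machine (the Action and PrefetchState names are illustrative; eleveldb's actual NIF implementation differs):

```cpp
#include <cassert>

// Hypothetical model of the prefetch cycle that PREFETCH_STOP breaks.
enum class Action { SEEK, PREFETCH, PREFETCH_STOP };

class PrefetchState {
public:
    bool active() const { return active_; }
    void Apply(Action a) {
        switch (a) {
            case Action::PREFETCH:
                active_ = true;   // a read-ahead of the next key is in flight
                break;
            case Action::PREFETCH_STOP:
                active_ = false;  // "eat" the pending prefetch result
                break;
            case Action::SEEK:
                // The original bug: re-seeking while a prefetch was in
                // flight consumed a stale prefetched key. Callers (e.g. AAE
                // with multiple starting points) must stop prefetch first.
                assert(!active_ && "PREFETCH_STOP required before re-seek");
                break;
        }
    }
private:
    bool active_ = false;
};
```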
Two functions that previously took a leveldb::ReadOptions pointer argument now take a leveldb::ReadOptions reference. Similarly, the previous delete of an m_ReadOptions pointer is removed since the object is now embedded.
All usage of leveldb::ReadOptions pointers is converted to embedded copies of leveldb::ReadOptions. Various code clean-ups result.
Minor code adjustments to switch from leveldb::ReadOptions pointers to embedded copies.
Add PREFETCH_STOP cases to iterator retrieval and iterator_refresh logic paths.
All usage of leveldb::ReadOptions pointers is converted to embedded copies of leveldb::ReadOptions. Various code clean-ups result.
Add test cases that illustrate prefetch_next bug.
Previous leveldb branches had significant work to better manage the page cache via fadvise. A quick summary: the code tells the operating system to flush everything out of the cache but a newly created .sst table file's metadata. That leaves more page cache space for random read buffering. This works great on fast drive arrays, and especially improves performance as datasets grow beyond the size of physical RAM. Yet it worsens overall throughput on slow arrays, because they reread recent level-0 and level-1 files for compactions during heavy write loads.
This one-line change instructs the buffered write code to tell the operating system to keep the entirety of a newly created level-0 file in the page cache.
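Something along these lines, assuming posix_fadvise is the underlying mechanism (the helper name and its placement are illustrative; the branch's actual change is a single line in the buffered-write path):

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// After writing an .sst file, advise the kernel whether to keep its pages.
// keep_in_cache would be true for level-0 files under this branch's change.
int AdvisePageCache(int fd, bool keep_in_cache) {
#if defined(POSIX_FADV_DONTNEED)
    if (!keep_in_cache) {
        // Drop the file's data pages, leaving room for random-read buffering.
        return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    }
    // Level-0 files are reread almost immediately by the next compaction,
    // so keeping them resident avoids the reread penalty on slow arrays.
    return posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
#else
    (void)fd; (void)keep_in_cache;
    return 0;  // platforms without posix_fadvise: no-op
#endif
}
```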
A corruption test, originally written by Google, started failing when I adjusted the kL0_CompactionTrigger constant. Review of the code suggests the test was written against the wrong constant. I corrected the constant and everything got happy.
The first change in this file is a parallel of the change discussed for builder.cc. The builder.cc change only impacted level-0 files. The change here applies the same page cache management to level-1 and level-2 files.
The second change is specific to Riak's active anti-entropy (AAE) feature. Riak 1.3 gave AAE special consideration for its write batches when applying the write throttle. In Riak 1.4, that special consideration became unnecessary after various performance improvements to leveldb, and it was reducing performance of 2i operations, so Riak 1.4 removed it. Riak 2.0 needs AAE's special consideration again. This line applies the consideration only to internal AAE databases, not to user databases (which is where the 2i operations occur).
Raises kL0_CompactionTrigger from Google's original 4 to 6. This constant defines when to initiate a compaction in level-0 (and also in level-1 for Basho's leveldb). level-0's compaction is triggered by file count, not by the total size of all files in the level; total size is the trigger in all other levels. The alternative to raising this constant is to raise the write_buffer sizes. But increasing write_buffer sizes also increases the amount of physical RAM required by each leveldb database (Riak vnode). Such a memory increase would hurt some users with higher vnode counts and/or lower amounts of physical RAM. Therefore the constant change is the safer solution for the Riak environment.
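The trigger split described above can be sketched as follows (the helper name and signature are illustrative; only the constant's value comes from this branch):

```cpp
#include <cassert>
#include <cstdint>

static const int kL0_CompactionTrigger = 6;  // raised from Google's 4

// Level-0 (and, in Basho's leveldb, level-1) compacts on *file count*;
// deeper levels compact when total bytes exceed a per-level size target.
bool NeedsCompaction(int level, int file_count, uint64_t total_bytes,
                     uint64_t level_size_target) {
    if (level <= 1) {
        return file_count >= kL0_CompactionTrigger;
    }
    return total_bytes >= level_size_target;
}
```

Raising the file-count trigger lets more small level-0 files accumulate before a compaction fires, without growing each write_buffer (and thus per-vnode RAM).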