ToplingZipTable
In ToplingZipTable, there are two core concepts: CO-Index and PA-Zip
- CO-Index: Compressed Ordered Index, which maps a ByteArray Key to an integer ID; this ID is then used to access the corresponding Value in PA-Zip.
- PA-Zip: Point Accessible Zip, which can be regarded as an abstract array. Its core function is to access the elements of this abstract array by using the ID as the array subscript; the elements themselves are stored compressed.
CO-Index and PA-Zip together form a logical map<Key, Value>, where Key is RocksDB's InternalKey: the {UserKey, Seq, OpType} triplet.
The most typical CO-Index implementation in ToplingDB is NestLoudsTrie, and the most typical PA-Zip implementation is DictZipBlobStore. Both are memory-compressed: their in-memory form is compressed, and all search and read operations work directly on that compressed form. The compression ratio of CO-Index + PA-Zip is very high, far better than BlockBasedTable + zstd.
The computational overhead of ToplingZipTable compression is relatively large (about twice that of zstd), so in ToplingDB it is mainly configured on the lower LSM levels through DispatcherTable, with compression performed by distributed compaction.
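As a rough illustration of that placement, the sketch below routes the upper LSM levels to a cheaper factory (SingleFastTable, also mentioned under builderMinLevel below) and the lower levels to ToplingZipTable via a DispatcherTable. The level_writers key and the exact per-level mapping are assumptions for illustration only; see the SidePlugin/DispatcherTable documentation for the actual schema.

```yaml
TableFactory:
  fast:
    class: SingleFastTable        # low-overhead factory for the upper levels
    params: {}
  zip:
    class: ToplingZipTable        # heavily compressed, for the lower levels
    params:
      localTempDir: "/tmp"
  dispatch:
    class: DispatcherTable
    params:
      default: fast
      # hypothetical per-level mapping: L0-L2 -> fast, L3 and deeper -> zip
      level_writers: [fast, fast, fast, zip, zip, zip, zip]
```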
configuration item | type | default | explanation |
---|---|---|---|
localTempDir | string | /tmp | Temporary files will be used during the creation of ToplingZipTable SST, here specify the temporary file directory |
enableStatistics | bool | true | Whether to perform performance measurement on SST's Get operation |
keyPrefixLen | int | 0 | DB systems such as MyRocks use fixed-length prefixes (generally 4) to distinguish different tables or indexes. Different tables or indexes generally have different data characteristics, so different compression schemes should be used |
checksumLevel | int | 0 | 0: no checksum; 1: checksum for metadata only; 2: a separate checksum for each piece of data; 3: checksum for the entire file |
warmupLevel | enum | kIndex | Warm up (load into memory) when opening the SST. kNone: do not warm up; kIndex: warm up the index; kValue: warm up the entire file (including value content) |
debugLevel | int | 0 | mainly for testing |
sampleRatio | float | 0.03 | Sampling rate, since Value's global compression requires sampling |
minPreadLen | int | 0 | When page faults are frequent, pread performs better because it issues fewer IOs and avoids the overhead of creating PTEs. This parameter controls when pread is used. < 0: never use pread; == 0: always use pread; > 0: use pread when the length to read is greater than this value |
minPrefetchPages | int | 0 | When reading a value from mmap, if the value is large in the file (at least crossing a page boundary), this many pages are prefetched at a time to reduce frequent page faults during random access. 0 disables the feature, since calls to MADV_POPULATE_READ also have overhead, which is unnecessary when page faults are rare |
builderMinLevel | int | 0 | This level is the boundary within the LSM: levels above it do not use ToplingZipTable, while this level and below do. When distributed compaction fails and falls back to local compaction, it consumes compute resources on the DB node, and those resources are in short supply; at that point we prefer to build the SST with another, lower-overhead TableFactory (such as SingleFastTable). This parameter exists mainly for that purpose |
indexType | string | Mixed_XL_256_32_FL | The default NestLoudsTrie type. NestLoudsTrie can use different Rank-Select implementations; this is mainly for testing, use the default value normally |
indexNestLevel | int | 3 | Maximum number of nesting levels for NestLoudsTrie index |
indexNestScale | int | 8 | Each time NestLoudsTrie nests one level deeper, the size of the deeper layer shrinks; when it is reduced to a fraction of the outermost layer, nesting stops |
indexCacheRatio | float | 0 | NestLoudsTrie's underlying Select operation can be accelerated by a cache; this is the cache ratio. Generally set it to at most 0.01, which speeds up searches by about 10%; a larger cache brings little additional benefit |
indexTempLevel | int | 0 | When building a NestLoudsTrie, use temporary files to reduce memory usage; the more temporary files are used, the less memory is needed |
indexMemAsResident | bool | false | Make index resident in memory |
indexMemAsHugePage | bool | false | Make index use hugepage |
speedupNestTrieBuild | bool | true | The optimization parameters when NestLoudsTrie is created, just keep the default |
optimizeCpuL3Cache | bool | true | Value global compression uses a multi-threaded pipeline. The compression algorithm's in-memory dictionary is large and accesses within it are very random; this option keeps the data of a single SST being compressed together as much as possible, to improve CPU L3 cache utilization. This parameter can be left at its default |
bytesPerBatch | int | 256K | When compressing values, each task in the pipeline is a batch; this is the total size of all values in a single batch |
recordsPerBatch | int | 500 | When compressing values, the upper limit on the number of values in a single batch |
entropyAlgo | enum | kNoEntropy | After compressing the value with the global dictionary, compress it again with entropy coding. The compression gain from entropy coding is small, but the decompression/read overhead is very high, so it is disabled by default and enabling it is not recommended. Other possible values: kHuffman, kFSE |
offsetArrayBlockUnits | int | 0 | Variable-length values are located through an Offset array, and each value's length is computed as the difference between adjacent offsets. This array can be compressed with PForDelta; this option sets the number of elements in each PForDelta block. 0 means no compression; to compress, 128 is preferred, 64 is also allowed, and no other values may be used |
minDictZipValueSize | int | 30 | When the average value length is less than this value, values are not compressed |
keyRankCacheRatio | float | 0 | Used to speed up ApproximateOffsetOf. 0 means disabled; a non-zero value is the overall sampling ratio used to build the cache |
acceptCompressionRatio | float | 0.8 | The ratio is size after compression / size before compression; when the values' compression ratio is worse than this, compression is abandoned |
nltAcceptCompressionRatio | float | 0.4 | When the compression ratio of NestLoudsTrie is too poor, give up using this index and use other types of indexes instead |
softZipWorkingMemLimit<br>hardZipWorkingMemLimit<br>smallTaskMemory | uint64 | 16G<br>32G<br>1.2G | When multiple compactions run concurrently, each needs memory, so a limit is required. When the expected memory usage exceeds the soft limit, a single new task is allowed to run only if its expected memory usage does not exceed smallTaskMemory; when the hard limit is reached, no new task is allowed to run (see the sketch after this table) |
fileWriterBufferSize | int | 128K | write buffer size |
fixedLenIndexCacheLeafSize | int | 512 | For FixedLenKeyIndex, the leaf node size of its double-array query cache; the larger the leaf nodes, the smaller the cache. Keep the default |
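To make the relationship among the three working-memory limits above concrete, here is a minimal sketch that scales them down for a compaction node with less RAM; the specific values (8G / 16G / 512M) are illustrative assumptions, not recommendations.

```yaml
TableFactory:
  zip:
    class: ToplingZipTable
    params:
      # illustrative values for a smaller node: soft limit < hard limit,
      # and smallTaskMemory well below the soft limit
      softZipWorkingMemLimit: 8G
      hardZipWorkingMemLimit: 16G
      smallTaskMemory: 512M
```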
ToplingZipTable is configured through SidePlugin. In the (yaml) configuration file, an example is as follows:
```yaml
TableFactory:
  zip:
    class: ToplingZipTable
    params:
      localTempDir: "/dev/shm/tmp"
      indexType: Mixed_XL_256_32_FL
      indexNestLevel: 3
      indexNestScale: 8
      indexTempLevel: 0
      indexCacheRatio: 0
      warmupLevel: kIndex
      compressGlobalDict: false
      optimizeCpuL3Cache: true
      enableEntropyStore: false
      offsetArrayBlockUnits: 128
      sampleRatio: 0.01
      checksumLevel: 0
      entropyAlgo: kNoEntropy
      debugLevel: 0
      softZipWorkingMemLimit: 16G
      hardZipWorkingMemLimit: 32G
      smallTaskMemory: 1G
      minDictZipValueSize: 30
      keyPrefixLen: 0
      minPreadLen: 64
```
For complete configuration, please refer to lcompact_enterprise.yaml
In DispatcherTable, multiple ToplingZipTables can be configured with different compression options, for example:
```yaml
TableFactory:
  lightZip:
    class: ToplingZipTable
    params:
      localTempDir: "/dev/shm/tmp"
      indexNestLevel: 3
      indexNestScale: 8
      minDictZipValueSize: 10M
```
Setting minDictZipValueSize to a large value means that data whose average value length is less than 10M will not be compressed, which is suitable for the upper levels of the LSM (such as levels 2 and 3). Skipping compression not only reduces CPU consumption during compaction but also greatly improves read performance: it saves the decompression step and allows ZeroCopy, returning the mmap memory where the value is stored in the SST directly to user code.
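Such differently tuned factories can then both be referenced from a DispatcherTable, so that the non-compressing lightZip serves the upper levels and a fully compressing factory serves the deeper levels. As in the earlier sketch, the level_writers mapping below is an assumption for illustration; consult the DispatcherTable documentation for the exact schema.

```yaml
TableFactory:
  lightZip:
    class: ToplingZipTable
    params:
      localTempDir: "/dev/shm/tmp"
      minDictZipValueSize: 10M    # effectively disables value compression
  zip:
    class: ToplingZipTable
    params:
      localTempDir: "/dev/shm/tmp"
      sampleRatio: 0.01
  dispatch:
    class: DispatcherTable
    params:
      default: lightZip
      # hypothetical mapping: upper levels -> lightZip, deeper levels -> zip
      level_writers: [lightZip, lightZip, lightZip, lightZip, zip, zip, zip]
```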