Skip to content

Key Comparator

ZengJingtao edited this page Jan 30, 2023 · 2 revisions

Background

We know that whether it is the std::map, std::sort of the standard library, or the qsort, bsearch ... of the C lib, there is an essential Comparator, which defines the order of the Keys. In general, the default Comparator is "lexicographic order by byte", such as the default comparison operation of std::string, and memcmp of C lib.

The Comparator of RocksDB (from LevelDB) is not fundamentally different from other Comparators, except that it defines an additional Name for the Comparator. The greatest significance of this Name is that it specifies the behavior of the Comparator. If there is only one Comparator function pointer, we have no way of knowing what behavior the implementation of this function corresponds to. Another significance of Name is that it can be persisted.

RocksDB's own default Comparator is "lexicographic order by byte", and its name is leveldb.BytewiseComparator. As long as we see this name, we know that no matter how it is implemented, its behavior must be "lexicographic order by byte".

Topling

Regardless of whether it is Topling CSPPMemTab or Topling SST, the order of Keys is and must be "lexicographical order by byte". We only need to judge the Name of Comparator to know whether to use Topling's MemTable/SST or fallback to RocksDB's own MemTable/SST.

On the surface, this limits flexibility, but in fact, this flexibility relying on Comparator is unnecessary most of the time, because we can always use other means to achieve this flexibility, such as By encoding the Key, the encoded order is byte order, and this order is equivalent to the order of the Comparator before encoding:

  • MongoDB has an encoder that implements this encoding, so that in MongoRocks, only the default leveldb.BytewiseComparator of RocksDB is needed
  • In MyRocks (MyRocks + RocksDB), such an encoder is also implemented, which encodes various Index Keys into "lexicographical order by byte"
    • For the Collation/Comparator that ignores case, the Key is uniformly encoded (normalized) to all uppercase, and the information needed to decode each Key (restore to the original Key) is saved in Value (the Value of KeyValue).
    • Although the MyRocks (default) Comparator is "bytewise lexicographically", another Name is used instead of leveldb.BytewiseComparator

Reversed lexicographic order by byte

The most annoying thing is that in order to optimize the performance of reverse scanning (such as ORDER BY ... DESC), MyRocks provides a "Reversed lexicographic order by byte" Comparator. The reason is a bit ridiculous but helpless:

  • The reverse scanning (Iterator.Prev) performance of RocksDB's default BlockBasedTable is too poor
  • By using the "Reversed lexicographic order by byte" Comparator, the reverse scan becomes a forward scan (Iterator.Next)...

The natural order of our (Topling) MemTable and SST is "forward lexicographical order by byte", and neither has the problem of low performance of reverse scanning (Iterator.Prev). But if we want to be 100% compatible with MyRocks, we still have to Support "Reversed lexicographic order by byte". Although the principle is similar to that of "forward lexicographical order by byte", the implementation of "Reversed lexicographic order by byte" is very cumbersome, and the implementation of code is even more difficult to be elegant.

In fact, even if the order is to be reversed, it is easy to use "forward lexicographical order by byte":

  • For fixed-length fields (various integers, floating-point numbers...), you only need to invert each byte when encoding
  • For variable-length fields (mainly variable-length String), ORDER BY ... DESC is a very strange requirement, no need to optimize

Internal Comparator

RocksDB is based on MVCC. It adds a (SeqNum, OpType) to the Key provided by the user. SeqNum always increases, and OpType is an enumeration {PUT, DELETE, MERGE,...}.

Therefore, its Comparator is also divided into User Comparator and Internal Comparator. User Comparator is the Comparator provided by the user. Internal Comparator is only based on User Comparator, adding the comparison of SeqNum and OpType.

In fact, for "forward lexicographical order by byte", use the idea of ​​encoding) to encode SeqNum+OpType, and the implementation of Internal Comparator and User Comparator is exactly the same!