Add initial implementation of Proximity Map #8408
Open
+1,443
−16
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This adds a new message/node type: Vector Index, and a corresponding prolly-tree-based map structure: the Proximity Map
The Vector Index message currently has a subset of the Prolly Map message. A different message type was chosen to prevent them from being accidentally confused: while a Vector Index has the same fields as a Prolly Map, and has similar computed properties, their invariants and iteration order are completely different, and algorithms written for Prolly Map nodes should not accidentally operate on Vector Index nodes instead. Old versions of Dolt that do not support vector indexes should not attempt to manipulate Vector Index nodes as if they were Prolly Map nodes.
Proximity Maps resemble other Prolly Maps, but have the following invariants:
val.Tuple
, and the vector is the first value in that tuple.Notably, while the keys of an individual node are sorted, walking all of a vector indexes keys in standard iteration order will not be sorted.
This is a useful construct because it allows for efficient proximity-based lookup, which are instrumental for quickly running "approximate nearest neighbor" algorithms.
Currently, chunk boundaries are computed completely deterministically. No rolling hash or weibull distribution is used to control chunk sizes. This was chosen because it has the simplest guarentees about chunk boundaries, which makes it easier to reason about the cost of fixup operations, and makes cascading chunk boundary changes impossible: for example, a key that marks a chunk boundary on the first two levels will always mark a chunk boundary on the first two levels even if it gets moved when fixing up the tree.
This has its downsides: the chunk sizes follow a geometric distribution with no maximum or minimum size. This may change in the future, but is acceptable as a first pass.