Add initial implementation of Proximity Map #8408

nicktobey · 2024-10-01T21:47:12Z

This adds a new message/node type: Vector Index, and a corresponding prolly-tree-based map structure: the Proximity Map.

The Vector Index message currently contains a subset of the fields of the Prolly Map message. A separate message type was chosen so that the two cannot be accidentally confused: although a Vector Index has the same fields as a Prolly Map and similar computed properties, its invariants and iteration order are completely different, and algorithms written for Prolly Map nodes should not accidentally operate on Vector Index nodes. Likewise, old versions of Dolt that do not support vector indexes should not attempt to manipulate Vector Index nodes as if they were Prolly Map nodes.

Proximity Maps resemble other Prolly Maps, but have the following invariants:

Each key must be convertible to a vector. Typically, the key is a val.Tuple, and the vector is the first value in that tuple.
The keys are arranged in the tree such that, for each of a key's parent keys (the keys that appear on the path from the root to the key), the key is closer to that parent key than to any of the parent key's siblings.
The keys in a node are sorted...
...except for the first key which matches its direct parent. (This may prove to be unnecessary and could potentially be relaxed.)

Notably, while the keys of an individual node are sorted, walking all of a vector index's keys in standard iteration order does not visit them in sorted order.

This is a useful construct because it allows for efficient proximity-based lookups, which are instrumental for quickly running "approximate nearest neighbor" algorithms.
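As a rough illustration of why the invariants above help with lookups, here is a minimal sketch of a greedy descent over a toy in-memory tree. The `node` type, `approxNearest`, and `exampleTree` are hypothetical names made up for this example and are not the types in this PR; the sketch also assumes squared Euclidean distance, whereas the real map takes a configurable distance type.

```go
package main

import (
	"fmt"
	"math"
)

// node is a hypothetical in-memory stand-in for a Proximity Map entry:
// a vector key plus, for internal entries, the child entries beneath it.
type node struct {
	key      []float64
	children []*node
}

// l2Squared returns the squared Euclidean distance between a and b.
func l2Squared(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// approxNearest descends from the root, at each level following the
// child whose key is closest to query. Because every key is closer to
// its parent than to the parent's siblings, this tends to land near the
// true nearest neighbor, but it is only approximate.
func approxNearest(root *node, query []float64) []float64 {
	cur := root
	for len(cur.children) > 0 {
		var best *node
		bestDist := math.Inf(1)
		for _, c := range cur.children {
			if d := l2Squared(c.key, query); d < bestDist {
				best, bestDist = c, d
			}
		}
		cur = best
	}
	return cur.key
}

// exampleTree builds a two-level tree. The first child of each parent
// repeats the parent's key, mirroring the "first key matches its direct
// parent" invariant.
func exampleTree() *node {
	a := &node{key: []float64{0, 0}, children: []*node{
		{key: []float64{0, 0}}, {key: []float64{1, 1}}, {key: []float64{2, 0}},
	}}
	b := &node{key: []float64{10, 10}, children: []*node{
		{key: []float64{10, 10}}, {key: []float64{9, 8}},
	}}
	return &node{children: []*node{a, b}}
}

func main() {
	fmt.Println(approxNearest(exampleTree(), []float64{1.2, 0.9})) // prints [1 1]
}
```

Each step only compares the query against the keys of a single node, so the cost of a lookup grows with the depth of the tree rather than with the total number of keys.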

Currently, chunk boundaries are computed completely deterministically. No rolling hash or Weibull distribution is used to control chunk sizes. This was chosen because it has the simplest guarantees about chunk boundaries, which makes it easier to reason about the cost of fixup operations and makes cascading chunk boundary changes impossible: for example, a key that marks a chunk boundary on the first two levels will always mark a chunk boundary on the first two levels, even if it gets moved when fixing up the tree.

This has its downsides: the chunk sizes follow a geometric distribution with no maximum or minimum size. This may change in the future, but is acceptable as a first pass.
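To make the "deterministic boundaries" idea concrete, here is a small sketch of one way a key's level could be derived purely from a hash of the key. This is an assumption for illustration, not the splitter in this PR: `bitsPerLevel`, `levelFromHash`, and `keyLevel` are invented names, and SHA-256 is used only because it is in the standard library.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// bitsPerLevel is a hypothetical tuning knob: a key reaches each
// additional level with probability 2^-bitsPerLevel.
const bitsPerLevel = 2

// levelFromHash maps a 64-bit hash to a level by counting leading zero
// bits. The result depends only on the hash, never on neighboring keys.
func levelFromHash(h uint64) int {
	return bits.LeadingZeros64(h) / bitsPerLevel
}

// keyLevel hashes the key bytes and derives the level from the first
// eight bytes of the digest.
func keyLevel(key []byte) int {
	sum := sha256.Sum256(key)
	return levelFromHash(binary.BigEndian.Uint64(sum[:8]))
}

func main() {
	for _, k := range []string{"alpha", "beta", "gamma"} {
		fmt.Printf("%s -> level %d\n", k, keyLevel([]byte(k)))
	}
}
```

Under this scheme the number of keys between consecutive boundaries at a given level is geometrically distributed, which matches the trade-off described above: boundaries never cascade, but there is no minimum or maximum chunk size.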

coffeegoddd · 2024-10-01T22:18:19Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`6b627d5`	ok	5937457

version	total_tests
`6b627d5`	5937457

correctness_percentage
100.0

coffeegoddd · 2024-10-02T00:53:25Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`91a5a5d`	ok	5937457

version	total_tests
`91a5a5d`	5937457

correctness_percentage
100.0

coffeegoddd · 2024-10-04T19:46:33Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`2a75e85`	ok	5937457

version	total_tests
`2a75e85`	5937457

correctness_percentage
100.0

coffeegoddd · 2024-10-04T19:54:31Z

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`3756ef4`	ok	5937457

version	total_tests
`3756ef4`	5937457

correctness_percentage
100.0

nicktobey · 2024-10-05T00:08:33Z

I refactored the algorithm for creating the map, outlining it into more helper functions and adding more documentation.

I also bumped Go to 1.23 in order to use the new iter package. If this gets approved, we should make sure to roll out a similar bump to our other modules before merging.

There are a couple of potential flaws here, depending on exactly how ProximityMaps get used; both occur if two rows have the same vector value:

The intermediate pathMaps store the original table key tuple as a bytestring. Depending on the field types, this may not have the same ordering as the original table, which may lead to ordering issues if two keys have the same vector.
The pathMaps tables store, for each row, a list of vector hashes corresponding to a path within the Proximity Tree, starting at the root. If the vectors are non-unique, we may need to store a list of secondary index keys instead. These would likely also get stored as a bytestring and may cause pathMaps to have a different iteration order than the original table.
Currently the tree level of each vector is computed by hashing the entire secondary index key. This means that changing non-vector columns in the indexed table may cause the level of these vectors to change. This could be prevented if we only hash the vector. It would mean that two rows with the same vector always have the same maximum level in the Proximity Tree, but I'm unsure if this is a problem or not.
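The ordering concern in the first two points can be shown with a tiny example. This sketch assumes, purely for illustration, a field that is serialized as a little-endian 32-bit integer; `encodeLE` and `bytewiseLess` are hypothetical helpers, not functions from this PR.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeLE serializes v as four little-endian bytes (an assumed
// encoding, chosen only to demonstrate the ordering mismatch).
func encodeLE(v uint32) []byte {
	buf := make([]byte, 4)
	binary.LittleEndian.PutUint32(buf, v)
	return buf
}

// bytewiseLess reports whether a sorts before b when their encodings
// are compared as raw bytestrings.
func bytewiseLess(a, b uint32) bool {
	return bytes.Compare(encodeLE(a), encodeLE(b)) < 0
}

func main() {
	// Numerically 1 < 256, but the encoded bytes compare the other way.
	fmt.Println("typed order:    1 < 256 is", 1 < 256)
	fmt.Println("bytewise order: 1 < 256 is", bytewiseLess(1, 256))
}
```

Whenever the bytewise and typed comparisons disagree like this, two rows with equal vectors could be visited in a different order than in the original table.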

coffeegoddd · 2024-10-05T00:29:05Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`30f8aeb`	ok	5937457

version	total_tests
`30f8aeb`	5937457

correctness_percentage
100.0

reltuk

Everything I looked at looks good. No strong opinions on the flatbuffer organization or the parameterization of the chunker factory. But I skipped all of prolly/proximity_map.go, prolly/tree/proximity_map.go, and deterministic_node_splitter.go for now, so I will need to follow up on those.

reltuk · 2024-10-07T13:08:41Z

go/store/prolly/message/vector_index.go

+	if serial.ProllyTreeNodeNumFields < pm.Table().NumFields() {
+		return nil, fb.ErrTableHasUnknownFields
+	}


Handled by gen/fb/serial/vectorindexnode.go, no?

Looks like you're right. I was copying the code from prolly_map.go, but this seems unnecessary.

coffeegoddd · 2024-10-07T19:29:53Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`9429fa9`	ok	5937457

version	total_tests
`9429fa9`	5937457

correctness_percentage
100.0

coffeegoddd · 2024-10-22T01:17:41Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`50dc53e`	ok	5937457

version	total_tests
`50dc53e`	5937457

correctness_percentage
100.0

coffeegoddd · 2024-10-25T01:36:10Z

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`3e5b4d9`	ok	5937457

version	total_tests
`3e5b4d9`	5937457

correctness_percentage
100.0

coffeegoddd · 2024-10-25T01:44:49Z

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`d539f18`	ok	5937457

version	total_tests
`d539f18`	5937457

correctness_percentage
100.0

reltuk

A few comments for now. Following up a bit on map building and the approach in general offline as well.

reltuk · 2024-11-08T09:25:27Z

go/store/prolly/tree/proximity_map.go

+	"github.com/dolthub/dolt/go/store/hash"
+	"github.com/dolthub/dolt/go/store/prolly/message"
+	"github.com/dolthub/go-mysql-server/sql"
+	"github.com/esote/minmaxheap"


These need some changes to make this compile.

+ "github.com/dolthub/go-mysql-server/sql/expression"
- "sort"
- "fmt"
- "github.com/dolthub/dolt/go/store/prolly/message"

reltuk · 2024-11-08T09:26:08Z

go/store/prolly/tree/proximity_map.go

+	return WalkNodes(ctx, t.Root, t.NodeStore, cb)
+}
+
+// GetExact searches for an exact vector in the index, calling |cb| with the matching key-value pairs.


This seems poorly named? It seems to get the closest / lowest distance vector. And the comment is maybe confusing because it only returns the first one, even if multiple are the same distance?

reltuk · 2024-11-08T09:26:41Z

go/store/prolly/tree/proximity_map.go

+	}
+}
+
+func (t ProximityMap[K, V, O]) Has(ctx context.Context, query K) (ok bool, err error) {


Will this always return true unless the map is empty?
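To spell out the concern: if a membership test is built on a "find the nearest key" lookup, it succeeds for any query as long as the map is non-empty, unless it also checks that the nearest key is an exact match. The sketch below uses a linear scan as a stand-in for the tree lookup; `nearest`, `hasNaive`, and `hasExact` are hypothetical names and do not reflect the actual implementation in this PR.

```go
package main

import (
	"fmt"
	"math"
)

// nearest returns the index of the key closest to query and its squared
// Euclidean distance, or -1 if keys is empty.
func nearest(keys [][]float64, query []float64) (int, float64) {
	best, bestDist := -1, math.Inf(1)
	for i, k := range keys {
		d := 0.0
		for j := range k {
			diff := k[j] - query[j]
			d += diff * diff
		}
		if d < bestDist {
			best, bestDist = i, d
		}
	}
	return best, bestDist
}

// hasNaive reports true whenever any nearest key exists.
func hasNaive(keys [][]float64, query []float64) bool {
	i, _ := nearest(keys, query)
	return i >= 0
}

// hasExact additionally requires the nearest key to be at distance zero.
func hasExact(keys [][]float64, query []float64) bool {
	i, d := nearest(keys, query)
	return i >= 0 && d == 0
}

func main() {
	keys := [][]float64{{0, 0}, {1, 1}}
	query := []float64{5, 5}
	fmt.Println(hasNaive(keys, query), hasExact(keys, query))
}
```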

reltuk · 2024-11-08T09:34:33Z

go/store/prolly/tree/proximity_map.go

+			if err != nil {
+				return err
+			}
+			nextLevelNodes.Insert(keyAndDistance.key, node.GetValue(0), keyAndDistance.distance)


To a reader, this would be more transparently correct without this special case handling of the 0 element here. In particular, while I was reading this, I was thinking about how the parent node key in a regular prolly tree is the last key in the child node, not the first, and in order to verify that the behavior is correct here, I would have to go and verify that we take the first and not the last key in a proximity map instead. It's clearer to me to just start the loop below at 0. (Admittedly potentially less efficient...)

reltuk · 2024-11-08T09:40:15Z

go/store/prolly/proximity_map.go

+		Order:        keyDesc,
+		DistanceType: distanceType,
+		Convert: func(bytes []byte) []float64 {
+			h, _ := keyDesc.GetJSONAddr(0, bytes)


Is it correct to say that the thing we're storing in the keys of the prollytree is an address to a chunk that has the floats stored as a json array of json numbers?

reltuk · 2024-11-08T09:40:58Z

go/store/prolly/proximity_map.go

+			jsonWrapper, err := doc.ToIndexedJSONDocument(ctx)
+			if err != nil {
+				panic(err)
+			}
+			floats, err := sql.ConvertToVector(jsonWrapper)
+			if err != nil {
+				panic(err)
+			}


If my understanding of the above is correct, we can't panic here. It's not a fundamental logic error to see an I/O error fetching these chunks. These errors need to be returned through the interface and visible to a caller.
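One possible shape for this, sketched under assumptions: the conversion callback returns an error alongside the vector, and callers wrap and propagate it. `convertFunc`, `distanceTo`, `errChunkUnavailable`, and the two example converters are hypothetical names for illustration; the real callback would also need whatever context and node store access the chunk fetch requires.

```go
package main

import (
	"errors"
	"fmt"
)

// errChunkUnavailable stands in for an I/O failure while fetching the
// chunk that holds a vector.
var errChunkUnavailable = errors.New("chunk unavailable")

// convertFunc is a hypothetical error-returning variant of the Convert
// callback: it turns key bytes into a vector or reports why it cannot.
type convertFunc func(key []byte) ([]float64, error)

// distanceTo converts key and returns its squared Euclidean distance to
// query, propagating conversion failures instead of panicking.
func distanceTo(convert convertFunc, key []byte, query []float64) (float64, error) {
	vec, err := convert(key)
	if err != nil {
		return 0, fmt.Errorf("converting key to vector: %w", err)
	}
	if len(vec) != len(query) {
		return 0, fmt.Errorf("dimension mismatch: %d vs %d", len(vec), len(query))
	}
	sum := 0.0
	for i := range vec {
		d := vec[i] - query[i]
		sum += d * d
	}
	return sum, nil
}

// failingConvert simulates a chunk fetch that fails.
func failingConvert(key []byte) ([]float64, error) { return nil, errChunkUnavailable }

// fixedConvert simulates a successful fetch that yields the vector (3, 4).
func fixedConvert(key []byte) ([]float64, error) { return []float64{3, 4}, nil }

// isUnavailable reports whether err wraps errChunkUnavailable.
func isUnavailable(err error) bool { return errors.Is(err, errChunkUnavailable) }

func main() {
	_, err := distanceTo(failingConvert, []byte("k"), []float64{0, 0})
	fmt.Println("error surfaced to caller:", err)
}
```

Wrapping with `%w` keeps the underlying cause inspectable through `errors.Is`, so a caller can still distinguish an I/O failure from a malformed vector.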

nicktobey added 2 commits October 1, 2024 14:26

Add Vector Index message type.

1723a0a

Add node splitter parameter to chunker.

83eeee3

coffeegoddd added the correctness_approved label Oct 1, 2024

nicktobey force-pushed the nicktobey/proximity-map branch from 6b627d5 to 91a5a5d Compare October 2, 2024 00:22

Add ProximityMap class.

2a75e85

nicktobey force-pushed the nicktobey/proximity-map branch from 91a5a5d to 2a75e85 Compare October 4, 2024 19:15

[ga-format-pr] Run go/utils/repofmt/format_repo.sh and go/Godeps/upda…

3756ef4

…te.sh

nicktobey added 2 commits October 4, 2024 16:56

Bump go version.

d50c707

Refactor ProximityMap build algorithm to make it more readable.

30f8aeb

nicktobey requested a review from reltuk October 4, 2024 23:57

reltuk reviewed Oct 7, 2024

Remove redundant check in vector_index.go

9429fa9

Refactor proximity map creation algorithm to accept an iterator.

50dc53e

nicktobey and others added 2 commits October 24, 2024 18:03

Remove unused FixupPromityMap function and its associated classes.

3e5b4d9

[ga-format-pr] Run go/utils/repofmt/format_repo.sh and go/Godeps/upda…

d539f18

…te.sh

Support getting the N closest vectors in a ProximityMap (Adds depende…

0af74b8

…ncy on github.com/esote/minmaxheap)

reltuk reviewed Nov 11, 2024
