Tracking issue: ANN (Approximate Nearest Neighbors) Index #25

Open · asg017 opened this issue Jun 21, 2024 · 15 comments

@asg017 (Owner) commented Jun 21, 2024

sqlite-vec as of v0.1.0 will be brute-force search only, which slows down on large datasets (>1M vectors with large dimensions). I want to include some form of approximate nearest neighbors (ANN) search before v1, which trades accuracy/resource usage for speed.

This issue is a general "tracking issue" for how ANN will be implemented in sqlite-vec. The open questions I have:

Which ANN index should we use?

We want something that fits well with SQLite - meaning storing data in shadow tables, data that fits in pages, low memory usage, etc.

The main options I see:

  • IVF: Like Faiss IndexIVF, pre-compute centroids of a training dataset, search nearest N centroids, etc.
  • HNSW: Should "scale" better, but has way more params to tune. Would be pretty complicated to implement.
  • DiskANN: Kinda dig the simplicity of this, especially LM-DiskANN.

Unsure which one will turn out best; will need to research more. It's possible we add support for all of these options.
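
For reference, here's a minimal numpy sketch of the IVF idea (every name and parameter below is made up for illustration): train k-means centroids once over the data, store one inverted list of vector ids per centroid, and at query time scan only the lists whose centroids are nearest the query.

import numpy as np

def train_ivf(vectors, n_lists=256, n_iters=10):
    # Naive k-means: returns centroids plus one inverted list per centroid.
    rng = np.random.default_rng(0)
    centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each vector to its nearest centroid (squared L2 via dot products).
        d2 = ((vectors ** 2).sum(1, keepdims=True)
              - 2 * vectors @ centroids.T
              + (centroids ** 2).sum(1))
        assign = d2.argmin(axis=1)
        for i in range(n_lists):
            members = vectors[assign == i]
            if len(members):
                centroids[i] = members.mean(axis=0)
    lists = [np.flatnonzero(assign == i) for i in range(n_lists)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k=10, nprobe=8):
    # Probe only the nprobe nearest inverted lists instead of the full dataset.
    near = np.argsort(((centroids - query) ** 2).sum(1))[:nprobe]
    cand = np.concatenate([lists[i] for i in near])
    dist = ((vectors[cand] - query) ** 2).sum(1)
    return cand[np.argsort(dist)[:k]]

Recall then depends on nprobe: probing more lists scans a bigger fraction of the data and improves accuracy at the cost of speed.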

How should one "declare" an index?

SQLite doesn't have custom indexes, so I think the best way would be to include index info in the CREATE VIRTUAL TABLE constructor. Like:

create virtual table vec_movies using vec0(
  synopsis_embeddings float[768] INDEXED BY diskann(...)
);

or:

create virtual table vec_movies using vec0(
  synopsis_embeddings float[768] index=hnsw(...)
);

The syntax heavily depends on which ANN index we pick. Also, how would training work?

How would they work with metadata filtering?

How do we allow brute-force + ANN on the same table?

How do we pick between KNN/ANN in a SQL query?
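
As a strawman for that last question, the sketch below keeps the existing MATCH + k query shape for exact KNN and opts into the index with an extra constraint. The ann = 1 constraint is invented syntax for illustration only, and the snippet assumes the sqlite-vec extension and the vec_movies table from above:

import json
import sqlite3

db = sqlite3.connect("movies.db")
db.enable_load_extension(True)
# db.load_extension("vec0")  # path to the sqlite-vec extension

query_vector = json.dumps([0.0] * 768)  # placeholder 768-dim query

# Exact (brute-force) KNN, as vec0 queries look today:
exact = db.execute(
    "SELECT rowid, distance FROM vec_movies "
    "WHERE synopsis_embeddings MATCH ? AND k = 10",
    (query_vector,),
).fetchall()

# Hypothetical: the same query, routed through the ANN index instead.
approx = db.execute(
    "SELECT rowid, distance FROM vec_movies "
    "WHERE synopsis_embeddings MATCH ? AND k = 10 AND ann = 1",
    (query_vector,),
).fetchall()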

asg017 pinned this issue Jun 23, 2024

@asg017 (Owner, Author) commented Jun 25, 2024

usearch is great! But it doesn't offer many "hooks" into its storage engine, which would be required for sqlite-vec: we'd want to store the index inside SQLite tables and balance query time with random-lookup time. Also, the usearch SQLite functions are just scalar functions; nothing there accesses the HNSW index.

Also, I want to keep sqlite-vec as lightweight as possible: it has no outside dependencies and is a single .c/.h file. So I don't want to complicate things with a C++ dependency.

@irowberryFS

+1 for LM-DiskANN

@di-sukharev commented Sep 2, 2024

Yo, let's do this index! :)

@Boscop commented Sep 23, 2024

@irowberryFS Interestingly, libSQL uses LM-DiskANN for the vector index:
https://turso.tech/blog/approximate-nearest-neighbor-search-with-diskann-in-libsql

@asg017 (Owner, Author) commented Oct 8, 2024

For those of you who want this, I'd love to hear your thoughts on these questions:

  • When does the vec0 brute-force approach start to break down for you? How many vectors are you storing and how many dimensions, and how long does a KNN search take?
  • Have you tried scalar/binary quantization to reduce search times? Is the quality reduction good enough for your use case? (A rough sketch of binary quantization follows this list.)
  • Would "fast inserts" be required in your applications? Some ANN indexes (e.g., IVF) require a "training process" before inserting real vectors that can take several seconds or minutes, while other ANN indexes (e.g., DiskANN) can insert in "real time" but individual inserts are slow because they require a graph traversal. Would your setup prefer one approach over the other?
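
On the second question, here is a rough numpy sketch of what binary quantization does (illustrative only, not sqlite-vec API): keep just the sign bit of each dimension, which cuts storage 32x and turns the distance scan into XOR + popcount.

import numpy as np

def binary_quantize(vectors):
    # Keep only the sign bit of each dimension, packed 8 dimensions per byte.
    return np.packbits(vectors > 0, axis=1)

def hamming_search(query_bits, db_bits, k=10):
    # XOR then popcount = Hamming distance between the packed bit vectors.
    dist = np.unpackbits(np.bitwise_xor(db_bits, query_bits), axis=1).sum(axis=1)
    return np.argsort(dist)[:k]

# Example on random data: 100k 768-dim float32 vectors become 100k x 96 bytes.
vecs = np.random.randn(100_000, 768).astype(np.float32)
db_bits = binary_quantize(vecs)
query_bits = binary_quantize(np.random.randn(1, 768))
top10 = hamming_search(query_bits, db_bits)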

@jlarmstrongiv commented Oct 8, 2024

> When does the vec0 brute-force approach start to break down for you? How many vectors are you storing and how many dimensions, and how long does a KNN search take?

About 20–50 billion total (max vectors × dimensions; roughly 26–65M vectors at 768 dimensions, for example). However, I also split the large index into hundreds of smaller indexes, each holding up to 50 million (max vectors × dimensions), and run searches on those when possible. I haven't bothered trying the brute-force approach.

Benefits for me using sqlite-vec with an ANN Index include:

  • The ability to attach databases and combine indexes together and join tables across databases easily
  • Easily link metadata and ids with indexes
  • Compatibility with Node.js on the server and WASM in the browser without using separate libraries

> Have you tried scalar/binary quantization to reduce search times? Is the quality reduction good enough for your use case?

I have tried it. The larger the vectors I start with, the better the results after quantization. However, I like to avoid quantization when possible, especially on smaller indexes. Unsurprisingly, there is a noticeable drop in quality.

Would "fast inserts" be required in your applications? Some ANN indexes (ex IVF) require a "training process" before inserting real vectors that can take several seconds or minutes, and other ANN indexes can insert in "real-time" (ex DiskANN) but are individually slow as they require a graph traversal. Would your setup prefer one approach over the other?

Inserts must complete in under 15 minutes (hard limit), and under 30 seconds with a small index.


I am currently using HNSW. There are libraries that work with both the Node.js C++ API and WASM in the browser. They were painful to implement and integrate, but they work.

I am already using SQLite WASM and HNSW in the browser separately, so sqlite-vec with HNSW in the browser together would be perfection.

As nice as it would be to have a simpler implementation, HNSW works for me and is my preferred ANN index.

Being able to extract the HNSW index out of SQLite into its own file (without extracting individual vectors and rebuilding the index) is also important to me. I am able to convert the HNSW file to and from the Node.js C++ API and WASM libraries, as they are, unfortunately, saved and formatted differently.

@imotai commented Oct 9, 2024

+1 for LM-DiskANN

@lsorber commented Oct 9, 2024

> When does the vec0 brute-force approach start to break down for you? How many vectors are you storing and how many dimensions, and how long does a KNN search take?

Short answer

Brute force scales up to 8000 500-word pages or 800 10-page articles on commodity hardware.

Longer answer

Assuming retrieval should take no longer than 100ms, the brute force approach can handle about 250k embeddings on commodity hardware:

import numpy as np

N = 250_000  # Number of embeddings
d = 1024  # The 'median' embedding dimension [1]

embeddings = np.random.randn(N, d).astype(np.float32)
query = np.random.randn(d).astype(np.float32)

%timeit embeddings @ query  # Core of the brute force approach

# Output on a Google Colab instance:
# 85 ms ± 9.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And assuming you store one embedding per sentence (which would be standard in multi-vector retrieval), that's the equivalent of about 8000 500-word pages, or 800 10-page articles, or 40 200-page books.

> Have you tried scalar/binary quantization to reduce search times?

I see binary embeddings as an optimisation, not a substitute for floating-point embeddings. This optimisation could help scale brute force by another factor of 2 without sacrificing retrieval quality [2], but definitely not by an order of magnitude. For that, we'd need an ANN index.
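
Concretely, the usual way to use binary embeddings as an optimisation rather than a substitute is a two-stage search: oversample candidates with a cheap Hamming scan over the binary codes, then rescore only that shortlist with the full float32 vectors. A rough numpy sketch (function and parameter names are mine):

import numpy as np

def two_stage_search(query, vectors, db_bits, k=10, oversample=4):
    # Stage 1: cheap Hamming scan over the packed sign bits of all vectors.
    query_bits = np.packbits(query > 0)
    dist = np.unpackbits(np.bitwise_xor(db_bits, query_bits), axis=1).sum(axis=1)
    cand = np.argsort(dist)[:k * oversample]
    # Stage 2: exact float32 rescoring of the shortlist only.
    scores = vectors[cand] @ query
    return cand[np.argsort(-scores)[:k]]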

Would "fast inserts" be required in your applications?

I'd be happy to sacrifice some insert speed in favour of sub-100ms near-100% recall retrieval for 1–10M+ embedding datasets.

[1] https://huggingface.co/spaces/mteb/leaderboard
[2] https://blog.pgvecto.rs/my-binary-vector-search-is-better-than-your-fp32-vectors

@conradev commented Oct 31, 2024

I'd like to throw my hat into the ring for LM-DiskANN.

Brute-force search works great in LanceDB because it uses a contiguous columnar format. In SQLite there is an upper bound on performance because the data is inherently fragmented. By using DiskANN, you can actually leverage this fragmentation and use the B-tree to do the lookups for nearest neighbors. It seems uniquely suited to SQLite's storage engine.

Further, if you store vectors and edges in separate columns, you could use SQLite to find nodes whose edges haven't been computed yet and integrate them into the graph incrementally. In my understanding, this was one of the "flaws" of sqlite-vss: the FAISS index is one big blob, and SQLite is not designed to store one big blob, hence the faiss_ondisk option.
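
To illustrate, here is a hypothetical sketch of that layout in Python + sqlite3 (the vec_nodes schema and every name below are invented; this is not sqlite-vec's or libSQL's actual format): vectors and adjacency lists live in separate columns of one table, each node visit during a DiskANN-style greedy search is a single B-tree point lookup, and unlinked nodes can be found with a plain query.

import heapq
import json
import sqlite3

import numpy as np

db = sqlite3.connect(":memory:")
# Hypothetical shadow table: one row per graph node.
db.execute("""
    CREATE TABLE vec_nodes(
        id INTEGER PRIMARY KEY,
        vector BLOB,  -- raw float32 bytes
        edges TEXT    -- JSON array of neighbor ids; NULL = not linked yet
    )
""")

def get_node(node_id):
    # Each visit is one B-tree point lookup, which is why this access
    # pattern maps nicely onto SQLite's paged storage.
    vector_blob, edges = db.execute(
        "SELECT vector, edges FROM vec_nodes WHERE id = ?", (node_id,)
    ).fetchone()
    return np.frombuffer(vector_blob, dtype=np.float32), json.loads(edges or "[]")

def greedy_search(query, entry_id, k=10, beam=32):
    # Simplified best-first traversal: expand the closest frontier node
    # until `beam` nodes have been visited, then return the k nearest.
    entry_vec, _ = get_node(entry_id)
    frontier = [(float(np.linalg.norm(entry_vec - query)), entry_id)]
    seen, visited = {entry_id}, []
    while frontier and len(visited) < beam:
        dist, node_id = heapq.heappop(frontier)
        visited.append((dist, node_id))
        for neighbor_id in get_node(node_id)[1]:
            if neighbor_id not in seen:
                seen.add(neighbor_id)
                nvec, _ = get_node(neighbor_id)
                heapq.heappush(
                    frontier, (float(np.linalg.norm(nvec - query)), neighbor_id)
                )
    return sorted(visited)[:k]

# Nodes inserted but not yet wired into the graph can be found (and then
# integrated incrementally) with an ordinary query:
pending = db.execute("SELECT id FROM vec_nodes WHERE edges IS NULL").fetchall()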

You could even use libSQL's implementation as a starting point; you'd just need to un-libSQL-ify it.

@conradev

In addition to Turso using DiskANN, it looks like Microsoft is already using this architecture in production in their Recall product: https://x.com/simonw/status/1798383577421525276

They even store vectors and edges in separate columns, like I suggested!

@asg017 (Owner, Author) commented Oct 31, 2024

@conradev agreed, I think DiskANN will eventually be the way to go!

I'm working on metadata filtering right now (#124), aiming to get it out in the next 2-3 weeks. After that's done, I'll focus on bringing ANN algorithms to sqlite-vec.

Here's my rough plan for this:

  1. Start with an IVF+kmeans index, because (after training) inserts and KNN queries are pretty quick, which should help scale sqlite-vec to at least ~1 million vectors.
  2. Add a proper DiskANN index, which won't require kmeans training.

I'm wary of DiskANN because, in my experience, writes can be pretty slow due to the internal pruning process. I've tried libSQL's implementation, and building the initial index was pretty slow, even compared to vec0 virtual tables (which are slow compared to regular tables). Also, DiskANN implementations can be pretty complex, while IVF will likely be straightforward.

I can imagine a future world where sqlite-vec attempts other ANN indexes (HNSW, SPANN, etc.), but those rely pretty heavily on in-memory data structures and likely won't map well to SQLite shadow tables. And they're super complex!

In terms of timeline: metadata filters will come out mid-November, so I expect to work on ANN indexes November–December. Hopefully I can get some beta builds out during that time, but I don't expect fully-supported ANN inside sqlite-vec until December/January.

In the meantime, if anyone else wants to share benchmark numbers from any papers/implementations of DiskANN or other ANN indexes, feel free to post them in this thread!

@jlarmstrongiv

Recent discussion on SPANN: https://news.ycombinator.com/item?id=42028873

@SeanPedersen

This is the go-to repo for ANN benchmarks: https://github.com/erikbern/ann-benchmarks

@andreasmhahn

+1 for LM-DiskANN
