doc

asg017 · Nov 20, 2024 · 6d90d1f
1 parent 71f1763
commit 6d90d1f

Showing 2 changed files with 238 additions and 309 deletions.
diff --git a/site/features/vec0.md b/site/features/vec0.md
@@ -1,12 +1,240 @@
 # `vec0` Virtual Table
 
-- primary keys
-- vector column definitions
-  - float/bit/int, dimension required
-  - distnace_metric
-  - chunk_size?
-- insert, updates, delete
-- fullscan, point, knn
-- distance and k hidden columns
-- `rowid in (...)`
-- joins/metadata example?
+## Metadata in `vec0` Virtual Tables
+
+There are three ways to store non-vector columns in `vec0` virtual tables:
+metadata columns, partition keys, and auxiliary columns. Each option has its
+own benefits and limitations.
+
+```sql
+create virtual table vec_chunks using vec0(
+  document_id integer primary key,
+  contents_embedding float[768],
+
+  -- partition key column, denoted by 'partition key'
+  user_id integer partition key,
+
+  -- metadata column, appears as normal column definition
+  label text,
+
+  -- auxiliary column, denoted by '+'
+  +contents text
+);
+```
+
+A quick summary of each option:
+
+| Column Type       | Description                                                             | Benefits                                             | Limitations                                                                                                           |
+| ----------------- | ----------------------------------------------------------------------- | ---------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- |
+| Metadata columns  | Stores boolean, integer, floating point, or text data alongside vectors | Can be included in the `WHERE` clause of a KNN query | Slower full scan, slightly inefficient with long strings (`> 12` characters)                                          |
+| Auxiliary columns | Stores any kind of data in a separate internal table                    | Eliminates need for an external `JOIN`               | Cannot appear in the `WHERE` clause of a KNN query                                                                    |
+| Partition keys    | Internally shards the vector index on a given key                       | Makes selective queries much faster                  | Can cause over-sharding and slow KNN if not used carefully. Aim for hundreds of vectors per unique partition key value |
+
+### Metadata Columns {#metadata}
+
+Metadata columns are extra "regular" columns that you can include in a `vec0`
+table definition. These columns will be indexed along with declared vector
+columns, and allow you to include extra `WHERE` constraints during KNN queries.
+
+```sql
+create virtual table vec_movies using vec0(
+  movie_id integer primary key,
+  synopsis_embedding float[1024],
+  genre text,
+  num_reviews int,
+  mean_rating float,
+  contains_violence boolean
+);
+```
+
+In the `vec0` constructor, the `genre`, `num_reviews`, `mean_rating`, and
+`contains_violence` columns are metadata columns, with their specified type.
+
+A sample KNN query on this table could look like:
+
+```sql
+select *
+from vec_movies
+where synopsis_embedding match '[...]'
+  and k = 5
+  and genre = 'scifi'
+  and num_reviews between 100 and 500
+  and mean_rating > 3.5
+  and contains_violence = false;
+```
+
+The first two conditions in the `WHERE` clause (`synopsis_embedding match` and
+`k = 5`) denote that the query is a KNN query. The other conditions are metadata
+constraints that `sqlite-vec` will recognize and apply during the KNN
+calculation. In other words, for the above query, a maximum of 5 rows would be
+returned, and the metadata column values of every returned row would satisfy
+all of the `WHERE` constraints.
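+
+As a rough sketch of how rows with metadata values might be written and then
+queried, the following uses a hypothetical 4-dimensional table,
+`vec_movies_demo`, so that the vectors fit on one line. The table name and the
+sample values are illustrative assumptions, not part of `sqlite-vec` itself:
+
+```sql
+-- Hypothetical small table, for illustration only.
+create virtual table vec_movies_demo using vec0(
+  movie_id integer primary key,
+  synopsis_embedding float[4],
+  genre text,
+  num_reviews integer,
+  mean_rating float,
+  contains_violence boolean
+);
+
+-- Vectors are supplied here as JSON text, booleans as 0 or 1.
+insert into vec_movies_demo(
+  movie_id, synopsis_embedding, genre, num_reviews, mean_rating, contains_violence
+)
+values
+  (1, '[0.1, 0.2, 0.3, 0.4]', 'scifi', 250, 4.1, 0),
+  (2, '[0.4, 0.3, 0.2, 0.1]', 'drama', 90, 3.2, 1);
+
+-- KNN query restricted by two metadata constraints.
+select movie_id, genre, distance
+from vec_movies_demo
+where synopsis_embedding match '[0.1, 0.2, 0.3, 0.4]'
+  and k = 5
+  and genre = 'scifi'
+  and contains_violence = false;
+```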
+
+#### Metadata Column Declaration
+
+Metadata columns are declared in the `vec0` constructor just like regular
+column definitions, with the column name first then the column type.
+
+Only the following column types are supported in metadata columns. All these
+columns are strictly typed.
+
+- `TEXT` for text and strings
+- `INTEGER` for 8-byte integers
+- `FLOAT` for 8-byte floating-point numbers
+- `BOOLEAN` for 1-bit `0` or `1`
+
+Other column types may be supported in the future. Column type names are case
+insensitive.
+
+Additional column constraints like `UNIQUE` or `NOT NULL` are not supported.
+
+A maximum of 16 metadata columns can be declared in a `vec0` virtual table.
+
+#### Supported operations
+
+Metadata column `WHERE` conditions in a KNN query will only work on the
+following operators:
+
+- `=` Equals to
+- `!=` Not equals to
+- `>` Greater than
+- `>=` Greater than or equal to
+- `<` Less than
+- `<=` Less than or equal to
+
+Using any other operator like `IS NULL`, `LIKE`, `GLOB`, `REGEXP`, or any scalar
+function will result in an error or incorrect results.
+
+Boolean columns only support `=` and `!=` operators.
+
+### Partition Key Columns {#partition-keys}
+
+Partition key columns allow you to internally shard a vector index based on a
+given key. Any `=` constraint in a `WHERE` clause on a partition key column will
+be used to pre-filter vectors during a KNN query.
+
+For example, say you're performing vector search on a large dataset of
+documents. However, each document belongs to a user, and users can only search
+their own documents. It would be wasteful to perform a brute-force search over
+all documents if you only care about one user at a time. So, you can partition
+the vector index based on user ID like so:
+
+```sql
+create virtual table vec_documents using vec0(
+  document_id integer primary key,
+  user_id integer partition key,
+  contents_embedding float[1024]
+);
+```
+
+Then during a KNN query, you can constrain results to a specific user in the
+`WHERE` clause like so:
+
+```sql
+select
+  document_id,
+  user_id,
+  distance
+from vec_documents
+where contents_embedding match :query
+  and k = 20
+  and user_id = 123;
+```
+
+`sqlite-vec` will recognize the `user_id = 123` constraint and pre-filter
+vectors during a KNN search. Vectors with the same partition key value are
+stored together, so this is a fast operation.
+
+Another example: say you're performing vector search on a large dataset of news
+headlines from the past 100 years. However, in your application, most users only
+want to search a subset of articles based on when they were written, like "in
+the past ten years" or "during the Obama administration." You can partition
+based on published date like so:
+
+```sql
+create virtual table vec_articles using vec0(
+  article_id integer primary key,
+  published_date text partition key,
+  headline_embedding float[1024]
+);
+```
+
+And a KNN query:
+
+```sql
+select
+  article_id,
+  published_date,
+  distance
+from vec_articles
+where headline_embedding match :query
+  and k = 20
+  and published_date between '2009-01-20' and '2017-01-20'; -- Obama administration
+```
+
+But be careful! Over-using partition key columns can lead to over-sharding and
+slower KNN queries. As a rule of thumb, make sure that every unique partition
+key value has hundreds of vectors associated with it. In the above examples,
+make sure that every user has on the order of dozens or hundreds of documents,
+or that every published date has dozens or hundreds of articles. If they don't
+and you're noticing slow queries, try a broader partition key value, like
+`organization_id` or `published_month`.
+
+A maximum of 4 partition key columns can be declared in a `vec0` virtual table,
+but use caution if you find yourself using more than 1. Vectors are sharded
+along each unique combination, so over-sharding is more common with more
+partition key columns.
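+
+To illustrate the broader-key suggestion above, here is a minimal sketch that
+partitions by month instead of by day. The table name `vec_articles_by_month`
+and the `'YYYY-MM'` text format are assumptions made for this example:
+
+```sql
+-- Hypothetical variant of vec_articles with a coarser partition key.
+create virtual table vec_articles_by_month using vec0(
+  article_id integer primary key,
+  published_month text partition key, -- assumed format: 'YYYY-MM'
+  headline_embedding float[1024]
+);
+
+-- KNN query limited to the shard for a single month.
+select
+  article_id,
+  published_month,
+  distance
+from vec_articles_by_month
+where headline_embedding match :query
+  and k = 20
+  and published_month = '2016-11';
+```
+
+Each shard then holds a month of headlines rather than a single day, which
+makes it more likely that every partition key value has enough vectors.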
+
+### Auxiliary Columns {#aux}
+
+Auxiliary columns store additional unindexed data separate from the internal
+vector index. They are meant for larger metadata that will never appear in a
+`WHERE` clause of a KNN query, eliminating the need for a separate `JOIN`.
+
+Auxiliary columns are denoted by a `+` prefix in their column definition, like
+so:
+
+```sql
+create virtual table vec_chunks using vec0(
+  contents_embedding float[1024],
+  +contents text
+);
+
+select
+  rowid,
+  contents,
+  distance
+from vec_chunks
+where contents_embedding match :query
+  and k = 10;
+```
+
+Here we store the text contents of each chunk in the `contents` auxiliary
+column. When we perform a KNN query, we can reference the `contents` column in
+the `SELECT` clause, to get the raw text contents of the most relevant chunks.
+
+A similar approach can be used for image embeddings:
+
+```sql
+create virtual table vec_image_chunks using vec0(
+  image_embedding float[1024],
+  +image blob
+);
+
+select
+  rowid,
+  image,
+  distance
+from vec_image_chunks
+where image_embedding match :query
+  and k = 10;
+```
+
+Here the `image` auxiliary column can store the raw image file in a large `BLOB`
+column. It can appear in the `SELECT` clause of the KNN query, to get the most
+relevant raw images.
+
+In general, auxiliary columns are good for large text, blobs, URLs, or other
+datatypes that won't be a part of a `WHERE` clause of a KNN query. If your column
+will often appear in a `SELECT` clause but not the `WHERE` clause, then
+auxiliary columns are a good fit.
+
+A maximum of 16 auxiliary columns can be declared in a `vec0` virtual table.
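+
+For comparison, here is a sketch of the external `JOIN` that auxiliary columns
+are meant to replace. It assumes a hypothetical regular table `chunks` that
+holds the text, and a hypothetical `vec0` table `vec_chunks_plain` that stores
+only vectors, with matching rowids between the two:
+
+```sql
+-- Hypothetical layout without auxiliary columns: text lives in a regular table.
+create table chunks(
+  id integer primary key,
+  contents text
+);
+
+create virtual table vec_chunks_plain using vec0(
+  contents_embedding float[1024]
+);
+
+-- Run the KNN query first, then join the matches back to the text.
+with knn_matches as (
+  select
+    rowid,
+    distance
+  from vec_chunks_plain
+  where contents_embedding match :query
+    and k = 10
+)
+select
+  chunks.id,
+  chunks.contents,
+  knn_matches.distance
+from knn_matches
+left join chunks on chunks.id = knn_matches.rowid
+order by knn_matches.distance;
+```
+
+With a `+contents text` auxiliary column, the same result can come from a
+single query against the `vec0` table, without the extra table or the join.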