Support push-down filters on vss indices #2

asg017 · 2023-02-03T20:05:04Z

tl;dr "The Missing WHERE Clause in Vector Search" describes this exact problem

Problem

Say your database looks like this:

create table articles(
  headline text, 
  headline_embedding blob,
  newsroom text,
  published_at date
);

create virtual table vss_articles using vss(
  headline_embedding(384)
);

Most queries go like "show me the 10 closest articles that talk about 'global warming'", which is currently supported and easy:

select rowid
from vss_articles
where vss_search(headline_embedding, embedding_api('global warming'))
limit 10;

Now you have the rowids of the 10 closest articles that talk about "global warming", and you're good to go!

But now consider you want the same 10 closest articles, but with extra filters. Say "show me the 10 closest articles that talk about 'global warming', that published in 2016-2019 from the NY times".

.... well crap. You have a few options, but none will 100% get the correct answer. You could cast a wide net by getting the nearest 1,000 matches then filtering those:

Solution: indexed non-vector columns in `vss0`

create virtual table vss_articles using vss0(
  headline_embeddings(384),
  published_at date,
  newsroom text
);

insert into vss_articles
  select headline_embedding, published_at, newsroom
  from articles;

select rowid
from vss_articles
where vss_search(headline_embeddings, embedding_api('global warming'))
  and published_at between '2016-01-01' and '2019-12-31'
  and newsroom = 'NY Times'
limit 20;

Implementation

We could support non-vector columns in vss0, which get added to the vss_xyz_data shadow table. Then, we can create indices on those columns on the shadow tables like so:

create index idx_vss0_vss_articles_data_published_at 
  on vss_articles_data(published_at);
create index idx_vss0_vss_articles_data_newsroom 
  on vss_articles_data(newsroom);

Then, in xFilter, we determine "candidate vectors" by querying those new data columns using the new indicies. We'll get all the candidate rowids and pass it into SearchParameters->IDSelector, with maybe some perf addtions (bitmap, bloom filter, etc.).

The text was updated successfully, but these errors were encountered:

simonw · 2023-02-10T20:07:56Z

I solved this for datasette-faiss with an aggregate function that builds a single use in-memory FAISS index:

Add an aggregate function for querying a subset of rows simonw/datasette-faiss#3

This works really well if you have the discipline to run a filter to reduce the set of potential matches to ~1,000 before performing the aggregation, since assembling a 1,000 item FAISS index and then running a query against it seems to be pretty fast.

In https://www.pinecone.io/learn/vector-search-filtering/ terms this is pre-filtering.

Not sure if providing something like this would be worthwhile in sqlite-vss though - your "indexed non-vector columns in vss0" idea sounds a lot more robust to me.

turbo · 2023-02-12T13:17:24Z

One thing to keep in mind are hybrid queries and how they're implemented in other VSS capable databases. Taking your example:

select rowid
from vss_articles
where vss_search(headline_embeddings, embedding_api('global warming'))
  and published_at between '2016-01-01' and '2019-12-31'
  and newsroom = 'NY Times'
limit 20;

Here a simple value selection is used newsroom = 'NY Times', however, hybrid queries often rely on full-text functionality. Take this Redis query (using the RediSearch VSS syntax):

FT.SEARCH idx "(@title:Dune @num:[2020 2022])=>[KNN $K @vec $BLOB AS my_scores]" PARAMS 4 BLOB "\x12\xa9\xf5\x6c" K 10 SORTBY my_scores DIALECT 2

Here the 10 closest matches are selected, including a simple scalar pushdown filter (num), but also a FTS filter on field title (i.e. "any records where title contains Dune).

In Redis, FTS and VSS indexes are one and the same, or rather, VSS indexes can index any other record text field. In fact, VSS indexes can flatten multiple sub-fields that contain vectors but let's ignore that for now.

So my point is: whatever design is chosen in the end, it may be a good idea to include the ability to create FTS5 indexes as part of the VSS index as well, so true hybrid queries can be executed. Sans that, you'd have to run a separate FTS5 query first to select candidate values, and then include a subquery in the pushdown filter.

trodrigu · 2023-05-05T17:16:04Z

I really like this project and wanted to pitch in here if possible. I started reading through the code and learning the apis. Before going any further I wanted to reach out to you and check in. Thanks for making such a useful tool!

asg017 · 2023-05-05T19:04:58Z

Hey @trodrigu , thanks for the interest! Would definitely appreciate any help, but this specific PR is going to be a large change and might conflict with some other upcoming features.

But would still love to chat about how you can help - send me an email at alexsebastian [dot] garcia [at] gmail [dot] com and maybe we can setup a call?

In the meantime I'll try to writeup how I think this specific feature should be implemented

trodrigu · 2023-05-09T22:28:55Z

@asg017 No problem, I sent over an email.

asg017 · 2023-05-12T19:27:35Z

Okay I split this issue into two:

Support rowid in (...) constraints in vss_search() KNN queries #19, which is allowing rowid in (...) constraints with vss_search(). This would allow for pre-filtering, would be easy to implement, and is extensible. Downside is it might be slower that other approaches and might be a tad awkward
Storing non-vector column in vss0 virtual tables #20, which allows you to store non-vector columns (ie "metadata") in vss0 virtual tables. This will be much harder to implement and will take more time, but would be a much more natural solution.

In general, both of these would be "pre-filtering" solutions. You could currently do a "post-filter" yourself with the current API (with some SQL and CTEs), but it runs the risk of inaccurate/missing results. I don't know if it'll be possible to do "single-stage filtering" as described in the Pinecone blog post, due to limitations of using Faiss. At best Faiss has an IDSelector which is essentially a callback function for every searched vector that we can use to approve/deny individual vectors in KNN style searches. I think it's technically "pre-filtering" and not a "single-stage filter", but at that point it's mostly just semantics.

Will keep this thread open to discuss other push-down filter approaches, but I think #19 and #20 will be the main solutions for this.

metaskills · 2023-08-25T00:13:35Z

Thanks for this write up. Am I understanding it all correctly that the only way to do this now is to post-filter? For example, I have a product catalog that has ~24 categories. Like this:

SELECT * FROM products
WHERE ROWID IN (
  SELECT ROWID FROM vss_products
  WHERE vss_search(embedding, vss_search_params(:embedding, 500))
)
AND category_name = :category_name
LIMIT :limit

This works for me for now because the query is still very fast.

asg017 · 2023-08-25T16:40:26Z

Yes, that's the current way to do post-filtering, as of now. Just keep in mind that it won't always return the correct number of records,, because the closest 500 may not include all rows where category_name = :category_name.

I will hopefully get #19 in the next release, which will help with easier pre-filtering!

This was referenced May 12, 2023

Support rowid in (...) constraints in vss_search() KNN queries #19

Open

Storing non-vector column in vss0 virtual tables #20

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support push-down filters on vss indices #2

Support push-down filters on vss indices #2

asg017 commented Feb 3, 2023

simonw commented Feb 10, 2023 •

edited

Loading

turbo commented Feb 12, 2023

trodrigu commented May 5, 2023

asg017 commented May 5, 2023

trodrigu commented May 9, 2023

asg017 commented May 12, 2023

metaskills commented Aug 25, 2023 •

edited

Loading

asg017 commented Aug 25, 2023

Support push-down filters on vss indices #2

Support push-down filters on vss indices #2

Comments

asg017 commented Feb 3, 2023

Problem

Solution: indexed non-vector columns in vss0

Implementation

simonw commented Feb 10, 2023 • edited Loading

turbo commented Feb 12, 2023

trodrigu commented May 5, 2023

asg017 commented May 5, 2023

trodrigu commented May 9, 2023

asg017 commented May 12, 2023

metaskills commented Aug 25, 2023 • edited Loading

asg017 commented Aug 25, 2023

Solution: indexed non-vector columns in `vss0`

simonw commented Feb 10, 2023 •

edited

Loading

metaskills commented Aug 25, 2023 •

edited

Loading