-
Notifications
You must be signed in to change notification settings - Fork 137
Add option for index-time filtering to knnPerfTest.py #468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
BenchmarksCohere vectors, 768d, 100K docs, 100K docs, 200K docs, 200K docs, 500K docs, 500K docs, Couple of things to note:
The benefits of index-time filtering seem to improve with higher number of docs + more selective filters |
# Conflicts: # src/python/knnPerfTest.py
mikemccand
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks awesome -- I left some minor polishing type comments!
| FLAT | ||
| } | ||
|
|
||
| enum FilterStrategy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooooh ... love it. I wish Lucene had this enum / query+filter optimizer.
src/main/knn/KnnGraphTester.java
Outdated
| break; | ||
| case "-filterStrategy": | ||
| if (iarg == args.length - 1) { | ||
| throw new IllegalArgumentException("-filterStrategy requires a following pathname"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pathname -> strategy string (query-time-pre-filter, query-time-post-filter, or index-time-filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: Oops sorry, you meant the error message -- done!
src/main/knn/KnnGraphTester.java
Outdated
| case "query-time-pre-filter" -> FilterStrategy.QUERY_TIME_PRE_FILTER; | ||
| case "query-time-post-filter" -> FilterStrategy.QUERY_TIME_POST_FILTER; | ||
| case "index-time-filter" -> FilterStrategy.INDEX_TIME_FILTER; | ||
| default -> throw new IllegalArgumentException("-filterStrategy can be 'query-time-pre-filter' or 'query-time-post-filter' or 'index-time-filter' only"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of can be .... only say must be one of?
And can you include in the error message which (invalid) strategy string the user passed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, added!
mikemccand
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thank you! We somehow need nightly benchy to test filtered KNN with these three strategies... maybe open spinoff issue? Thank @kaivalnp!
|
Thanks @mikemccand, I opened #473 |
|
Hmm nightly benchy is angry: I think this happens if you specify no filter strategy? I'll make a fix -- I think we just need a null check when we print the summary. Also probably need to fix |
Spinoff from apache/lucene#14758
Add an option for index-time filtering of vectors to
knnPerfTest.pyIndex-time filtering means creating a separate HNSW graph for filters known at index time, by passing the same vector to Lucene under a different vector field name. This type of filtering may be beneficial to some users willing to move cost of filtering upfront to indexing (both time and storage).
Right now, Lucene stores copies of the duplicated vectors into different fields on disk -- and we're aiming to reduce this duplication in the linked Lucene issue (feedback welcome)!
KnnGraphTesteralready had a-filterSelectivityand-prefilteroption to simulate pre and post-filtering -- I've modified that a bit to change options to:-filterStrategy(one ofquery-time-pre-filter,query-time-post-filter,index-time-filter) and-filterSelectivity(between 0 and 1, non-inclusive)KnnGraphTesternow expects either both, or none of the above parameters to be passed at the same time -- and does not perform any filtering if these values are not specified. Whenindex-time-filteris used, it adds an additional vector fieldknn-filteredto the index, and uses this smaller field at search-time. The other two options correspond to the-prefilterboolean flag being present (or not) before this PR.I'll add some benchmarks soon!