feat: support multivector type #3190
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3190 +/- ##
==========================================
+ Coverage 78.48% 78.57% +0.08%
==========================================
Files 245 245
Lines 84998 85335 +337
Branches 84998 85335 +337
==========================================
+ Hits 66707 67048 +341
+ Misses 15473 15453 -20
- Partials 2818 2834 +16
55b0e08 to a171aa6
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
- let mut knn_node = if q.refine_factor.is_some() {
+ let mut knn_node = if q.refine_factor.is_some() || is_multivec {
For multivector columns, refine is always required.
I don't follow. Why is that?
This just follows the algorithm described in the ColBERT paper: refine is required for calculating the MaxSim distance. Without refine, the search just finds the nearest chunks without considering the MaxSim metric.
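To make the MaxSim step concrete, here is a minimal sketch of the score that a refine pass would recompute over whole rows. It assumes dot product as the per-vector similarity and flat `f32` buffers; the names `dot` and `maxsim` are hypothetical and are not taken from this PR.

```rust
/// Dot product of two equal-length vectors (assumed similarity measure).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// ColBERT-style MaxSim: for each query vector, take the best similarity
/// against any document vector, then sum those maxima.
/// `query` and `doc` are flat buffers holding vectors of length `dim`.
fn maxsim(query: &[f32], doc: &[f32], dim: usize) -> f32 {
    assert!(dim > 0 && query.len() % dim == 0 && doc.len() % dim == 0);
    query
        .chunks_exact(dim)
        .map(|q| {
            doc.chunks_exact(dim)
                .map(|d| dot(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```

An index lookup on individual vectors only sees one (query vector, document vector) pair at a time, so a score like this has to be computed afterwards over every vector of each candidate row.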
let (_, element_type) = get_vector_type(self.dataset.schema(), column)?;
let dim = get_vector_dim(self.dataset.schema(), column)?;
// make sure the query is valid
if q.len() % dim != 0 {
For vectors, q.len() == dim.
For multivectors, q.len() % dim == 0.
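As a sketch of what that check implies, the flat query can be validated and split into its individual vectors as below. The helper name `split_query` and the `String` error type are assumptions for illustration, not the PR's actual code.

```rust
/// Split a flat query buffer into vectors of length `dim`.
/// A single-vector query yields one slice; a multivector query yields several.
/// Returns an error if the buffer is empty or its length is not a multiple of `dim`.
fn split_query(q: &[f32], dim: usize) -> Result<Vec<&[f32]>, String> {
    if dim == 0 || q.is_empty() || q.len() % dim != 0 {
        return Err(format!(
            "query length {} is not a positive multiple of dimension {}",
            q.len(),
            dim
        ));
    }
    Ok(q.chunks_exact(dim).collect())
}
```

With `dim = 2`, a buffer of four values splits into two vectors, while a buffer of three values is rejected.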
If q is a multivector as well, shouldn't it be a FixedSizeList?
I noticed our async API already supports querying with a batch of vectors, so passing a list of vectors as the query is reserved for that.
Querying with a multivector is therefore done by passing a flattened array.
Let's not design the API like this. For users, it is confusing and surprising that they need to pass a flattened array.
Can we detect this route when the query is a multivector and the column is list<fsl>?
Unfortunately, we can't. For a multivector column, the query can contain any number of vectors. Say query = [vec1, vec2, ..., vec_n];
it can mean either:
- n queries, each with single vector
- single query with n vectors
related to lancedb/lancedb#1838
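The ambiguity can be illustrated with a small sketch: the same nested input produces different flat queries depending on which interpretation is chosen, so the shape alone cannot decide. The function `interpret` and its `as_multivector` flag are hypothetical and exist only for illustration.

```rust
/// Turn a list of vectors into the flat queries that would be executed.
/// - as_multivector == false: n queries, each with a single vector
/// - as_multivector == true: one query whose n vectors are flattened together
fn interpret(vectors: &[Vec<f32>], as_multivector: bool) -> Vec<Vec<f32>> {
    if as_multivector {
        // Concatenate all vectors into one flat buffer of length n * dim.
        vec![vectors.iter().flatten().copied().collect()]
    } else {
        // Keep each vector as its own query.
        vectors.to_vec()
    }
}
```

Both branches accept exactly the same input, which is why some explicit signal from the caller (here a flag, in the discussion above a flattened array) is needed.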