feat: support multivector type #3190

Open: wants to merge 21 commits into base: main

Conversation

BubbleCal
Contributor

@BubbleCal BubbleCal commented Dec 2, 2024

github-actions bot added the enhancement label Dec 2, 2024
codecov-commenter commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 79.27461% with 80 lines in your changes missing coverage. Please review.

Project coverage is 78.57%. Comparing base (7ec23f0) to head (77cab91).
Report is 9 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/scanner.rs | 63.51% | 14 Missing and 13 partials ⚠️ |
| rust/lance/src/index/vector/utils.rs | 56.60% | 21 Missing and 2 partials ⚠️ |
| rust/lance-index/src/vector/transform.rs | 74.35% | 8 Missing and 2 partials ⚠️ |
| rust/lance-index/src/vector/flat.rs | 52.94% | 5 Missing and 3 partials ⚠️ |
| rust/lance-linalg/src/distance.rs | 94.00% | 3 Missing ⚠️ |
| rust/lance/src/index/vector/ivf.rs | 0.00% | 0 Missing and 2 partials ⚠️ |
| rust/lance/src/io/exec/knn.rs | 0.00% | 0 Missing and 2 partials ⚠️ |
| rust/lance-index/src/vector/flat/index.rs | 50.00% | 1 Missing ⚠️ |
| rust/lance/src/index.rs | 50.00% | 0 Missing and 1 partial ⚠️ |
| rust/lance/src/index/vector/builder.rs | 0.00% | 0 Missing and 1 partial ⚠️ |

... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3190      +/-   ##
==========================================
+ Coverage   78.48%   78.57%   +0.08%     
==========================================
  Files         245      245              
  Lines       84998    85335     +337     
  Branches    84998    85335     +337     
==========================================
+ Hits        66707    67048     +341     
+ Misses      15473    15453      -20     
- Partials     2818     2834      +16     
Flag Coverage Δ
unittests 78.57% <79.27%> (+0.08%) ⬆️


Signed-off-by: BubbleCal <bubble-cal@outlook.com>

- let mut knn_node = if q.refine_factor.is_some() {
+ let mut knn_node = if q.refine_factor.is_some() || is_multivec {
Contributor Author

For multivectors, refine is always required.

Contributor

I don't follow. Why is that?

Contributor Author

This just follows the algorithm described in the ColBERT paper: refine is required to calculate the MaxSim distance. Without refine, the search only finds the nearest chunks and never applies the MaxSim metric.
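
To illustrate the point above, here is a minimal sketch of MaxSim scoring and a refine step that reranks candidates with it. It is not the code in this PR: the names `dot`, `maxsim`, and `refine` are hypothetical, dot product is assumed as the similarity, and the candidate set is assumed to come from an earlier nearest-chunk search.

```rust
/// Dot-product similarity between two vectors of equal length (assumed metric).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// MaxSim: for each query vector, take its best match among the document's
/// vectors, then sum those maxima. An empty document yields negative infinity
/// for a non-empty query.
fn maxsim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| dot(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

/// Hypothetical refine step: rescore candidate rows (id, multivector) with
/// MaxSim and keep the top k, highest score first.
fn refine(
    query: &[Vec<f32>],
    candidates: &[(u64, Vec<Vec<f32>>)],
    k: usize,
) -> Vec<(u64, f32)> {
    let mut scored: Vec<(u64, f32)> = candidates
        .iter()
        .map(|(id, doc)| (*id, maxsim(query, doc)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A row can contain the single closest chunk to one query vector and still have a low MaxSim score overall, which is why the candidates have to be rescored against the whole query.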

@BubbleCal BubbleCal marked this pull request as ready for review December 16, 2024 08:38
@BubbleCal BubbleCal requested a review from westonpace December 16, 2024 08:39
let (_, element_type) = get_vector_type(self.dataset.schema(), column)?;
let dim = get_vector_dim(self.dataset.schema(), column)?;
// make sure the query is valid
if q.len() % dim != 0 {
Contributor Author

For vectors, q.len() == dim.
For multivectors, q.len() % dim == 0.
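
The two rules can be sketched as a standalone check. This is an illustration only: `query_vector_count` and its `is_multivec` parameter are hypothetical names, and rejecting an empty query or a zero dimension is an added assumption that the snippet under review does not show.

```rust
/// Validates a flat query against the column dimension and returns how many
/// vectors it contains. Hypothetical helper, not part of the Lance API.
fn query_vector_count(
    query_len: usize,
    dim: usize,
    is_multivec: bool,
) -> Result<usize, String> {
    // Assumption: guard against a zero dimension (modulo by zero would panic)
    // and against an empty query.
    if dim == 0 {
        return Err("dimension must be non-zero".to_string());
    }
    if query_len == 0 {
        return Err("query must not be empty".to_string());
    }
    if is_multivec {
        // Multivector: the flat query holds a whole number of vectors.
        if query_len % dim != 0 {
            return Err(format!(
                "query length {query_len} is not a multiple of dimension {dim}"
            ));
        }
        Ok(query_len / dim)
    } else if query_len != dim {
        // Single vector: the length must match the dimension exactly.
        Err(format!(
            "query length {query_len} does not match dimension {dim}"
        ))
    } else {
        Ok(1)
    }
}
```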

Contributor

If q is a multivector as well, shouldn't it be a FixedSizeList?

Contributor Author

I noticed our async API already supports querying with a batch of vectors, so passing a list of vectors as the query is now reserved for that.
So a multivector query is made by passing a flattened array.
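
As a rough sketch of the flattened form described here, the helpers below pack a multivector into one flat array and recover the individual vectors from it. `flatten` and `split_query` are hypothetical names used for illustration, not functions from the Lance codebase.

```rust
/// Concatenates the vectors of a multivector query into one flat array.
fn flatten(vectors: &[Vec<f32>]) -> Vec<f32> {
    vectors.iter().flatten().copied().collect()
}

/// Splits a flat query back into `dim`-sized vectors.
/// `chunks_exact` panics if `dim` is zero and silently drops a trailing
/// remainder, so the caller should validate the length first.
fn split_query(flat: &[f32], dim: usize) -> Vec<&[f32]> {
    flat.chunks_exact(dim).collect()
}
```

With `dim = 2`, the multivector `[[1.0, 2.0], [3.0, 4.0]]` becomes the flat query `[1.0, 2.0, 3.0, 4.0]`, and splitting it yields the two original vectors.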

Contributor

Let's not design the API like this. It is confusing and surprising for users that they need to pass a flattened array.

Can we detect this case and route to the multivector query when the query is a multivector and the column is list<fsl>?

Contributor Author

Unfortunately, we can't. For a multivector column, the query can contain any number of vectors. For example, query = [vec1, vec2, ..., vec_n] can mean either:

  1. n queries, each with single vector
  2. single query with n vectors
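
The ambiguity can be made concrete with a small sketch. The `Query` enum below is purely hypothetical and is not a proposal from this thread or part of the Lance API; it only shows that the same list of vectors produces a different number of result sets depending on which reading is intended.

```rust
/// The two possible readings of the same list of vectors (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Query {
    /// n independent queries, each with a single vector.
    Batch(Vec<Vec<f32>>),
    /// One query made of n vectors, scored as a whole.
    MultiVector(Vec<Vec<f32>>),
}

impl Query {
    /// Number of result sets the search would return under each reading.
    fn result_sets(&self) -> usize {
        match self {
            Query::Batch(vectors) => vectors.len(),
            Query::MultiVector(_) => 1,
        }
    }
}
```

The shape of the input alone cannot tell the two variants apart, so the intent has to be carried by something other than the data, such as an explicit type or a different input encoding.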

Labels: enhancement, feature, python
3 participants