-
Notifications
You must be signed in to change notification settings - Fork 538
feat(java): support building vector index distributively #5664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2f84c6d to
49b415e
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
3bef5b6 to
ee05fa4
Compare
ee05fa4 to
9e2273f
Compare
22cba7b to
0007d9e
Compare
0007d9e to
8735390
Compare
yanghua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments.
rust/lance/src/dataset.rs
Outdated
| let store = LanceIndexStore::from_dataset_for_new(self, index_uuid)?; | ||
| let index_dir = self.indices_dir().child(index_uuid); | ||
| log::info!( | ||
| "merge_index_metadata called with index_type={})", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to add some logging, make it more readable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry.
This is just copied from original python dataset.rs. I agree to delete this info. WDYT
java/lance-jni/src/vector_trainer.rs
Outdated
| // descriptive error if the column is invalid. | ||
| let _dim = get_vector_dim(dataset.schema(), &column)?; | ||
|
|
||
| // Drop dataset_guard at the end of this block before reusing env. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
useless?
PTAL when you have time @yanghua |
yanghua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just two comments, otherwise LGTM.
| import static org.junit.jupiter.api.Assertions.assertFalse; | ||
| import static org.junit.jupiter.api.Assertions.assertTrue; | ||
|
|
||
| public class VectorIndexTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can compare the recall performance between a single-machine built index and a distributedly built index? Just like Python. Currently, all the test cases do not measure the metrics for the recall.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could verify the accuracy of the recall implementation just once — since both Java and Python call the same underlying rust code. I‘d like to keep Java tests more functional-only
What do you think? @yanghua
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since both Java and Python call the same underlying rust code
Maybe we can choose one case to verify? Although the underlying code is the same, the upper layer still needs to do some data type change or parameters pass. They may cause some issues?
yanghua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have discussed with @majin1102. And we have agreed that we can add more detailed verification about recall in the following PR. So, LGTM about this PR!
…t#5664) Note: 1. Reposition merge_index_metadata so that Java dataset could directly call mergeIndexMetadata for vector indexes. 2. Add `centroids` to IvfBuilderParams and `codebook` to PqBuilderParams to make distrubitively creating vector indexes work 3. Add a new index package under tests to have index tests. 4. This PR update Cargo.lock due to the intruduction of lance-format#5650.
Note:
centroidsto IvfBuilderParams andcodebookto PqBuilderParams to make distrubitively creating vector indexes work