@majin1102
Contributor

@majin1102 majin1102 commented Jan 8, 2026

Note:

  1. Reposition merge_index_metadata so that the Java dataset can directly call mergeIndexMetadata for vector indexes.
  2. Add centroids to IvfBuilderParams and codebook to PqBuilderParams so that vector indexes can be created distributively.
  3. Add a new index package under tests to hold the index tests.
  4. This PR updates Cargo.lock due to the introduction of "perf: improve FTS indexing perf and reduce memory footprint" (#5650).
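
For readers new to distributed index builds, here is a rough sketch of why item 2 matters: if every worker receives the same pretrained centroids, each one can assign its own shard of vectors to IVF partitions independently, and the per-worker results can then be merged partition by partition. This is a toy illustration only. The class and method names (`SharedCentroidsSketch`, `assignShard`, `mergePartitions`) are hypothetical and are not part of the Lance API, and the real merge works on index files and metadata rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names are Lance APIs.
final class SharedCentroidsSketch {

  // Squared Euclidean distance between two vectors of equal length.
  static double squaredDistance(float[] a, float[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Index of the centroid nearest to the vector (ties go to the lowest index).
  static int nearestCentroid(float[] vector, float[][] centroids) {
    int best = 0;
    double bestDistance = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double distance = squaredDistance(vector, centroids[c]);
      if (distance < bestDistance) {
        bestDistance = distance;
        best = c;
      }
    }
    return best;
  }

  // What one worker does for its shard: bucket row ids by nearest centroid.
  static List<List<Integer>> assignShard(float[][] shard, int firstRowId, float[][] centroids) {
    List<List<Integer>> partitions = new ArrayList<>();
    for (int c = 0; c < centroids.length; c++) {
      partitions.add(new ArrayList<>());
    }
    for (int i = 0; i < shard.length; i++) {
      partitions.get(nearestCentroid(shard[i], centroids)).add(firstRowId + i);
    }
    return partitions;
  }

  // Merge step: concatenate the per-partition lists produced by each worker.
  static List<List<Integer>> mergePartitions(
      List<List<List<Integer>>> perWorker, int numPartitions) {
    List<List<Integer>> merged = new ArrayList<>();
    for (int c = 0; c < numPartitions; c++) {
      merged.add(new ArrayList<>());
    }
    for (List<List<Integer>> worker : perWorker) {
      for (int c = 0; c < numPartitions; c++) {
        merged.get(c).addAll(worker.get(c));
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    // The same centroids are handed to both workers.
    float[][] centroids = {{0f, 0f}, {10f, 10f}};
    List<List<Integer>> a = assignShard(new float[][] {{1f, 1f}, {9f, 9f}}, 0, centroids);
    List<List<Integer>> b = assignShard(new float[][] {{0f, 2f}, {11f, 10f}}, 2, centroids);
    System.out.println(mergePartitions(List.of(a, b), centroids.length));
  }
}
```

If each worker trained its own centroids instead, partition `c` on one worker would not correspond to partition `c` on another, and a merge like the one above would be meaningless.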

@github-actions github-actions bot added the enhancement, python, and java labels Jan 8, 2026
@majin1102 majin1102 marked this pull request as draft January 8, 2026 16:45
@codecov

codecov bot commented Jan 9, 2026

Codecov Report

❌ Patch coverage is 0% with 48 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset.rs | 0.00% | 37 Missing ⚠️ |
| rust/lance-index/src/lib.rs | 0.00% | 2 Missing and 7 partials ⚠️ |
| rust/lance/src/index/vector/ivf.rs | 0.00% | 2 Missing ⚠️ |

@majin1102 majin1102 changed the title from "feat(java): support merge metadata of vector index" to "feat(java): support building vector index distributively" Jan 9, 2026
@majin1102 majin1102 assigned majin1102 and unassigned majin1102 Jan 11, 2026
@majin1102 majin1102 marked this pull request as ready for review January 11, 2026 04:07
@majin1102 majin1102 requested a review from yanghua January 12, 2026 08:44
Collaborator

@yanghua yanghua left a comment

Left some comments.

let store = LanceIndexStore::from_dataset_for_new(self, index_uuid)?;
let index_dir = self.indices_dir().child(index_uuid);
log::info!(
"merge_index_metadata called with index_type={})",
Collaborator

If we want to add some logging, make it more readable?

Contributor Author

Sorry.
This was just copied from the original Python dataset.rs. I agree we should delete this info log. WDYT?

// descriptive error if the column is invalid.
let _dim = get_vector_dim(dataset.schema(), &column)?;

// Drop dataset_guard at the end of this block before reusing env.
Collaborator

useless?

@majin1102
Contributor Author

majin1102 commented Jan 21, 2026

  1. Removed the unnecessary info logging code
  2. Removed the unnecessary trainSq

PTAL when you have time @yanghua

Collaborator

@yanghua yanghua left a comment

Just two comments, otherwise LGTM.

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

public class VectorIndexTest {
Collaborator

Maybe we can compare the recall of an index built on a single machine with that of an index built distributively, just like the Python tests do? Currently, none of the test cases measure recall.
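
A recall comparison of this kind could be sketched roughly as follows. This is only an illustration of the metric, not code from this PR: `RecallSketch` and `recallAtK` are hypothetical names, and the row id lists are made-up stand-ins for the results of an exact search and of searches against the two indexes.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of the Lance Java API or of this PR.
final class RecallSketch {

  // Fraction of the exact top-k row ids that also appear in the candidate top-k.
  static double recallAtK(List<Long> groundTruth, List<Long> candidates, int k) {
    if (k <= 0) {
      throw new IllegalArgumentException("k must be positive");
    }
    Set<Long> truth = new HashSet<>(groundTruth.subList(0, Math.min(k, groundTruth.size())));
    if (truth.isEmpty()) {
      throw new IllegalArgumentException("groundTruth must not be empty");
    }
    Set<Long> found = new HashSet<>(candidates.subList(0, Math.min(k, candidates.size())));
    found.retainAll(truth); // keep only the ids that the exact search also returned
    return (double) found.size() / truth.size();
  }

  public static void main(String[] args) {
    // Made-up row ids for illustration only.
    List<Long> exact = List.of(1L, 2L, 3L, 4L);
    List<Long> singleMachine = List.of(1L, 2L, 3L, 9L);
    List<Long> distributed = List.of(1L, 2L, 8L, 4L);
    System.out.println("single-machine recall@4 = " + recallAtK(exact, singleMachine, 4));
    System.out.println("distributed recall@4    = " + recallAtK(exact, distributed, 4));
  }
}
```

A test built on this idea would assert that the two recall values are within some tolerance of each other, with the tolerance chosen by the test author.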

Contributor Author

I think we could verify the accuracy of the recall implementation just once, since both Java and Python call the same underlying Rust code. I'd like to keep the Java tests functional-only.

What do you think? @yanghua

Collaborator

> since both Java and Python call the same underlying rust code

Maybe we can choose one case to verify? Although the underlying code is the same, the upper layer still has to do some data type conversion and parameter passing, and those could cause issues.

Collaborator

@yanghua yanghua left a comment

I have discussed this with @majin1102, and we agreed to add more detailed recall verification in a follow-up PR. So, LGTM for this PR!

@yanghua yanghua merged commit ae8add8 into lance-format:main Jan 21, 2026
41 of 42 checks passed
majin1102 added a commit to majin1102/lance that referenced this pull request Jan 23, 2026
…t#5664)

Note:
1. Reposition merge_index_metadata so that the Java dataset can directly
call mergeIndexMetadata for vector indexes.
2. Add `centroids` to IvfBuilderParams and `codebook` to PqBuilderParams
so that vector indexes can be created distributively.
3. Add a new index package under tests to hold the index tests.
4. This PR updates Cargo.lock due to the introduction of lance-format#5650.