feat: add CreateIndex commit type to python API #2883
Merged
8 commits
f5dba08 add CreateInde commit type to python (jiachengdb)
7353bc8 black format (jiachengdb)
7a03e98 add license (jiachengdb)
d621da7 add fragment bitmap (jiachengdb)
454d0d4 cargo fmt (jiachengdb)
2cbacc0 change to set (jiachengdb)
6ace748 update test (jiachengdb)
0a3ae9f Fix lint (jiachengdb)
@@ -0,0 +1,86 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The Lance Authors

import random
import shutil
import string
from pathlib import Path

import lance
import numpy as np
import pyarrow as pa
import pytest


@pytest.fixture()
def test_table():
    num_rows = 1000
    price = np.random.rand(num_rows) * 100

    def gen_str(n, split="", char_set=string.ascii_letters + string.digits):
        return "".join(random.choices(char_set, k=n))

    meta = np.array([gen_str(100) for _ in range(num_rows)])
    doc = [gen_str(10, " ", string.ascii_letters) for _ in range(num_rows)]
    tbl = pa.Table.from_arrays(
        [
            pa.array(price),
            pa.array(meta),
            pa.array(doc, pa.large_string()),
            pa.array(range(num_rows)),
        ],
        names=["price", "meta", "doc", "id"],
    )
    return tbl


@pytest.fixture()
def dataset_with_index(test_table, tmp_path):
    dataset = lance.write_dataset(test_table, tmp_path)
    dataset.create_scalar_index("meta", index_type="BTREE")
    return dataset


def test_commit_index(dataset_with_index, test_table, tmp_path):
    index_id = dataset_with_index.list_indices()[0]["uuid"]

    # Create a new dataset without index
    dataset_without_index = lance.write_dataset(
        test_table, tmp_path / "dataset_without_index"
    )

    # Copy the index from dataset_with_index to dataset_without_index
    src_index_dir = Path(dataset_with_index.uri) / "_indices" / index_id
    dest_index_dir = Path(dataset_without_index.uri) / "_indices" / index_id
    shutil.copytree(src_index_dir, dest_index_dir)

    # Commit the index to dataset_without_index
    field_idx = dataset_without_index.schema.get_field_index("meta")
    create_index_op = lance.LanceOperation.CreateIndex(
        index_id,
        "meta_idx",
        [field_idx],
        dataset_without_index.version,
        set([f.fragment_id for f in dataset_without_index.get_fragments()]),
    )
    dataset_without_index = lance.LanceDataset.commit(
        dataset_without_index.uri,
        create_index_op,
        read_version=dataset_without_index.version,
    )

    # Verify that both datasets have the index
    assert len(dataset_with_index.list_indices()) == 1
    assert len(dataset_without_index.list_indices()) == 1

    assert (
        dataset_without_index.list_indices()[0] == dataset_with_index.list_indices()[0]
    )

    # Check if the index is used in scans
    for dataset in [dataset_with_index, dataset_without_index]:
        scanner = dataset.scanner(
            fast_search=True, prefilter=True, filter="meta = 'hello'"
        )
        plan = scanner.explain_plan()
        assert "MaterializeIndex" in plan
I'm kind of shocked the index is used if you don't pass down the fragment bitmap. 🤔
This only works because there is a fallback that scans the full index to recalculate this. It's much preferable to be able to pass down `fragment_bitmap`, as this scan could be slow for large indices.
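To make the trade-off concrete, here is a rough pure-Python sketch of the two paths: using a caller-supplied set of fragment IDs versus recomputing it from everything the index covers. It does not touch Lance internals. The function names, the `FRAGMENT_SHIFT` constant, and the assumption that a row address carries its fragment ID in the upper 32 bits are illustrative only and are not taken from this PR.

```python
from typing import Iterable, Optional, Set

# Assumption for illustration: a row address packs the fragment ID into the
# upper 32 bits and the row offset into the lower 32 bits.
FRAGMENT_SHIFT = 32


def fragment_of(row_address: int) -> int:
    """Extract the (assumed) fragment ID from a packed row address."""
    return row_address >> FRAGMENT_SHIFT


def resolve_fragment_bitmap(
    indexed_row_addresses: Iterable[int],
    fragment_bitmap: Optional[Set[int]] = None,
) -> Set[int]:
    """Return the set of fragment IDs an index covers (hypothetical helper).

    If the caller passes fragment_bitmap, it is used directly. Otherwise the
    set is recomputed by walking every indexed row address, which costs time
    proportional to the size of the index.
    """
    if fragment_bitmap is not None:
        return set(fragment_bitmap)
    # Fallback: full scan over the index contents.
    return {fragment_of(addr) for addr in indexed_row_addresses}


# Three fragments with two rows each.
addresses = [
    (frag << FRAGMENT_SHIFT) | offset for frag in range(3) for offset in range(2)
]

explicit = resolve_fragment_bitmap(addresses, fragment_bitmap={0, 1, 2})
recomputed = resolve_fragment_bitmap(addresses)
```

Both calls produce the same set here, but only the second one has to iterate over every indexed row, which is the cost the comment above warns about for large indices.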