Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add CreateIndex commit type to python API #2883

Merged
merged 8 commits into from
Sep 16, 2024

Conversation

jiachengdb
Copy link
Contributor

No description provided.

@github-actions github-actions bot added enhancement New feature or request python labels Sep 13, 2024
Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. I appreciate the well-thought-out test.

However, we should make sure to handle fragment_bitmap. It plays an important role in the query path and maintenance of the index.

name,
fields,
dataset_version,
fragment_bitmap: None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fragment_bitmap is essential for the maintenance of indices. We shouldn't leave it blank. You can set it to be a set() in Python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I added fragment_ids to the python api.

I guess in normal case, we should just pass the fragment_ids of all fragments to it for index built over a static dataset?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess in normal case, we should just pass the fragment_ids of all fragments to it for index built over a static dataset?

Yes that's accurate. There are other cases, such as when another process has appended during the index operation, or the index operation was only performed on a subset of data.

Comment on lines +74 to +86
# Check if the index is used in scans
for dataset in [dataset_with_index, dataset_without_index]:
scanner = dataset.scanner(
fast_search=True, prefilter=True, filter="meta = 'hello'"
)
plan = scanner.explain_plan()
assert "MaterializeIndex" in plan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kind of shocked it is if you don't pass down the fragment bitmap. 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only works because there is a fallback that scans the full index to recalculate this. It's much preferable to be able to pass down fragment_bitmap, as this scan could be slow for large indices.

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.89%. Comparing base (c0e1f15) to head (8d48a6d).
Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2883      +/-   ##
==========================================
- Coverage   78.12%   77.89%   -0.24%     
==========================================
  Files         229      231       +2     
  Lines       70492    70523      +31     
  Branches    70492    70523      +31     
==========================================
- Hits        55073    54932     -141     
- Misses      12305    12459     +154     
- Partials     3114     3132      +18     
Flag Coverage Δ
unittests 77.89% <ø> (-0.24%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the fragment ids. Could we make that a set? Once that's done I think this is good to go.

name: str
fields: List[int]
dataset_version: int
fragment_ids: List[int]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer set here.

Suggested change
fragment_ids: List[int]
fragment_ids: Set[int]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

name: String,
fields: Vec<i32>,
dataset_version: u64,
fragment_ids: Vec<u32>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fragment_ids: Vec<u32>,
fragment_ids: &PySet,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks!

@wjones127 wjones127 merged commit 4ce680b into lancedb:main Sep 16, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants