Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: start recording index details in the mainifest, cache index type lookup #3131

Merged
merged 2 commits into from
Nov 16, 2024

Conversation

westonpace
Copy link
Contributor

This addresses a specific problem. When a dataset had a scalar index on a string column we would perform I/O during the planning phase on every query that contained a filter. This added considerably latency (especially against S3) to query times.

We now cache that lookup.

It also starts to tackle a more central problem as well. Right now we our manifest stores very little information about indices (pretty much just the UUID). Any further information must be obtained by loading the index. This PR introduces the concept of "index details" which is a spot that an index can put index-specific (e.g. specific to btree or specific to bitmap) information that can be accessed during planning (by just looking at the manifest). At the moment this concept is still fairly bare bones but I think, as scalar indices become more sophisticated, this information can be useful.

If we decide we don't want it then I can pull it out as well and dial this PR back to just the caching component.

…pecific metadata that

can be used at planning time to determine if an index should be applied (without paying the
I/O cost to load the index).
@github-actions github-actions bot added enhancement New feature or request python labels Nov 15, 2024
@wjones127
Copy link
Contributor

We had discussed earlier some similar index changes proposed here:

lancedb/lancedb#1666

It looks like this is a good step in that direction by adding the index_config / index_details field 👍

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 66.05505% with 37 lines in your changes missing coverage. Please review.

Project coverage is 77.90%. Comparing base (f257489) to head (e481fd4).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance/src/index/scalar.rs 67.34% 12 Missing and 4 partials ⚠️
rust/lance/src/index/cache.rs 9.09% 10 Missing ⚠️
rust/lance/src/index.rs 59.09% 4 Missing and 5 partials ⚠️
rust/lance/src/dataset/scanner.rs 95.00% 1 Missing ⚠️
rust/lance/src/io/commit.rs 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3131      +/-   ##
==========================================
- Coverage   77.91%   77.90%   -0.01%     
==========================================
  Files         240      240              
  Lines       81564    81459     -105     
  Branches    81564    81459     -105     
==========================================
- Hits        63550    63464      -86     
- Misses      14806    14815       +9     
+ Partials     3208     3180      -28     
Flag Coverage Δ
unittests 77.90% <66.05%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@westonpace westonpace merged commit a212395 into lancedb:main Nov 16, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants