-
Notifications
You must be signed in to change notification settings - Fork 219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: allow fragment scan for nearest query if there is a prefilter #2631
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2631 +/- ##
==========================================
+ Coverage 79.13% 79.15% +0.01%
==========================================
Files 218 218
Lines 63615 63617 +2
Branches 63615 63617 +2
==========================================
+ Hits 50344 50354 +10
+ Misses 10324 10322 -2
+ Partials 2947 2941 -6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work today for filters on indexed columns. For example, if the filter is my_uuid == 17
or my_uuid != 17
and there is a scalar index on my_uuid
. In this case the prefilter will be an allow list or block list with one row id.
However, we should be able to enhance the indexed prefilter logic to apply an allow list or block list based on excluded fragments.
Got it. Is there an easy way to extend the validation logic in this PR to only allow fragment scan if the leaf is a FileScan? |
ba65a81
to
ab62569
Compare
@westonpace I added another check on the index scan to disallow the case you mentioned. PTAL. |
rust/lance/src/dataset/scanner.rs
Outdated
// Index scan doesn't honor the fragment allowlist today. | ||
// TODO: we could filter the index scan results to only include the allowed fragments. | ||
self.ensure_not_fragment_scan()?; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this to this spot (it's line 1539 on main right now I think):
(Some(index_query), None, true) => {
// The filter is completely satisfied by the index. We
// only need to search the index to determine the valid row
// ids.
let index_query = Arc::new(ScalarIndexExec::new(
self.dataset.clone(),
index_query.clone(),
));
PreFilterSource::ScalarIndexQuery(index_query)
}
The only node that doesn't handle a fragment scan is the ScalarIndexQuery
node. So right here is a bit too restrictive. If we do that then we're good to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to line 1539. PTAL.
ab62569
to
44eb5e4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bearing with me.
I'm going to merge this but if you find time can you add a PR with a python unit test covering this case? |
Let me find time to add the test. |
|
We want to create a query plan that first runs prefiltering on selected fragments.
Then for rows passing the prefilter, we probe the ANN index with
filter_row_ids
.In this plan, the returned result rows will only come from the selected fragments, so there will be no violation to the query semantics.